Files and Metadata: Ingress

Overview

This guide walks you through the process of sharing your research data through the MC² Center. We understand that data sharing can seem complex, but we're here to help at every step. Our goal is to provide valuable research data that is properly organized, well-documented, and accessible to the scientific community in a way that aligns with FAIR principles (Findable, Accessible, Interoperable, and Reusable).

The process typically takes 2-4 weeks, depending on the complexity of your data and any access requirements. Throughout this time, we'll work together to:

Organize your data files effectively
Document your data with metadata (descriptive information that helps others find and understand your data)
Confirm appropriate access controls are in place
Make your data discoverable through the Cancer Complexity Knowledge Portal

Before You Begin

You'll need:

A Synapse account with certification
Access to your grant-specific project
Python with synapseclient installed (for file uploads)

Our team will provide:

Access to metadata templates
Guidance on data organization
Support throughout the process

Instructions

Identify Files to Share

🔷 Your Role: Review and gather your data files. Consider what other researchers would need to understand and reproduce your work.

Common data types include:

Raw/minimally processed data:

FASTQ files from sequencing
TIFF/OME-TIFF images
Raw instrument outputs

Processed data:

BAM files
BigWig files
Registered/segmented images

Analysis outputs:

Expression matrices
Cell-gene count matrices
Peaks files
Segmentation masks
Feature counts

Supporting documentation:

Instrument configuration files
Cell type atlases
Collection parameters
Protocol documentation

Remember: Including supporting files helps others understand your experimental setup and analysis pipeline.

Confirm that all relevant file types have been correctly identified and that the files are organized by type.
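
As a rough aid for this check, the sketch below groups the files under a local folder by category using only the Python standard library. The suffix-to-category mapping, the category names, and the function names are illustrative assumptions rather than MC² Center requirements; adjust them to your own data types.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file suffix to the categories described above.
# It is intentionally incomplete; extend it for your own data types.
CATEGORY_BY_SUFFIX = {
    ".fastq": "raw",
    ".fastq.gz": "raw",
    ".tiff": "raw",
    ".ome.tiff": "raw",
    ".bam": "processed",
    ".bw": "processed",
    ".bigwig": "processed",
}


def categorize(path: Path) -> str:
    """Return the category for a file, or "uncategorized" if the suffix is unknown."""
    name = path.name.lower()
    # Try longer suffixes first so ".ome.tiff" is matched before ".tiff".
    for suffix in sorted(CATEGORY_BY_SUFFIX, key=len, reverse=True):
        if name.endswith(suffix):
            return CATEGORY_BY_SUFFIX[suffix]
    return "uncategorized"


def inventory(root: str) -> dict[str, list[str]]:
    """Group every file under root by category, using paths relative to root."""
    groups = defaultdict(list)
    root_path = Path(root)
    for path in sorted(root_path.rglob("*")):
        if path.is_file():
            groups[categorize(path)].append(path.relative_to(root_path).as_posix())
    return dict(groups)


if __name__ == "__main__":
    import sys

    # Usage: python inventory.py ./data_folder
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for category, files in inventory(target).items():
        print(f"{category}: {len(files)} file(s)")
```

Files reported as "uncategorized" are not necessarily a problem; they are simply a prompt to decide whether they are supporting documentation worth sharing.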

Prepare Access

🔷 Your Role: Verify you have the necessary access to begin sharing data.

You'll need:

Active Synapse account with certification
Access to your grant-specific team
Python with synapseclient installed (for uploads)

Don't worry if you're new to Synapse - we have documentation to help you get started.

We can help confirm that your account is certified and that your team access has been granted.

Submit Study Information

🔷 Your Role: Document your study and dataset details. This information helps others understand the context and significance of your data.

You'll need to:

a. Access the sharing plan guidance document linked in the "sharing_plans" folder:

The guidance document helps you record information about your submission and includes links to the sheets noted below.

If your data must be shared under any access requirements, the guidance document ensures that critical governance information is properly captured.

b. Complete the Data Sharing Plan (DSP) in the "sharing_plans" folder. The DSP:

Outlines sharing restrictions
Documents governance requirements
Specifies access controls
Lists dataset identifiers
Links to IRB documentation if needed

c. Complete the Study Sheet in the "study" folder, which provides overall context for your research.

Take your time with these documents - they're crucial for helping others understand and use your data appropriately.

Confirm that all required fields are completed.

Notify the MC² Center

🔷 Your Role: Let us know when you're ready for review.

Contact the MC² Center data manager to:

Confirm your documentation is complete
Request folder setup for your files
Schedule a review meeting if needed

🔶 What We'll Do: We'll review your information to:

Verify documentation completeness
Identify any potential access control requirements or storage considerations to discuss
Plan appropriate structure for releasing your data

This is a collaborative step - we're here to help make sure everything is properly organized.

Review Storage and Sharing Requirements

🔶 What We'll Do: We'll assess whether any of the following conditions apply to your data:

Access requirements (including Data Use Agreements)
IRB documentation (for sharing human-derived data)
Storage in a specific repository

We'll schedule a meeting if:

You indicate that your data requires special access controls
The type of data you intend to share typically has access requirements applied
Documentation needs discussion
Your data might need to be stored in an alternative repository

This step confirms your data is shared appropriately while protecting sensitive information.

Create Folder Structure

🔶 What We'll Do: We'll set up a clear organization for your files:

CODE

Project/
├── data/                  # Your research data
├── study/                 # Study information
├── biospecimen/           # Sample metadata
├── model/                 # Model system data
├── sharing_plans/         # Data sharing info
└── governance/            # IRB documentation

The "data" directory is created as a single container. You can upload your files and existing directory structure here.

We'll link all necessary templates:

Study template → study/
DSP template → sharing_plans/
Component templates → relevant folders

This structure helps keep your data organized and accessible.

Upload Files

🔷 Your Role: Upload your files using the Synapse client.

Preferred method: Create and use a manifest file

Build your manifest.tsv:

Command:

CODE

synapse manifest --parent-id syn123 --manifest-file OUTPUT PATH

Parameters:

PATH : A path to a file or folder whose manifest will be generated
--parent-id : Synapse ID of the project or folder where the data will be uploaded
--manifest-file : A TSV output file path where the generated manifest is stored (defaults to stdout if not specified)

Example:

CODE

synapse manifest --parent-id syn789012 --manifest-file ./my_manifest.tsv ./data_folder

Upload using the manifest:

Command:

CODE

synapse sync FILE [--dryRun] [--sendMessages] [--retries INT]

Parameters:

FILE : A tsv file with file locations and metadata to be pushed to Synapse
--dryRun : Perform validation without uploading
--sendMessages : Send notifications via Synapse messaging (email) at specific intervals, on errors and on completion
--retries : Number of retries for failed uploads

Example:

CODE

synapse sync ./my_manifest.tsv

Options:

Use --dryRun to validate your manifest before actually uploading files
Use --sendMessages to receive email notifications about the upload process
Use --retries to specify a different number of retry attempts for failed uploads
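
Before running the upload, a quick local check of the manifest can catch common mistakes. The sketch below is an illustration only and does not replace --dryRun: it assumes the manifest is a tab-separated file with "path" and "parent" columns, that every path points to a local file, and that parent IDs look like "syn" followed by digits. The function name check_manifest is hypothetical.

```python
import csv
import re
from pathlib import Path

# Columns this sketch expects in the manifest (an assumption of the sketch).
REQUIRED_COLUMNS = ("path", "parent")
SYNAPSE_ID = re.compile(r"^syn\d+$")


def check_manifest(manifest_file: str) -> list[str]:
    """Return a list of human-readable problems found in a manifest TSV."""
    problems = []
    with open(manifest_file, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        columns = reader.fieldnames or []
        missing = [name for name in REQUIRED_COLUMNS if name not in columns]
        if missing:
            return [f"missing required column(s): {', '.join(missing)}"]
        for row in reader:
            line = reader.line_num  # line of the manifest the row was read from
            path = (row["path"] or "").strip()
            parent = (row["parent"] or "").strip()
            if not path or not Path(path).is_file():
                problems.append(f"line {line}: file not found: {path!r}")
            if not SYNAPSE_ID.match(parent):
                problems.append(f"line {line}: parent is not a Synapse ID: {parent!r}")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_manifest.py ./my_manifest.tsv
    issues = check_manifest(sys.argv[1])
    for issue in issues:
        print(issue)
    print("No problems found." if not issues else f"{len(issues)} problem(s) found.")
```

Relative paths are resolved against the directory the script is run from, so run it from the same location you will use for the upload.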

Using a manifest file is recommended for:

Multiple files: a single manifest upload is more efficient than uploading files one at a time
Large datasets: large transfers are easier to manage
Complex hierarchies: your folder structure is preserved
Tracking: the manifest provides a record of what was uploaded

Confirm that files appear in Synapse and that the file organization matches the plan.

Don't worry if this seems technical - we're here to help if you need guidance.

Set Up Datasets

🔶 What We'll Do:

Create Synapse Datasets to package your files, based on the entries in your Data Sharing Plan:

Command structure:

CODE

python build_datasets.py -d [DataDSP filepath or table Synapse Id] -n [Name for DSP CSV output]

Example:

CODE

python build_datasets.py -d syn45678910 -n updated_dsp.csv

We'll record Dataset identifiers in your sharing plan.

This organization makes your data easily discoverable and citable.

Verify that datasets are successfully created and that the keys are recorded in the manifest.

Review Organization

🔶 What We'll Do: Verify your uploaded content.

We'll confirm that files match the content described in your DSP, that related files are grouped in Datasets, and that file names follow standard conventions for discovery and reuse.

We don’t enforce a specific folder structure. However, the following is a basic high-level example of how research data is commonly organized:

CODE

data/
├── raw/              # Original unmodified data
├── processed/        # Data that has been transformed
└── analysis/         # Results and outputs

Your organization may differ based on your research needs and data types. We promote the use of standard naming conventions to make it easy for others to understand and use your data.

What Happens Next

After successful upload, we'll guide your data through these steps:

File Organization and Dataset Review

Files are grouped into datasets. We validate file placement, check naming conventions, verify folder structure, and confirm metadata completeness.

Access Control Setup

Files remain private while Datasets are prepared for release. During this time, access restrictions are configured and IRB documentation is verified.

Validation and Release

The release timeline, access requirements, and documentation are finalized. The metadata is validated and incorporated into Datasets for sharing alongside files.

Access Control Notes

Unless additional access requirements are necessary for responsible sharing, data will be released under Open Access.

If your data must be shared under an access requirement, or you are unsure whether an access requirement should be in place for your shared data, work with an MC² Center data manager to:

Identify sources of information to help determine applicable access requirements
Document the requirements for each data type in your sharing plan guidance document linked in the "sharing_plans" folder
Log Data Use Ontology (DUO) codes in your Data Sharing Plan:

Use column V "DSP Data Use Codes"

DUO codes are standardized terms that clearly specify how data can be used. Examples of DUO codes include:

NRES (DUO:0000004 - No restrictions)
GRU (DUO:0000042 - General research use)
NPU (DUO:0000045 - Not-for-profit use only)
HMB (DUO:0000006 - Health or medical or biomedical research only)
DS (DUO:0000007 - Disease specific research)
GS (DUO:0000022 - Geographical restriction)

These codes help make sure your data is used according to your requirements and within ethical considerations. The MC² Center will use this information to ensure the appropriate access restrictions are implemented prior to the release of your data.
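
To illustrate how entries in the "DSP Data Use Codes" column could be checked before submission, the sketch below maps the example abbreviations listed above to their DUO identifiers. It assumes a cell holds comma-separated abbreviations or identifiers, which may not match the actual DSP format, and the function name parse_duo_codes is hypothetical. An entry reported as unrecognized is only missing from this short example table; it may still be a valid DUO term.

```python
# Abbreviation -> (DUO identifier, label), copied from the examples listed above.
DUO_CODES = {
    "NRES": ("DUO:0000004", "No restrictions"),
    "GRU": ("DUO:0000042", "General research use"),
    "NPU": ("DUO:0000045", "Not-for-profit use only"),
    "HMB": ("DUO:0000006", "Health or medical or biomedical research only"),
    "DS": ("DUO:0000007", "Disease specific research"),
    "GS": ("DUO:0000022", "Geographical restriction"),
}

# Reverse lookup so full identifiers such as "DUO:0000042" are accepted too.
ID_TO_ABBREVIATION = {duo_id: abbr for abbr, (duo_id, _label) in DUO_CODES.items()}


def parse_duo_codes(cell: str) -> tuple[list[str], list[str]]:
    """Split a cell into recognized DUO identifiers and unrecognized entries.

    Assumes entries are separated by commas (an assumption of this sketch).
    """
    recognized, unrecognized = [], []
    for entry in cell.split(","):
        token = entry.strip().upper()
        if not token:
            continue  # skip empty entries such as a trailing comma
        if token in DUO_CODES:
            recognized.append(DUO_CODES[token][0])
        elif token in ID_TO_ABBREVIATION:
            recognized.append(token)
        else:
            unrecognized.append(entry.strip())
    return recognized, unrecognized


if __name__ == "__main__":
    known, unknown = parse_duo_codes("GRU, NPU, XYZ")
    print("Recognized:", known)
    print("Needs review:", unknown)
```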

If you are unsure how to document your data sharing requirements or whether your data requires access controls, please contact the MC² Center for guidance.

Need Help?

We're here to support you through this process. Don't hesitate to Contact Us if you have questions or need guidance at any step.
