Files and Metadata: Ingress
Overview
This guide walks you through the process of sharing your research data through the MC2 Center. We understand that data sharing can seem complex, but we're here to help at every step. Our goal is to ensure your research data is properly organized, well documented, and accessible to the scientific community in a way that aligns with the FAIR principles (Findable, Accessible, Interoperable, and Reusable).
The process typically takes 2-4 weeks, depending on the complexity of your data and any access requirements. Throughout this time, we'll work together to:
Organize your data files effectively
Document your data with metadata (descriptive information that helps others find and understand your data)
Confirm appropriate access controls are in place
Make your data discoverable through the Cancer Complexity Knowledge Portal
Before You Begin
You'll need:
A Synapse account with certification
Access to your grant-specific project
Python with synapseclient installed (for file uploads)
Our team will provide:
Access to metadata templates
Guidance on data organization
Support throughout the process
Instructions
Identify Files to Share
🔷 Your Role: Review and gather your data files. Consider what other researchers would need to understand and reproduce your work.
Common data types include:
Raw/minimally processed data:
FASTQ files from sequencing
TIFF/OME-TIFF images
Raw instrument outputs
Processed data:
BAM files
BigWig files
Registered/segmented images
Analysis outputs:
Expression matrices
Cell-gene count matrices
Peaks files
Segmentation masks
Feature counts
Supporting documentation:
Instrument configuration files
Cell type atlases
Collection parameters
Protocol documentation
Remember: Including supporting files helps others understand your experimental setup and analysis pipeline.
Confirm that all relevant file types have been correctly identified and that the files are organized by type.
Prepare Access
🔷 Your Role: Verify you have the necessary access to begin sharing data.
You'll need:
Active Synapse account with certification
Access to your grant-specific team
Python with synapseclient installed (for uploads)
Don't worry if you're new to Synapse - we have documentation to help you get started.
We can help confirm that your account is certified and that your team access has been granted.
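If you'd like to confirm your setup from Python, the following is a minimal sketch using synapseclient; the personal access token and project ID are placeholders for your own values:

import synapseclient

# Log in with a personal access token (or configure ~/.synapseConfig)
syn = synapseclient.Synapse()
syn.login(authToken="YOUR_PERSONAL_ACCESS_TOKEN")

# Confirm which account you are logged in as
profile = syn.getUserProfile()
print("Logged in as:", profile["userName"])

# Confirm you can reach your grant-specific project (placeholder Synapse ID)
project = syn.get("syn123", downloadFile=False)
print("Project accessible:", project.name)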
Submit Study Information
🔷 Your Role: Document your study and dataset details. This information helps others understand the context and significance of your data.
You'll need to:
a. Access the sharing plan guidance document linked in the "sharing_plans" folder:
The guidance document helps you record information about your submission and includes links to the sheets noted below.
If your data must be shared under any access requirements, the guidance document ensures that critical governance information is properly captured.
b. The Data Sharing Plan (DSP) in the "sharing_plans" folder:
Outlines sharing restrictions
Documents governance requirements
Specifies access controls
Lists dataset identifiers
Links to IRB documentation if needed
c. The Study Sheet in the "study" folder, which provides overall context for your research.
Take your time with these documents - they're crucial for helping others understand and use your data appropriately.
Confirm that all required fields are completed.
Notify the MC2 Center
🔷 Your Role: Let us know when you're ready for review.
Contact the MC2 Center data manager to:
Confirm your documentation is complete
Request folder setup for your files
Schedule a review meeting if needed
🔶 What We'll Do: We'll review your information to:
Verify documentation completeness
Identify any potential access control requirements or storage considerations to discuss
Plan appropriate structure for releasing your data
This is a collaborative step - we're here to help make sure everything is properly organized.
Review Storage and Sharing Requirements
🔶 What We'll Do: We'll assess whether any of the following conditions apply to your data:
Access requirements (including Data Use Agreements)
IRB documentation (for sharing human-derived data)
Storage in a specific repository
We'll schedule a meeting if:
You indicate that your data requires special access controls
The type of data you intend to share typically has access requirements applied
Documentation needs discussion
Your data might need to be stored in an alternative repository
This step confirms your data is shared appropriately while protecting sensitive information.
Create Folder Structure
🔶 What We'll Do: We'll set up a clear organization for your files:
Project/
├── data/            # Your research data
├── study/           # Study information
├── biospecimen/     # Sample metadata
├── model/           # Model system data
├── sharing_plans/   # Data sharing info
└── governance/      # IRB documentation
The "data" directory is created as a single container. You can upload your files and existing directory structure here.
We'll link all necessary templates:
Study template → study/
DSP template → sharing_plans/
Component templates → relevant folders
This structure helps keep your data organized and accessible.
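For reference, this layout corresponds to ordinary Synapse Folder entities under your project. The sketch below shows roughly how such folders could be created with synapseclient; it is illustrative only (we set this up for you), and the project ID is a placeholder:

import synapseclient
from synapseclient import Folder

syn = synapseclient.login()  # uses cached credentials or ~/.synapseConfig

project_id = "syn123"  # placeholder: your grant-specific project
for name in ["data", "study", "biospecimen", "model", "sharing_plans", "governance"]:
    # Create (or retrieve) each top-level folder under the project
    folder = syn.store(Folder(name=name, parent=project_id))
    print(folder.name, folder.id)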
Upload Files
🔷 Your Role: Upload your files using the Synapse client.
Preferred method: Create and use a manifest file
Build your manifest.tsv:
Command:
synapse manifest --parent-id syn123 --manifest-file OUTPUT PATH
Parameters:
PATH: A path to a file or folder whose manifest will be generated
--parent-id: Synapse ID of the project or folder where the data will be uploaded
--manifest-file: A TSV output file path where the generated manifest is stored (defaults to stdout if not specified)
Example:
synapse manifest --parent-id syn789012 --manifest-file ./my_manifest.tsv ./data_folder
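The generated manifest is a plain tab-separated file; at minimum it contains a path column and a parent column, and your client version may add others. An illustrative example with placeholder paths and Synapse IDs (columns are tab-separated):

path	parent
./data_folder/sample1_R1.fastq.gz	syn789012
./data_folder/sample1_R2.fastq.gz	syn789012
./data_folder/sample1_counts.tsv	syn789012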
Upload using the manifest:
Command:
synapse sync FILE [--dryRun] [--sendMessages] [--retries INT]
Parameters:
FILE: A TSV file with file locations and metadata to be pushed to Synapse
--dryRun: Perform validation without uploading
--sendMessages: Send notifications via Synapse messaging (email) at specific intervals, on errors, and on completion
--retries: Number of retries for failed uploads
Example:
synapse sync ./my_manifest.tsv
Options:
Use --dryRun to validate your manifest before actually uploading files
Use --sendMessages to receive email notifications about the upload process
Use --retries to specify a different number of retry attempts for failed uploads
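If you prefer to run the upload from Python rather than the command line, the same manifest works with synapseutils; a minimal sketch (the manifest path is a placeholder):

import synapseclient
import synapseutils

syn = synapseclient.login()  # uses cached credentials or ~/.synapseConfig

# Upload everything listed in the manifest; flags mirror the CLI options above
synapseutils.syncToSynapse(
    syn,
    manifestFile="./my_manifest.tsv",
    dryRun=False,       # set True to validate without uploading
    sendMessages=True,  # email notifications on errors and completion
    retries=4,          # retry failed uploads
)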
Using a manifest file is recommended for:
Multiple files: more efficient than uploading files one at a time
Large datasets: easier handling of large transfers
Complex hierarchies: preserves your folder structure
Tracking: provides a record of what was uploaded
Confirm that files appear in Synapse and that the file organization matches the plan.
Don't worry if this seems technical - we're here to help if you need guidance.
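One quick way to confirm your files landed where you expected is to list the contents of the upload folder with synapseclient; a small sketch (the folder ID is a placeholder):

import synapseclient

syn = synapseclient.login()

# List folders and files directly under the upload location
for child in syn.getChildren("syn789012", includeTypes=["folder", "file"]):
    print(child["type"], child["id"], child["name"])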
Set Up Datasets
🔶 What We'll Do:
Create Synapse Datasets to package your files, based on the entries in your Data Sharing Plan:
Command structure:
python build_datasets.py -d [DataDSP filepath or table Synapse Id] -n [Name for DSP CSV output]
Example:
python build_datasets.py -d syn45678910 -n updated_dsp.csv
We'll record Dataset identifiers in your sharing plan.
This organization makes your data easily discoverable and citable.
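build_datasets.py is our script, but for reference, Dataset creation with the Python client looks roughly like the sketch below; all IDs and version numbers are placeholders, and the Dataset class requires a recent synapseclient release:

import synapseclient
from synapseclient import Dataset

syn = synapseclient.login()

# Package related files (by Synapse ID and version) into a Dataset
dataset = Dataset(
    name="Example dataset",
    parent="syn123",  # placeholder: your grant-specific project
    dataset_items=[
        {"entityId": "syn111", "versionNumber": 1},
        {"entityId": "syn222", "versionNumber": 1},
    ],
)
dataset = syn.store(dataset)
print("Created Dataset:", dataset.id)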
Verify that Datasets are successfully created and that the Dataset identifiers are recorded in the sharing plan.
Review Organization
🔶 What We'll Do: Verify your uploaded content.
We'll confirm that files match the content described in your DSP, related files are grouped in Datasets, and that file names follow standard conventions for discovery and reuse.
We don't enforce a specific folder structure; however, the following is a basic high-level example of how research data is commonly organized:
data/
├── raw/        # Original unmodified data
├── processed/  # Data that has been transformed
└── analysis/   # Results and outputs
Your organization may differ based on your research needs and data types. We promote the use of standard naming conventions to make it easy for others to understand and use your data.
What Happens Next
After successful upload, we'll guide your data through these steps:
File Organization and Dataset Review
Files are grouped into datasets. We validate file placement, check naming conventions, verify folder structure, and confirm metadata completeness.
Access Control Setup
Files will remain private while Datasets are prepared for release. During this time, access restrictions are configured and IRB documentation is verified.
Validation and Release
The release timeline, access requirements, and documentation are finalized. The metadata is validated and incorporated into Datasets for sharing alongside files.
Access Control Notes
Unless additional access requirements are necessary for responsible sharing, data will be released under Open Access.
If your data must be shared under an access requirement or you are unsure if an access requirement should be in place for your shared data, work with an MC2 Center data manager to:
Identify sources of information to help determine applicable access requirements
Document the requirements for each data type in your sharing plan guidance document linked in the "sharing_plans" folder
Log Data Use Ontology (DUO) codes in your Data Sharing Plan:
Use column V "DSP Data Use Codes"
DUO codes are standardized terms that clearly specify how data can be used. Examples of DUO codes include:
NRES (DUO:0000004 - No restrictions)
GRU (DUO:0000042 - General research use)
NPU (DUO:0000045 - Not-for-profit use only)
HMB (DUO:0000006 - Health or medical or biomedical research only)
DS (DUO:0000007 - Disease specific research)
GS (DUO:0000022 - Geographical restriction)
These codes help make sure your data is used according to your requirements and in line with ethical considerations. The MC2 Center will use this information to ensure that appropriate access restrictions are implemented prior to the release of your data.
If you are unsure how to document your data sharing requirements, or whether your data requires access controls, please contact the MC2 Center for guidance.
Need Help?
We're here to support you through this process. Don't hesitate to Contact Us if you have questions or need guidance at any step.