
Files and Metadata: Ingress


Overview

This guide walks you through the process of sharing your research data through the MC2 Center. We understand that data sharing can seem complex, but we're here to help at every step. Our goal is to provide valuable research data that is properly organized, well-documented, and accessible to the scientific community in a way that aligns with FAIR principles (Findable, Accessible, Interoperable, and Reusable).

The process typically takes 2-4 weeks, depending on the complexity of your data and any access requirements. Throughout this time, we'll work together to:

  • Organize your data files effectively

  • Document your data with metadata (descriptive information that helps others find and understand your data)

  • Confirm appropriate access controls are in place

  • Make your data discoverable through the Cancer Complexity Knowledge Portal

Before You Begin

You'll need:

  • A Synapse account with certification

  • Access to your grant-specific project

  • Python with synapseclient installed (for file uploads)

Our team will provide:

  • Access to metadata templates

  • Guidance on data organization

  • Support throughout the process

Instructions

  1. Identify Files to Share

🔷 Your Role: Review and gather your data files. Consider what other researchers would need to understand and reproduce your work.

Common data types include:

Raw/minimally processed data:

  • FASTQ files from sequencing

  • TIFF/OME-TIFF images

  • Raw instrument outputs

Processed data:

  • BAM files

  • BigWig files

  • Registered/segmented images

Analysis outputs:

  • Expression matrices

  • Cell-gene count matrices

  • Peaks files

  • Segmentation masks

  • Feature counts

Supporting documentation:

  • Instrument configuration files

  • Cell type atlases

  • Collection parameters

  • Protocol documentation

Remember: Including supporting files helps others understand your experimental setup and analysis pipeline.

Confirm that all relevant file types have been correctly identified and that the files are organized by type.
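
As a rough aid for this step, the sketch below groups file names by extension. The extension-to-category mapping is hypothetical and only reflects the examples listed above, so adjust it to your own data. Extensions alone cannot distinguish a raw TIFF from a registered or segmented one, so treat the output as a starting point for review.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical mapping based on the example file types in this guide.
# It is not an official MC2 Center classification; edit it for your data.
CATEGORY_BY_SUFFIX = {
    ".fastq": "raw",
    ".fastq.gz": "raw",
    ".tif": "raw",
    ".tiff": "raw",
    ".ome.tiff": "raw",
    ".bam": "processed",
    ".bw": "processed",
    ".bigwig": "processed",
}


def categorize(filename: str) -> str:
    """Return the category for a file name, or 'unclassified' if unknown."""
    name = PurePath(filename).name.lower()
    # Try longer suffixes first so compound extensions match before short ones.
    for suffix in sorted(CATEGORY_BY_SUFFIX, key=len, reverse=True):
        if name.endswith(suffix):
            return CATEGORY_BY_SUFFIX[suffix]
    return "unclassified"


def group_by_category(filenames):
    """Group file names into {category: [file names]}."""
    groups = defaultdict(list)
    for filename in filenames:
        groups[categorize(filename)].append(filename)
    return dict(groups)
```

Anything that lands in "unclassified" (for example, protocol documents or configuration files) needs a manual decision about where it belongs.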

  2. Prepare Access

🔷 Your Role: Verify you have the necessary access to begin sharing data.

You'll need:

  • Active Synapse account with certification

  • Access to your grant-specific team

  • Python with synapseclient installed (for uploads)

Don't worry if you're new to Synapse - we have documentation to help you get started.

We can help confirm that your account is certified and that your team access has been granted.

  3. Submit Study Information

🔷 Your Role: Document your study and dataset details. This information helps others understand the context and significance of your data.

You'll need to:

a. Access the sharing plan guidance document linked in the "sharing_plans" folder:

The guidance document helps you record information about your submission and includes links to the sheets noted below.

If your data must be shared under any access requirements, the guidance document ensures that critical governance information is properly captured.

b. Complete the Data Sharing Plan (DSP) in the "sharing_plans" folder. The DSP:

  • Outlines sharing restrictions

  • Documents governance requirements

  • Specifies access controls

  • Lists dataset identifiers

  • Links to IRB documentation if needed

c. Complete the Study Sheet in the "study" folder, which provides overall context for your research.

Take your time with these documents - they're crucial for helping others understand and use your data appropriately.

  • Confirm that all required fields are completed

  4. Notify the MC2 Center

🔷 Your Role: Let us know when you're ready for review.

Contact the MC2 Center data manager to:

  • Confirm your documentation is complete

  • Request folder setup for your files

  • Schedule a review meeting if needed

🔶 What We'll Do: We'll review your information to:

  • Verify documentation completeness

  • Identify any potential access control requirements or storage considerations to discuss

  • Plan appropriate structure for releasing your data

This is a collaborative step - we're here to help make sure everything is properly organized.

  5. Review Storage and Sharing Requirements

🔶 What We'll Do: We'll assess if the following conditions may apply to your data:

  • Access requirements (including Data Use Agreements)

  • IRB documentation (for sharing human-derived data)

  • Storage in a specific repository

We'll schedule a meeting if:

  • You indicate that your data requires special access controls

  • The type of data you intend to share typically has access requirements applied

  • Documentation needs discussion

  • Your data might need to be stored in an alternative repository

This step confirms your data is shared appropriately while protecting sensitive information.

  6. Create Folder Structure

🔶 What We'll Do: We'll set up a clear organization for your files:

CODE
Project/
├── data/                  # Your research data
├── study/                 # Study information
├── biospecimen/           # Sample metadata
├── model/                 # Model system data
├── sharing_plans/         # Data sharing info
└── governance/            # IRB documentation

The 'data' directory is created as a single container. You can upload your files and existing directory structure here.

We'll link all necessary templates:

  • Study template → study/

  • DSP template → sharing_plans/

  • Component templates → relevant folders

This structure helps keep your data organized and accessible.

  7. Upload Files

🔷 Your Role: Upload your files using the Synapse client.

Preferred method: Create and use a manifest file

  1. Build your manifest.tsv:

Command:

CODE
synapse manifest --parent-id syn123 --manifest-file OUTPUT PATH

Parameters:

  • PATH : A path to a file or folder whose manifest will be generated

  • --parent-id : Synapse ID of project or folder where to upload data

  • --manifest-file : A TSV output file path where the generated manifest is stored (defaults to stdout if not specified)

Example:

CODE
synapse manifest --parent-id syn789012 --manifest-file ./my_manifest.tsv ./data_folder

  2. Upload using the manifest:

Command:

CODE
synapse sync FILE [--dryRun] [--sendMessages] [--retries INT]

Parameters:

  • FILE : A tsv file with file locations and metadata to be pushed to Synapse

  • --dryRun: Perform validation without uploading

  • --sendMessages : Send notifications via Synapse messaging (email) at specific intervals, on errors and on completion

  • --retries: Number of retries for failed uploads

Example:

CODE
synapse sync ./my_manifest.tsv

Options:

  • Use --dryRun to validate your manifest before actually uploading files

  • Use --sendMessages to receive email notifications about the upload process

  • Use --retries to specify a different number of retry attempts for failed uploads
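
Before running a dry run, a quick local check of the manifest can catch simple mistakes. The sketch below is not part of the Synapse client. It assumes the manifest is a tab-separated file with `path` and `parent` columns (our understanding of the format that `synapse sync` reads), that every `path` is a local file, and that every `parent` looks like a Synapse ID such as syn123. The function name `check_manifest` is hypothetical.

```python
import csv
import os

# Assumed required columns for a sync manifest.
REQUIRED_COLUMNS = ("path", "parent")


def check_manifest(manifest_path):
    """Return a list of human-readable problems found in a manifest TSV."""
    problems = []
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            path = (row["path"] or "").strip()
            parent = (row["parent"] or "").strip()
            # Relative paths are resolved against the current directory.
            if not os.path.isfile(path):
                problems.append(f"line {line_no}: file not found: {path}")
            if not (parent.startswith("syn") and parent[3:].isdigit()):
                problems.append(
                    f"line {line_no}: parent is not a Synapse ID: {parent}"
                )
    return problems
```

An empty list means the basic checks passed. It does not replace `--dryRun`, which performs the client's own validation.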

Using a manifest file is recommended for:

  • Multiple files because it is more efficient than individual uploads

  • Large datasets for ease of handling large transfers

  • Complex hierarchies to maintain folder structure

  • Tracking to provide a record of what was uploaded

Confirm that files appear in Synapse and that the file organization matches the plan
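
One way to make this confirmation less manual is to compare the manifest against a list of file names you see in Synapse. The sketch below assumes you have already collected those names yourself (for example, by copying them from the folder view) and that each uploaded file is named after the manifest's optional `name` column or, when that is empty, the base name of its `path`. The helper `missing_uploads` is hypothetical.

```python
import csv
import os


def missing_uploads(manifest_path, uploaded_names):
    """Return expected file names from the manifest that are not in uploaded_names."""
    uploaded = set(uploaded_names)
    missing = []
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Prefer an explicit name; otherwise fall back to the file's base name.
            explicit = (row.get("name") or "").strip()
            expected = explicit or os.path.basename((row.get("path") or "").strip())
            if expected and expected not in uploaded:
                missing.append(expected)
    return missing
```

If the returned list is empty, every file in the manifest has a matching name in your list. This checks names only, not folder placement.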

Don't worry if this seems technical - we're here to help if you need guidance.

  8. Set Up Datasets

🔶 What We'll Do:

Create Synapse Datasets to package your files, based on the entries in your Data Sharing Plan:

Command structure:

CODE
python build_datasets.py -d [DataDSP filepath or table Synapse Id] -n [Name for DSP CSV output]

Example:

CODE
python build_datasets.py -d syn45678910 -n updated_dsp.csv

We'll record Dataset identifiers in your sharing plan.

This organization makes your data easily discoverable and citable.

Verify that Datasets are successfully created and that their identifiers are recorded in your sharing plan.

  9. Review Organization

🔶 What We'll Do: Verify your uploaded content.

We'll confirm that files match the content described in your DSP, related files are grouped in Datasets, and that file names follow standard conventions for discovery and reuse.

We don't enforce a specific folder structure. However, the following is a basic high-level example of how research data is commonly organized:

CODE
data/
├── raw/              # Original unmodified data
├── processed/        # Data that has been transformed
└── analysis/         # Results and outputs

Your organization may differ based on your research needs and data types. We promote the use of standard naming conventions to make it easy for others to understand and use your data.
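
This guide does not define a specific naming rule, so the sketch below uses a hypothetical one purely for illustration: names start with a letter or digit and contain only letters, digits, dots, underscores, and hyphens. Replace the pattern with whatever convention you and the MC2 Center agree on.

```python
import re

# Hypothetical convention, not an MC2 Center requirement:
# start with a letter or digit, then letters, digits, '.', '_' or '-'.
SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


def flag_names(filenames):
    """Return the file names that do not match the assumed convention."""
    return [name for name in filenames if not SAFE_NAME.match(name)]
```

For example, a name containing spaces or parentheses would be flagged, while a name like sample1_R1.fastq.gz would pass.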

What Happens Next

After successful upload, we'll guide your data through these steps:

  1. File Organization and Dataset Review

Files are grouped into datasets. We validate file placement, check naming conventions, verify folder structure, and confirm metadata completeness.

  2. Access Control Setup

Files remain private while Datasets are prepared for release. During this time, access restrictions are configured and IRB documentation is verified.

  3. Validation and Release

The release timeline, access requirements, and documentation are finalized. The metadata is validated and incorporated into Datasets for sharing alongside files.

Access Control Notes

Unless additional access requirements are necessary for responsible sharing, data will be released under Open Access.

If your data must be shared under an access requirement or you are unsure if an access requirement should be in place for your shared data, work with an MC2 Center data manager to:

  1. Identify sources of information to help determine applicable access requirements

  2. Document the requirements for each data type in your sharing plan guidance document linked in the "sharing_plans" folder

  3. Log Data Use Ontology (DUO) codes in your Data Sharing Plan:

  • Use column V "DSP Data Use Codes"

DUO codes are standardized terms that clearly specify how data can be used. Examples of DUO codes include:

  • NRES (DUO:0000004 - No restrictions)

  • GRU (DUO:0000042 - General research use)

  • NPU (DUO:0000045 - Not-for-profit use only)

  • HMB (DUO:0000006 - Health or medical or biomedical research only)

  • DS (DUO:0000007 - Disease specific research)

  • GS (DUO:0000022 - Geographical restriction)

These codes help make sure your data is used according to your requirements and within ethical considerations. The MC2 Center will use this information to ensure the appropriate access restrictions are implemented prior to the release of your data.
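
If you want to double-check the codes you enter, the sketch below maps the example abbreviations listed above to their DUO identifiers. It covers only those six examples, and it assumes a cell holds comma-separated abbreviations, which may not match how your Data Sharing Plan is formatted. The function name `resolve_duo` is hypothetical.

```python
# Only the example codes listed in this guide; DUO defines more terms.
DUO_CODES = {
    "NRES": "DUO:0000004",  # No restrictions
    "GRU": "DUO:0000042",   # General research use
    "NPU": "DUO:0000045",   # Not-for-profit use only
    "HMB": "DUO:0000006",   # Health or medical or biomedical research only
    "DS": "DUO:0000007",    # Disease specific research
    "GS": "DUO:0000022",    # Geographical restriction
}


def resolve_duo(cell_value):
    """Split a comma-separated cell into (resolved DUO IDs, unknown tokens)."""
    resolved, unknown = [], []
    for token in cell_value.split(","):
        token = token.strip()
        if not token:
            continue
        duo_id = DUO_CODES.get(token.upper())
        if duo_id:
            resolved.append(duo_id)
        else:
            unknown.append(token)
    return resolved, unknown
```

Any token returned as unknown is either a typo or a DUO term outside this short list, and is worth raising with an MC2 Center data manager.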

If you are unsure how to document your data sharing requirements or whether your data requires access controls, please contact the MC2 Center for guidance.

Need Help?

We're here to support you through this process. Don't hesitate to contact us if you have questions or need guidance at any step.
