Files and Metadata: Annotation and Organization

Overview

This guide explains how to document and organize your research data using metadata: descriptive information that helps others find, understand, and reuse your work.

The process typically takes 1-2 weeks, depending on:

Amount of data to document
Complexity of your experimental setup
Number of templates needed
Any validation issues that need addressing

The process involves collaboration between you (the contributor) and MC² Center staff so your data is:

Well-documented with standardized descriptions
Properly organized for easy access
Linked together in meaningful ways
Ready for others to discover and reuse

This approach follows FAIR principles, making your data:

Findable: Others can discover your data
Accessible: Clear access requirements
Interoperable: Uses standard formats
Reusable: Well-documented for reuse

CRITICAL: Working with Templates

Throughout this process, you'll work with metadata templates provided as Google Sheets. These templates help capture important information about your data in a standardized format. To ensure your metadata can be processed correctly:

ONLY record metadata in templates linked in your Synapse Project

Why This Process Matters

Each step in this process serves an important purpose:

Folder Organization

Makes data easy to find
Supports automated processing

Metadata Templates

Captures essential information
Enables data discovery

Submission Order

Maintains data relationships
Prevents missing links
Supports validation

Component IDs

Create clear relationships
Support data tracking
Allow future updates

Understanding Metadata Types

We use two primary types of metadata:

Record-based Metadata

Describes things like:
- Study information
- Human participants or model systems
- Biospecimens
- Experimental details
- Patient demographics
- Cell line information
- Sample processing methods
- Imaging parameters

File-based Metadata

Describes the actual data files such as:
- FASTQ files from sequencing
- Microscopy images
- Analysis results
- Supporting documentation

These two metadata types work together to tell the complete story of your research:

Study
  ├── Participants/Models ─┐
  ├── Biospecimens ────────┼──> Data Files
  └── Experimental Setup ──┘

Required Access

For Contributors:

Access to your grant-specific Synapse project
Access to metadata templates (provided by MC² Center)

For MC² Center Staff:

Administrative access to Synapse
Access to schematic CLI tools
Access to validation scripts

Instructions

Understand Your Data Organization

🔷 Contributor Role: Review how your Synapse project is structured.

Content should be organized in a standard folder structure:

Project/
├── data/                  # Your research data files
├── studies/               # Study information
├── biospecimens/          # Sample metadata
├── models/                # Model system and cell line metadata
├── individuals/           # Human patient metadata
├── sharing_plans/         # Data sharing info
├── governance/            # Governance documentation
├── publications/          # Metadata for publications
├── datasets/              # Metadata for released datasets
├── tools/                 # Metadata for released tools
└── education/             # Metadata for released educational resources
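
As a quick self-check, the layout above can be compared against a list of folder names copied from your project. This is a minimal sketch: the function name and the idea of passing folder names as plain strings are assumptions made for illustration, and the code does not connect to Synapse.

```python
# Standard top-level folders, copied from the structure shown above.
STANDARD_FOLDERS = [
    "data", "studies", "biospecimens", "models", "individuals",
    "sharing_plans", "governance", "publications", "datasets",
    "tools", "education",
]


def missing_standard_folders(existing):
    """Return the standard folders absent from `existing`, in standard order."""
    # Normalize names so that "Data/" and "data" compare equal.
    present = {name.strip().rstrip("/").lower() for name in existing}
    return [folder for folder in STANDARD_FOLDERS if folder not in present]


# Hypothetical project that so far contains only three folders.
print(missing_standard_folders(["data/", "studies/", "biospecimens/"]))
```

An empty list means every standard folder is present.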

Access Your Templates

🔷 Contributor Role: Get your metadata templates.

To access your available metadata templates, navigate to the relevant folder in your Synapse project and select the linked sheet.

Example:

To access the Biospecimen template, open the biospecimens/ folder
To access the File View template, open the data/ folder

The linked template will be named according to the format: [grant number]_[data type]_[version]

Example: CA123456_Biospecimen_v10.0.0
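
If you track several templates, the naming format can be split into its parts programmatically. The sketch below assumes that grant numbers contain no underscores and that versions are written as "v" followed by dot-separated digits, as in the example above; the function name is hypothetical.

```python
import re

# Assumed shape: [grant number]_[data type]_v[dotted version digits].
TEMPLATE_NAME = re.compile(
    r"^(?P<grant>[^_]+)_(?P<data_type>.+)_v(?P<version>\d+(?:\.\d+)*)$"
)


def parse_template_name(name):
    """Split a template name into its parts, or return None if it does not match."""
    match = TEMPLATE_NAME.match(name.strip())
    if match is None:
        return None
    return match.groupdict()


print(parse_template_name("CA123456_Biospecimen_v10.0.0"))
# {'grant': 'CA123456', 'data_type': 'Biospecimen', 'version': '10.0.0'}
```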

Your Data Sharing Plan documents which metadata templates to complete for your datasets.

For metadata types that apply to more than one dataset (e.g., Biospecimen, File View, Individual, Model), additional rows will be added to your Data Sharing Plan.
For all Data Sharing Plan entries, the applicable metadata templates will be linked in column Y, “DSP Dataset Metadata”.

Follow the Completion Order

🔷 Contributor Role: Complete templates in this order:

1. Model/Individual information (if applicable)

Describes your experimental system
Must come before Biospecimen data

2. Biospecimen information (if applicable)

Links samples to models/individuals
Must come before file metadata

3. File View metadata

Describes your actual data files
Links files to samples and study

4. Assay-specific metadata

Additional details about specific methods
Example: Imaging or sequencing parameters
Please contact the MC² Center for guidance on preparing assay-specific metadata

5. Resource metadata (can be submitted independently of metadata listed above)

Information about publications, datasets, computational tools, and educational resources associated with your grant

Why this order matters:

Assay-specific metadata typically includes “Key”-type attributes that link files to related information.
Preparing and submitting metadata in this order helps ensure that those links can be made correctly.

Study ID
  ├── Model/Individual ID
  │     └── Biospecimen ID
  │           └── File ID
  └── Dataset ID
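
The completion order can also be expressed as a small sorting helper, which may be handy if you keep a checklist of outstanding templates. This is only a sketch: the short labels are hypothetical names invented here, and resource metadata is left out because it can be submitted independently.

```python
# Order taken from the list above; the labels are hypothetical shorthand.
COMPLETION_ORDER = ["model_individual", "biospecimen", "file_view", "assay_specific"]


def order_templates(templates):
    """Sort the templates you need into the recommended completion order."""
    unknown = [name for name in templates if name not in COMPLETION_ORDER]
    if unknown:
        raise ValueError(f"Unrecognized template types: {unknown}")
    return sorted(templates, key=COMPLETION_ORDER.index)


print(order_templates(["file_view", "assay_specific", "biospecimen"]))
# ['biospecimen', 'file_view', 'assay_specific']
```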

Record Your Metadata

🔷 Contributor Role: Fill out your templates.

For each template:

Look for field descriptions in column headers

Hover over column names for detailed descriptions
Required fields are highlighted in blue
Optional fields provide additional context

Check “Sheet 2” for valid values:

Click “Sheet 2” tab at bottom (unhide if needed)
Find your column of interest
Use exact values from the “Valid Values” column

Use comma-separated lists for multiple values

Example: "RNA-seq, ATAC-seq, ChIP-seq"
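
To see why exact values matter, here is a rough illustration of how a comma-separated cell might be checked against a valid-values list. The function names and the sample list are assumptions for this sketch; always take the real list from “Sheet 2” of your template.

```python
def split_values(cell):
    """Split a comma-separated cell into trimmed, non-empty values."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def invalid_values(cell, valid_values):
    """Return the entries in `cell` that do not exactly match a valid value."""
    allowed = set(valid_values)
    return [value for value in split_values(cell) if value not in allowed]


# Hypothetical valid values, for illustration only.
valid = ["RNA-seq", "ATAC-seq", "ChIP-seq"]
print(invalid_values("RNA-seq, ATAC-seq, Chip-seq", valid))
# ['Chip-seq']
```

The comparison is case-sensitive on purpose, so "Chip-seq" is flagged even though it differs from "ChIP-seq" only in capitalization.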

Common information sources:

Data files themselves
Quality control reports
Lab notebooks
Protocol documents
Analysis outputs
Publications

Upload any reference documents you used to the documentation folder

Understanding Component IDs

🔷 Contributor Role: Each entry needs a unique ID.

IDs follow these patterns:

Study: [Grant number]-[Journal/Type]-[Date]
Model: [Grant number]-M[Number]
Individual: [Grant number]-IND[Number]
Biospecimen: [Parent ID]-B[Number]

Example flow:

Study: GRANT123-CELL-2024
   └── Model: GRANT123-M1
        └── Biospecimen: GRANT123-M1-B1
             └── File: syn789012 (Synapse ID)
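
The biospecimen pattern builds on its parent ID, which the following sketch makes explicit. It assumes grant numbers contain no hyphens and that the numeric suffixes are plain integers, as in the example flow above; the function names are hypothetical.

```python
import re

# Patterns assumed from the examples above.
PARENT_ID = re.compile(r"^[^-]+-(?:M|IND)\d+$")
BIOSPECIMEN_ID = re.compile(r"^(?P<parent>[^-]+-(?:M|IND)\d+)-B\d+$")


def make_biospecimen_id(parent_id, number):
    """Build a biospecimen ID from a Model or Individual ID."""
    if PARENT_ID.match(parent_id) is None:
        raise ValueError(f"Not a Model or Individual ID: {parent_id}")
    return f"{parent_id}-B{int(number)}"


def parent_of(biospecimen_id):
    """Return the Model or Individual ID embedded in a biospecimen ID, or None."""
    match = BIOSPECIMEN_ID.match(biospecimen_id)
    return match.group("parent") if match else None


print(make_biospecimen_id("GRANT123-M1", 1))  # GRANT123-M1-B1
print(parent_of("GRANT123-IND1-B1"))          # GRANT123-IND1
```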

Submit for Validation

🔷 Contributor Role: Let MC² Center know when you're done.

If you are a contributor, notify the MC² Center that your templates are complete and STOP here.

🔶 MC² Center Role: We will:

Download your completed templates
Run validation checks to ensure:

All required fields are complete
Values match expected formats
IDs and keys are properly linked
Relationships between records are valid

Provide feedback if updates are needed
Upload validated metadata to Synapse

If validation fails:

We'll provide detailed feedback about:

Which fields need attention
What the specific issues are
How to correct the problems

You can then:

Make the requested updates
Ask questions if anything is unclear
Resubmit for validation

Common validation issues to watch for:

Missing required fields
Incorrect date formats
Invalid ID patterns
Missing relationships between records
Incorrect terms or spellings
Values not from approved list
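
Two of these issues, missing required fields and missing relationships between records, are easy to pre-check yourself before you notify the MC² Center. The sketch below is not the validation the MC² Center runs; it works on rows represented as plain dictionaries, and the column names in the usage example are hypothetical.

```python
def is_blank(value):
    """Treat None and whitespace-only values as empty."""
    return value is None or not str(value).strip()


def find_issues(rows, required_fields, parent_field=None, known_parent_ids=()):
    """Return human-readable issues for a list of metadata rows (dicts)."""
    issues = []
    known = set(known_parent_ids)
    for index, row in enumerate(rows, start=1):
        # Missing required fields.
        for field in required_fields:
            if is_blank(row.get(field)):
                issues.append(f"row {index}: missing required field '{field}'")
        # Missing relationships: the parent ID must exist in the parent template.
        if parent_field is not None and not is_blank(row.get(parent_field)):
            parent = str(row[parent_field]).strip()
            if parent not in known:
                issues.append(f"row {index}: parent '{parent}' not found")
    return issues


# Hypothetical biospecimen rows checked against known model IDs.
rows = [
    {"Biospecimen Id": "GRANT123-M1-B1", "Parent Id": "GRANT123-M1"},
    {"Biospecimen Id": "", "Parent Id": "GRANT123-M2"},
]
for issue in find_issues(rows, ["Biospecimen Id"], "Parent Id", ["GRANT123-M1"]):
    print(issue)
```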

After Validation

🔶 MC² Center Role: Once validation is successful, we:

Convert metadata to proper format
Upload to Synapse project
Apply metadata to relevant files
Update portal database
Prepare for eventual release

This process ensures your data is:

Properly documented
Correctly linked
Ready for discovery
Prepared for sharing

Example Workflows

Example 1: Imaging Dataset

A researcher wants to share microscopy data with some participant information:

Complete templates in order:

1. Individual template (participant info)
2. Biospecimen template (sample info)
3. File View template (file info)
4. Imaging Channel template
5. Imaging Level 2 template

Link everything together:

Study (GRANT123-IMG-2024)
├── Individual (GRANT123-IND1)
│     └── Biospecimen (GRANT123-IND1-B1)
│           └── Image Files (syn789012)
└── Dataset (syn456789)
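
One way to picture these links is as a chain of parent references that can be followed from a file back to its study. The sketch below uses the IDs from this example; the dictionary layout and the function name are assumptions for illustration, and the Dataset entry is omitted because it hangs directly off the study.

```python
# Each record ID maps to its parent ID; the study is the root.
PARENTS = {
    "syn789012": "GRANT123-IND1-B1",        # image file -> biospecimen
    "GRANT123-IND1-B1": "GRANT123-IND1",    # biospecimen -> individual
    "GRANT123-IND1": "GRANT123-IMG-2024",   # individual -> study
    "GRANT123-IMG-2024": None,              # study has no parent
}


def lineage(record_id, parents):
    """Walk from a record up to its study, returning the chain of IDs."""
    chain = []
    seen = set()
    current = record_id
    while current is not None:
        if current in seen:
            raise ValueError(f"Cycle detected at {current}")
        if current not in parents:
            raise KeyError(f"No record found for {current}")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain


print(lineage("syn789012", PARENTS))
```

A missing record raises an error, which is the situation the completion order is meant to prevent.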

Example 2: GeoMx Dataset

A researcher wants to share spatial genomics data:

Upload supporting files first:

Experimental config file
Probe config file
Lab worksheet
ROI coordinate files, if applicable

Complete templates in order:

1. Individual template (participant info)
2. Biospecimen template (sample info)
3. File View template
4. Imaging Channel template
5. ROI/segment template
6. GeoMx Auxiliary files template
7. GeoMx Level 1 template
8. GeoMx Level 2 template
9. GeoMx Level 3 template
10. GeoMx Imaging template

Example organization:

Project
  ├── biospecimens
  │     └── Biospecimen metadata template
  ├── individuals
  │     └── Individual metadata template
  ├── imaging_channel
  │     └── Imaging channel metadata template
  └── [study_id]/
        ├── File View metadata template
        ├── Auxiliary files/
        │     └── Auxiliary files metadata template
        ├── ROI Data/
        │     └── ROI metadata template
        ├── GeoMx Level 1/
        │     └── GeoMx Level 1 metadata template
        ├── GeoMx Level 2/
        │     └── GeoMx Level 2 metadata template
        ├── GeoMx Level 3/
        │     └── GeoMx Level 3 metadata template
        └── GeoMx Imaging/
              └── GeoMx Imaging metadata template

Validation Process Timeline

The validation process typically takes:

Initial review: 1-2 business days
Each revision cycle: 1-2 business days
Final validation: 1-2 business days

Factors that can affect timing:

Number of templates to validate
Complexity of relationships
Number of validation issues
Response time for revisions
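
For planning purposes, the ranges above can be added up for a given number of revision cycles. This is a rough sketch with a hypothetical function name; it counts only MC² Center review time and leaves out the time you spend making revisions.

```python
def estimate_business_days(revision_cycles):
    """Return (minimum, maximum) business days of review time."""
    if revision_cycles < 0:
        raise ValueError("revision_cycles must be zero or greater")
    # One initial review, one final validation, plus each revision cycle,
    # each taking 1-2 business days.
    stages = 2 + revision_cycles
    return stages * 1, stages * 2


print(estimate_business_days(1))  # (3, 6)
```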

Example Templates

Resource metadata example templates:

Need Help?

We're here to support you through this process. Don't hesitate to contact us if you have questions or need guidance at any step.
