Files and Metadata: Packaging and Release | Cancer Complexity Knowledge Portal Docs

Overview

This guide explains how to prepare and release datasets in Synapse. It covers organizing metadata, setting up access controls, and making data available to the research community. The process ensures data is well-documented, properly controlled, and ready for reuse.

The process involves:

Organizing metadata into shareable formats
Setting up appropriate access controls
Validating all relationships
Releasing data for access

Required Access

For MC² Center Staff:

Administrative access to Synapse
Access to metadata templates
Access to validation tools
Governance team contact

Instructions

Mint Dataset Digital Object Identifiers (DOIs)

🔶 MC² Center Role: Mint a DOI for each Dataset

The DOIs will be recorded in the DatasetView manifest during the next step
When creating a DOI, include at minimum the first author and the primary investigator
If other authors are known from a pre-print or the provided Study metadata, include their names as well
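
As a rough illustration of the author rule above, the creator list for a DOI can be assembled before it is entered in Synapse. This is a minimal sketch, not part of any MC² Center tooling: the function name `build_creator_list` and the example names are hypothetical, and it only orders and de-duplicates names.

```python
def build_creator_list(first_author, primary_investigator, other_authors=()):
    """Return creator names in order: first author, primary investigator, others.

    Hypothetical helper: blank names and case-insensitive duplicates are dropped.
    """
    if not first_author.strip() or not primary_investigator.strip():
        raise ValueError("first author and primary investigator are required")

    creators = []
    seen = set()
    for name in (first_author, primary_investigator, *other_authors):
        cleaned = " ".join(name.split())  # collapse stray whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            creators.append(cleaned)
    return creators


# Example with made-up names; the repeated first author is listed only once.
creators = build_creator_list(
    "Ada Lovelace", "Grace Hopper", ["ada lovelace", "Alan  Turing"]
)
print(creators)
```

If the first author is also the primary investigator, the list simply contains that name once.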

Prepare Dataset Annotations

🔶 MC² Center Role: Prepare Dataset View metadata for contributor review

Once datasets are ready for release, Dataset View records are entered in the Dataset View metadata template linked in your Synapse project and added to the Cancer Complexity Knowledge Portal database.

🔷 Contributor Role: Review Dataset View metadata

Apply Dataset Annotations

🔶 MC² Center Role: Run script table_to_annotations.py to annotate Datasets with Dataset View metadata

python table_to_annotations.py -t [Dataset Synapse Id] -v [DatasetView metadata table Synapse Id]

Annotate Files with Record Metadata

🔶 MC² Center Role: Run script table_to_annotations.py to extract and apply metadata to files contained in Datasets

python table_to_annotations.py -t [Dataset Synapse Id] -f [File View metadata table Synapse Id] -s [Biospecimen metadata table Synapse Id] -i [Individual metadata table Synapse Id] -m [Model metadata table Synapse Id]
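
The internals of table_to_annotations.py are not shown in this guide. To illustrate the general idea of turning one metadata table row into annotations, here is a minimal sketch under stated assumptions: rows are plain dictionaries, empty cells are skipped, and multi-value cells are comma-separated strings. The function name `row_to_annotations` and the column names in the example are hypothetical.

```python
def row_to_annotations(row, skip_columns=()):
    """Convert one metadata row (a dict) into an annotations dict.

    Illustrative only; this is not the logic of table_to_annotations.py.
    """
    annotations = {}
    for column, value in row.items():
        if column in skip_columns or value is None:
            continue
        text = str(value).strip()
        if not text:
            continue  # leave empty cells unannotated
        # Assumption: multi-value cells are comma-separated strings.
        parts = [part.strip() for part in text.split(",") if part.strip()]
        if parts:
            annotations[column] = parts  # every value is stored as a list
    return annotations


# Hypothetical row; real column names come from the metadata templates.
example_row = {
    "DatasetName": "Example dataset",
    "DatasetAssay": "RNA-seq, ATAC-seq",
    "DatasetDoi": "",
    "Component": "DatasetView",
}
print(row_to_annotations(example_row, skip_columns={"Component"}))
```

Storing every value as a list keeps single-value and multi-value fields uniform, which makes later comparisons against the source table simpler.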

Adjust Dataset schemas
- Remove from Dataset schema:
  - StudyKey
  - Id
  - FileViewId
  - EntityId
  - Component
  - Any other Component_Id attributes that were applied (retain Component Keys only)
  - Any other redundant fields
- From automatically applied annotations, only retain the following:
  - Id
  - Name
  - Path
  - currentVersion
  - dataFileSizeBytes
  - dataFileMD5Hex
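
The two lists above can be expressed as simple column filters. The sketch below is an illustration, not an official tool: it assumes that "Component_Id attributes" means column names ending in `_id`, it treats the metadata columns and the automatically applied columns as two separate lists, and it matches the automatically applied names case-insensitively. "Any other redundant fields" still needs a manual review.

```python
# Names taken from the removal list above.
METADATA_COLUMNS_TO_REMOVE = {"StudyKey", "Id", "FileViewId", "EntityId", "Component"}

# Names taken from the "only retain" list above, lowercased for matching.
DEFAULT_COLUMNS_TO_KEEP = {
    name.lower()
    for name in ("Id", "Name", "Path", "currentVersion",
                 "dataFileSizeBytes", "dataFileMD5Hex")
}


def trim_metadata_columns(columns):
    """Drop the listed metadata columns and any column ending in '_id'."""
    kept = []
    for name in columns:
        if name in METADATA_COLUMNS_TO_REMOVE:
            continue
        if name.lower().endswith("_id"):  # assumed Component_Id pattern
            continue
        kept.append(name)
    return kept


def trim_default_columns(columns):
    """Keep only the automatically applied columns named in the guide."""
    return [name for name in columns if name.lower() in DEFAULT_COLUMNS_TO_KEEP]


# Hypothetical column names for illustration.
print(trim_metadata_columns(["StudyKey", "DatasetView_id", "BiospecimenKey", "FileFormat"]))
print(trim_default_columns(["id", "name", "createdOn", "currentVersion"]))
```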

Configure Access Controls

🔶 MC² Center Role: Set up data access:

Review sharing requirements:
- Check Data Sharing Plan
- Verify IRB documentation
- Confirm institutional certification
Bind the appropriate access requirement (AR) JSON schema to the Synapse project or incorporate the AR JSON schema into data validation schemas.

Binding the AR schema is only required if the data will be released under access requirements.

Access Levels:
├── Open Access
│   ├── Anyone on the internet can view
│   └── Registered Synapse users can download
└── Access Requirements 
    ├── User must accept conditions of use to gain access
    │   to files (Conditional Access)
    └── User must submit an access request or provide
        documentation to gain access to files (Controlled Access)

Ensure annotations on Datasets and data storage folders align with the access requirement schema:
- Tag Datasets and data storage folders with appropriate annotations
- Configure folder-level restrictions
- Set file-specific controls if needed
Work with Governance to verify that access controls have been accurately applied

Validate Release Package

🔶 MC² Center Role: Perform final checks:

Metadata validation:
- All required fields present
- Keys properly linked
- Relationships valid
- No missing connections
Access control validation:
- DUO (Data Use Ontology) codes properly applied
- AR schema bound correctly
- Permissions working as expected
- Test access paths
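
The metadata checks above (required fields, linked keys) can be partly automated. The following is a minimal sketch and not the MC² Center validation tooling: it assumes each metadata table has been loaded as a list of dictionaries, that each cell holds a single key, and the function and column names are hypothetical.

```python
def missing_required_fields(row, required_fields):
    """Return the required fields that are absent or blank in one record."""
    missing = []
    for field in required_fields:
        value = row.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def find_unlinked_keys(child_rows, parent_rows, key_column):
    """Return keys referenced by child records that no parent record defines."""
    parent_keys = {row.get(key_column) for row in parent_rows}
    referenced = {row.get(key_column) for row in child_rows}
    referenced.discard(None)
    referenced.discard("")
    return sorted(referenced - parent_keys)


# Hypothetical records: one file points at a biospecimen that does not exist.
files = [
    {"FileName": "a.bam", "BiospecimenKey": "B1"},
    {"FileName": "b.bam", "BiospecimenKey": "B3"},
]
biospecimens = [{"BiospecimenKey": "B1"}, {"BiospecimenKey": "B2"}]
print(find_unlinked_keys(files, biospecimens, "BiospecimenKey"))
```

An empty result from both functions does not replace the access control checks, which still need to be tested against Synapse itself.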

Release Data

🔶 MC² Center Role: Make data available:

For Open Access data:
- Set permissions to public for Datasets and data storage folders
For Controlled Access data:
- Verify AR implementation
- Test access request process
- Document approval workflow
- Set permissions to public for Datasets and data storage folders
Final verification:
- Test all access paths
- Verify download functionality
- Check permission inheritance
- Ensure DatasetView metadata has been integrated into the Cancer Complexity Knowledge Portal staging tables
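
The release steps above can be collected into a single ordered checklist per access type. This is only a sketch for tracking purposes: the step text is copied from this guide, while the function name `release_checklist` and the `"open"` / `"controlled"` labels are hypothetical.

```python
OPEN_STEPS = [
    "Set permissions to public for Datasets and data storage folders",
]
CONTROLLED_STEPS = [
    "Verify AR implementation",
    "Test access request process",
    "Document approval workflow",
    "Set permissions to public for Datasets and data storage folders",
]
FINAL_VERIFICATION = [
    "Test all access paths",
    "Verify download functionality",
    "Check permission inheritance",
    "Ensure DatasetView metadata has been integrated into the "
    "Cancer Complexity Knowledge Portal staging tables",
]


def release_checklist(access_type):
    """Return the ordered release steps for 'open' or 'controlled' data."""
    steps_by_type = {"open": OPEN_STEPS, "controlled": CONTROLLED_STEPS}
    if access_type not in steps_by_type:
        raise ValueError(f"unknown access type: {access_type!r}")
    return steps_by_type[access_type] + FINAL_VERIFICATION


for number, step in enumerate(release_checklist("controlled"), start=1):
    print(f"{number}. {step}")
```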

Timeline Expectations

Metadata Organization: 2-3 business days

Metadata extraction and entity annotation
Relationship verification
Schema configuration

Access Setup: 1-2 business days

AR configuration
Permission testing

Validation: 1-2 business days

Metadata checks
Access verification
Final testing
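
For planning, the per-phase estimates can be added up. The sketch below assumes the three phases run one after another, which this guide does not state explicitly; if phases overlap, the total will be shorter.

```python
# (minimum, maximum) business days per phase, as listed above.
PHASES = {
    "Metadata Organization": (2, 3),
    "Access Setup": (1, 2),
    "Validation": (1, 2),
}


def total_range(phases):
    """Sum the minimum and maximum estimates across all phases."""
    minimum = sum(low for low, _ in phases.values())
    maximum = sum(high for _, high in phases.values())
    return minimum, maximum


low, high = total_range(PHASES)
print(f"Estimated total: {low}-{high} business days")  # 4-7 if run sequentially
```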

Common Issues and Solutions

Metadata Relationships

Issue: Missing key relationships
Solution: Use validation scripts to identify gaps
Prevention: Follow key naming conventions

Access Controls

Issue: Incorrect permission inheritance
Solution: Check folder hierarchy permissions
Prevention: Test access at each level
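
When checking inheritance, it helps to know which level actually governs each file. In Synapse, an entity without its own sharing settings inherits them from the nearest ancestor that has them. The sketch below models that lookup on an in-memory hierarchy; it does not call Synapse, and the function name, the `parents` mapping, and the entity names are hypothetical.

```python
def find_governing_entity(entity_id, parents, entities_with_local_acl):
    """Walk up the hierarchy until an entity with its own sharing settings is found.

    parents maps each entity to its parent; entities_with_local_acl is the set
    of entities that define their own sharing settings.
    """
    current = entity_id
    visited = set()
    while current not in entities_with_local_acl:
        if current in visited or current not in parents:
            raise ValueError(f"no sharing settings found above {entity_id!r}")
        visited.add(current)
        current = parents[current]
    return current


# Hypothetical hierarchy: one folder has its own settings, the other inherits.
parents = {
    "fileA": "folderOpen",
    "folderOpen": "project",
    "fileB": "folderRestricted",
    "folderRestricted": "project",
}
local = {"project", "folderRestricted"}
for entity in ("fileA", "fileB"):
    print(entity, "is governed by", find_governing_entity(entity, parents, local))
```

Comparing this expected result against what Synapse reports for each file is one way to spot a folder that was meant to carry its own settings but does not.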

Schema Display

Issue: Hidden required fields
Solution: Review schema configuration
Prevention: Use schema templates

Support

We're here to support you through this process. Don't hesitate to contact us if you have questions or need guidance at any step.