Data Submission Under the Genomic Data Sharing (GDS) Policy

Data Sharing Expectations

Data sharing allows data generated from one research study to be used to explore a range of additional research questions. Enabling the combination of data from multiple projects amplifies the scientific value of data.

Data reuse is easier when the data conform to accepted data standards. Conformance shortens the learning curve for researchers and reduces errors that arise from misunderstanding the data or metadata. Those depositing data in GDS repositories are encouraged to use existing, well-documented data standards, which help ensure the quality and usefulness of submitted datasets and make the submission process more efficient. Expectations include:

  • Data should generally be submitted once they have been cleaned (e.g., the analytical dataset is finalized).
  • Data pertinent to the interpretation of genomic data, such as associated phenotype data (e.g., clinical information), exposure data, and descriptive information (e.g., protocols or methodologies used), should be shared. Metadata around the experiment or study and annotations that are necessary to reproduce any published table or analysis must be included with genomic data submissions.
  • Descriptions of specimen acquisition, experimental procedures, and data processing and analysis methods (e.g., alignment algorithms, software versions) are required with data submission.
  • Terms for disease, cell type, tissue type, and other annotations should be linked to the NCI Thesaurus (NCIt).
  • If an NCIt identifier is not available, use another identifier, such as one from the Unified Medical Language System (UMLS) or a term from an existing ontology.

Wherever possible, use existing common data elements (CDEs). For clinical specimens, the data elements that would be included in reporting to ClinicalTrials.gov are required.
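
The identifier preference described above (NCIt first, then UMLS or another existing ontology) can be sketched as a small lookup. This is only an illustration under stated assumptions: the `Annotation` class, the `choose_annotation` function, and the placeholder identifiers are hypothetical and are not part of any NCI or NIH tool, and the tie-break among "other" ontologies (alphabetical) is an arbitrary choice made for determinism.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Preference order taken from the guidance above: NCIt first, then UMLS.
# Any other ontology is treated as a last resort.
PREFERRED_SOURCES = ("NCIt", "UMLS")


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one annotated term in a submission."""
    label: str        # human-readable term, e.g., a tissue type
    source: str       # ontology the identifier comes from
    identifier: str   # identifier within that ontology


def choose_annotation(label: str, candidates: Dict[str, str]) -> Optional[Annotation]:
    """Pick which identifier to attach to `label`.

    `candidates` maps an ontology name to the identifier found there.
    Returns None when no identifier is available at all.
    """
    for source in PREFERRED_SOURCES:
        if source in candidates:
            return Annotation(label, source, candidates[source])
    # Fall back to any other ontology; sorted order is an assumption
    # made here only so the result is deterministic.
    for source in sorted(candidates):
        return Annotation(label, source, candidates[source])
    return None


if __name__ == "__main__":
    # Placeholder identifiers, not real ontology codes.
    found = {"UMLS": "UMLS-PLACEHOLDER", "NCIt": "NCIT-PLACEHOLDER"}
    print(choose_annotation("example tissue type", found))
```

Keeping the preference order in a single tuple means a change in guidance would require editing only one line.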

Expectations for Data Submission Formats

Different data types undergo different levels of data processing, which set the expectations for data submission and data release. Table 1 describes the expectations for each level. NIH will review these expectations at regular intervals, publish updates on the GDS website, and notify the research community through appropriate communication methods (e.g., the NIH Guide for Grants and Contracts).

Note that information necessary to interpret controlled-access genomic data, such as study protocols, data instruments, and survey tools, should be submitted for sharing through unrestricted access, concurrent with the relevant Level 1, 2, 3, or 4 genomic data.

Different submission requirements apply depending on the level of genomic data:

  • Level 0: Raw data generated directly from the instrument platform. Data submission is not required for Level 0 genomic data.
  • Level 1: Initial sequence reads, the most fundamental form of the data after the basic translation of raw input.
  • Level 2: Data after an initial round of analysis or computation to clean the data and assess basic quality measures.
  • Level 3: Analysis to identify genetic variants, gene expression patterns, or other features of the dataset.
  • Level 4: Final analysis that relates the genomic data to phenotype or other biological states.
  • Metadata: Information about the experiment or study.

Table 1: Expectations for Data Submission Formats by Data Processing Level

SNP array data from > 500K single nucleotide polymorphisms (SNPs) (e.g., GWAS data)
  • Level 1: .CEL | .TXT | .IDAT (Note: submission of .IDAT files for human sample data will be decided on a case-by-case basis)
  • Level 2: N/A
  • Level 3: .TXT
  • Level 4: .TXT

DNA sequence data from < 100 genes or regions of interest (e.g., targeted sequencing)
  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: Arrays: .TXT; NGS: .MAF | .VCF | .PED
  • Level 4: .TXT

DNA sequence data from ≥ 100 genes, regions of interest (e.g., targeted sequencing, whole exome sequencing, whole genome sequencing)
  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: Arrays: .TXT; NGS: .MAF | .VCF | .PED
  • Level 4: .TXT

RNA sequencing (RNA-seq) data (e.g., transcriptomic and targeting RNAseq data)
  • Level 1: .FASTQ | .SFF | .HDFS | Complete Genomics native (Note: required for human sample data only)
  • Level 2: N/A
  • Level 3: Arrays: .TXT; NGS: .WIG | .TXT
  • Level 4: .TXT

Genome-wide DNA methylation data (e.g., bisulfite sequencing data)
  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: Arrays: .TXT; NGS: .MAF | .VCF | .TXT | .BED
  • Level 4: .TXT

Genome-wide chromatin immunoprecipitation sequencing (ChIP-seq) data (e.g., transcription factor ChIP-seq, histone modification ChIP-seq)
  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: Arrays: .TXT; NGS: .WIG | .TXT | .BED
  • Level 4: .TXT

Metagenome (or microbiome) sequencing data (e.g., 16S rRNA sequencing, shotgun metagenomics, whole-genome microbial sequencing)
  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: NGS: .WIG | .TXT
  • Level 4: .TXT

Metatranscriptome sequencing data (e.g., microbial/microbiome transcriptomics)
  • Level 1: N/A
  • Level 2: .BAM
  • Level 3: NGS: .WIG | .TXT
  • Level 4: .TXT

Metadata (all levels)
  • Metadata around the experiment or study and annotations that are necessary to reproduce any published table or analysis must be included with genomic data submissions. In particular, data pertinent to the interpretation of genomic data, such as associated phenotype data (e.g., clinical information), exposure data, relevant metadata, and descriptive information (e.g., protocols or methodologies used), are expected to be shared.
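
A submitter might encode part of Table 1 as a lookup to catch format mismatches before upload. The following is a minimal sketch, not an official validator: the keys `snp_array` and `dna_seq`, the function name `is_expected_format`, and the decision to merge the "Arrays" and "NGS" formats at Level 3 are all assumptions made for illustration. Only two rows of the table are included.

```python
from pathlib import PurePath

# Subset of Table 1, keyed by (data type, level).
# None stands for "N/A" in the table. The data-type keys are
# hypothetical short names, not official identifiers.
EXPECTED_FORMATS = {
    ("snp_array", 1): {".cel", ".txt", ".idat"},
    ("snp_array", 2): None,
    ("snp_array", 3): {".txt"},
    ("snp_array", 4): {".txt"},
    ("dna_seq", 1): None,
    ("dna_seq", 2): {".bam"},
    # Array (.TXT) and NGS (.MAF, .VCF, .PED) formats merged for simplicity.
    ("dna_seq", 3): {".txt", ".maf", ".vcf", ".ped"},
    ("dna_seq", 4): {".txt"},
}


def is_expected_format(data_type: str, level: int, filename: str) -> bool:
    """Return True if the file extension matches the table entry.

    Raises KeyError for a (data type, level) pair outside this subset.
    Only the final extension is checked, so compressed files such as
    "calls.vcf.gz" are not recognized by this sketch.
    """
    formats = EXPECTED_FORMATS[(data_type, level)]
    if formats is None:  # "N/A": no format is listed at this level
        return False
    return PurePath(filename).suffix.lower() in formats


if __name__ == "__main__":
    print(is_expected_format("dna_seq", 2, "tumor_sample.BAM"))
    print(is_expected_format("snp_array", 2, "genotypes.txt"))
```

The comparison is case-insensitive because the table lists extensions in upper case while files on disk often use lower case.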

The NIH National Center for Biotechnology Information (NCBI) provides general guidance for submitting data to NIH data repositories. More specific instructions for data submission, including data standards, are available for a number of NIH repositories.

Data Sharing Plans (DSPs)

Prior to the start of research covered by the GDS policy, all investigators must develop and have in place an approved data sharing plan (DSP). NCI expects that DSPs will be collected and reviewed at the earliest possible point. NCI staff will assess whether the project falls within the scope of the GDS policy and, if so, whether the DSP is adequate based on the NIH Guidance for Investigators in Developing Data Sharing Plans.

DSPs for Extramural Programs

Extramural investigators should submit their DSP as part of their funding application. As such, DSP requirements should be discussed as early in the pre-award process as possible. The approved DSP should be submitted at Just-in-Time (JIT), along with the Institutional Certification. Program directors must approve the DSP prior to funding.

DSPs for Intramural Programs

Differences in study type (e.g., studies involving model organisms) and how scientific review takes place within the NCI intramural research programs will dictate when the DSP can be reviewed.

  • Prospective Scientific Review: The DSP should be submitted to, and reviewed by, the scientific director (SD) (or delegate) and the genomic program administrator (GPA) at the time the funding decision is made.
  • Retrospective Scientific Review (e.g., quadrennial site visits): The DSP should be submitted to, and reviewed by, the SD (or delegate) prior to data generation.

Institutional Certification

The Institutional Certification assures that institutions planning to submit human genomic data to NIH will meet the expectations of the GDS policy. The certification, provided by the principal investigator and the institutional signing official (SO) of the submitting institution, clearly delineates any “data use limitations (DULs)” on the research use of the data, as expressed in the informed consent documents signed by study participants.

For multicenter studies (with samples collected at several institutions), NIH understands that the submitting institution is not necessarily the local institution or IRB of record for all sites. However, the submitting institution should assure NIH that it believes, based on either its own review or assurance from other institutions, that the expectations of the policy are met for the entire dataset. Institutions may choose to collect and submit a single-site certification from each site contributing samples or to submit one multi-site certification. Single-site and multi-site Institutional Certifications for both intramural and extramural studies can be found on the GDS website.

An Institutional Certification should be submitted at the earliest possible point. The certification should be provided to NCI prior to award, along with any other JIT information (for extramural researchers), or at the time of scientific review (for intramural researchers).

Submitting Data

Because NCI intramural and extramural programs operate differently, the data submission process depends on whether you are an intramural investigator or an NCI-funded grantee.

Data Release

Following data submission, the data may be accessible only to the submitting investigators and collaborators for a period not to exceed six months.

Data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after the submission process is initiated or at the time of first publication (whichever comes first).

The GPA will determine whether a shorter timeframe is warranted based on the status of the initial publication. Community resources could be released earlier than the six-month deferral period regardless of publication status.
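
The "whichever comes first" rule above amounts to taking the earlier of two dates. The sketch below works through that arithmetic under stated assumptions: "six months" is counted in calendar months, with the day clamped to the end of shorter months, and the function names `add_months` and `latest_release_date` are hypothetical. It does not model a shorter timeframe set by the GPA or the earlier release of community resources.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def latest_release_date(submission_started: date,
                        first_publication: Optional[date] = None) -> date:
    """Return the later bound on release: six months after submission
    starts, or the first publication date if that comes sooner."""
    deadline = add_months(submission_started, 6)
    if first_publication is None:
        return deadline
    return min(deadline, first_publication)


if __name__ == "__main__":
    started = date(2017, 2, 16)
    print(latest_release_date(started))                    # 2017-08-16
    print(latest_release_date(started, date(2017, 5, 1)))  # 2017-05-01
```

For a submission initiated on February 16, 2017, the deferral ends on August 16, 2017 unless a first publication appears before then.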

  • Updated: February 16, 2017
