Fred Hutchinson Cancer Research Center – 2
Transcript Splicing in Ovarian Cancer and in Diverse Normal Tissues
Christopher Kemp, Ph.D.
To identify transcript isoforms that may be unique to, or over-expressed in, ovarian cancer, we gathered RNA-seq data from a large number of serous ovarian tumors, related cell lines, and normal tissues. Sources of data include the TCGA serous cystadenocarcinoma (OV) study, the GTEx program surveying gene expression in many normal tissue types, the Illumina BodyMap 2.0 dataset, and the Cancer Cell Line Encyclopedia. Several additional ovarian tumors were profiled by RNA-seq and ribosome footprinting to assess evidence for translation.
We applied a consistent splicing-aware short-read processing workflow to all samples, using the same software versions, genome assembly and transcriptome reference sequences for all samples. Summary information about each splice junction, such as the number of supporting reads and maximum number of nucleotides sequenced on either side of each junction, was gathered into a large overview matrix with one row per splicing event and one column per sample.
We gathered from public repositories, or generated locally, RNA-seq data from a large number of source tissues including:
- 100 TCGA OV tumors
- 53 local serous cystadenocarcinoma tumors with at least 70% malignant cells
- 160 GTEx samples from 26 normal tissues
- 16 Illumina BodyMap 2.0 normal tissues (each profiled with single- and paired-end reads)
- 10 local benign ovarian tumors
- 6 CCLE ovarian cancer cell lines
All data were aligned with TopHat 2.0.9, in conjunction with Bowtie 2.1.0, using the UCSC hg19 human reference assembly as distributed with the Illumina iGenomes release of March 2013. TopHat was allowed to infer unannotated splicing events de novo. To reduce false positive inferences, TopHat was also provided with a collection of known gene models from the same iGenomes hg19 release.
The resulting splice-junction summaries, collected from TopHat junctions.bed files, were combined into a single table with one column for each sample and one row for each observed splice junction. Each cell contains the unfiltered count of reads spanning a particular splice junction in a given sample. We then applied several filters to suppress inferred splice junctions that are likely to result from misalignment of reads. First, we applied a very liberal prevalence and abundance filter, requiring at least two independent samples to support each junction with at least five reads; this excludes junctions supported by only a handful of reads in a small number of samples. Second, we required that each splice junction alignment be supported by at least 15 nt of aligned sequence in each flanking exon; this eliminates many spurious junctions with little alignment support, at the cost of excluding a small number of potential alignments involving microexons. Finally, we removed splicing events in regions known to be difficult to resolve correctly without special-purpose software or sample preparation, such as the immunoglobulin loci on chromosomes 2, 14, and 22. Each of these steps eliminates predominantly de novo inferred splice junctions.
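The table construction and the first two filters can be illustrated with a minimal Python sketch. It is not the workflow's actual code. It assumes the standard TopHat junctions.bed layout, in which the score column holds the number of spanning reads and the two block sizes hold the maximum overhang on each side of the junction. It also assumes that the prevalence filter means at least five reads in each of at least two samples, and that overhang support is taken as the best value seen in any sample. The function names and default thresholds are illustrative only.

```python
from collections import defaultdict


def parse_junctions_bed(lines):
    """Yield (junction_key, read_count, left_overhang, right_overhang)
    from the lines of one TopHat junctions.bed file."""
    for line in lines:
        if not line.strip() or line.startswith("track"):
            continue  # skip blank lines and the track header
        f = line.rstrip("\n").split("\t")
        chrom, start, end = f[0], int(f[1]), int(f[2])
        count, strand = int(f[4]), f[5]
        left, right = (int(x) for x in f[10].rstrip(",").split(","))
        # The intron lies between the two BED blocks.
        key = (chrom, start + left, end - right, strand)
        yield key, count, left, right


def build_matrix(samples):
    """samples: dict of sample name -> iterable of BED lines.
    Returns (counts, overhangs): counts[junction][sample] is the read count;
    overhangs[junction] is [max_left, max_right] over all samples."""
    counts = defaultdict(dict)
    overhangs = defaultdict(lambda: [0, 0])
    for sample, lines in samples.items():
        for key, count, left, right in parse_junctions_bed(lines):
            counts[key][sample] = count
            overhangs[key][0] = max(overhangs[key][0], left)
            overhangs[key][1] = max(overhangs[key][1], right)
    return counts, overhangs


def filter_junctions(counts, overhangs, min_samples=2, min_reads=5,
                     min_overhang=15):
    """Keep junctions passing the prevalence/abundance and overhang filters
    (as interpreted in the assumptions stated above)."""
    kept = {}
    for key, per_sample in counts.items():
        n_supporting = sum(1 for c in per_sample.values() if c >= min_reads)
        if n_supporting >= min_samples and min(overhangs[key]) >= min_overhang:
            kept[key] = per_sample
    return kept
```

The region-based filter (for example, the immunoglobulin loci) is omitted here because it would require a list of excluded genomic intervals that is not given in this description.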
Public mRNA and EST sequence resources provide another rich source of information about splicing in a wide range of tissues, tumors, and cell lines. We have characterized several million such sequences by broad categories of tissue origin (normal tissue, tumor tissue, cell line, or non-cancer disease) and developmental stage (adult, fetal/embryonic, or pooled). Each row in the splice junction array is also annotated with the number of supporting mRNA and EST sequences falling into each of these categories. This information may be used to, for example, exclude splice junctions sampled by ESTs from normal tissue or to provide support for a novel junction inferred by RNA-seq that is also supported by a tumor-derived EST.
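As a rough illustration of how these annotations might be used, the Python sketch below labels a junction according to its mRNA/EST support. The category keys ("normal", "tumor") and the returned labels are hypothetical names chosen for this example; they are not the column names used in the dataset.

```python
def classify_est_support(est_counts):
    """Label a junction from its mRNA/EST support.

    est_counts: dict mapping a tissue-origin category to the number of
    supporting mRNA/EST sequences. The keys "normal" and "tumor" are
    hypothetical names used only in this sketch.
    """
    if est_counts.get("normal", 0) > 0:
        return "seen_in_normal"      # candidate for exclusion
    if est_counts.get("tumor", 0) > 0:
        return "tumor_supported"     # independent support for a novel junction
    return "no_est_support"


def tumor_supported_junctions(annotations):
    """Return junction ids with tumor-derived support and no normal-tissue support.

    annotations: dict mapping junction id -> est_counts dict.
    """
    return sorted(
        junction for junction, est_counts in annotations.items()
        if classify_est_support(est_counts) == "tumor_supported"
    )
```

A stricter or looser rule (for example, also treating cell-line ESTs as support) would be an equally reasonable choice; the sketch only shows the shape of such a filter.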
NanoString Profiling of MUC16 in Serous Ovarian Cancer, Benign Ovarian Tumor and Normal Tissues
Christopher Kemp, Ph.D.
This dataset contains NanoString nCounter gene expression profiles of serous ovarian cancer tumors, benign ovarian tumors, and various normal tissues.
By interrogating RNA-seq data from the TCGA and local samples, we identified previously unreported isoforms of MUC16, the gene that encodes CA125. To characterize the relative abundance of these variants in large numbers of tissues, we designed custom NanoString nCounter assays to quantify five novel exons and eight novel splice variants of MUC16. The panels also include probes for previously reported MUC16 exons and splice variants, as well as for two cell-type markers: epithelial cell adhesion molecule (EPCAM) for epithelial cells and protein tyrosine phosphatase, receptor type C (PTPRC, also known as CD45) for hematopoietic cells. These markers help gauge the relative amounts of different cell types in each sample.
For the NanoString assays presented here, total RNA was extracted from 28 serous ovarian cancer tumors, 1 ovarian cancer cell line (OVCAR3), 8 benign ovarian tumors, and a panel of 16 normal tissues (acquired commercially from Ambion and Agilent). 150 ng of total RNA from each sample was used as input to the standard NanoString Gene Expression Assay protocol. Data presented here are raw counts for each probe in each sample. Two files are provided, one for the exon assay and the other for the splicing assay, since these two assays must be run separately.
Spectral Nature of Fusion Transcripts in Serous Ovarian Cancer
Christopher Kemp, Ph.D.
To investigate the nature of fused transcripts in serous ovarian cancer (SOC), we extracted a number of features and summaries from RNA-seq and ribosome protected tag (RPT) sequencing data. These features include per-nucleotide read coverage histograms and read pairs that map discordantly, with one mate aligning to another chromosome or in an unexpected orientation.
Nearly all of the locally derived RNA-seq samples were prepared with the Illumina TruSeq kit and protocol.
We started with at least 500 ng of total RNA and used the TruSeq kit to capture polyA+ mRNA, fragment the mRNA, convert it to double-stranded cDNA, perform end repair, and add the adapters used for cluster generation and sequencing. Local sequence was generated with paired-end 50 nt reads.
For locally derived ribosome protected tag sequencing (files with "RPT" in the names) we used Nextera low-input kits.
RNA-seq data from locally obtained SOC tissues, benign tumors, and the OVCAR3 cell line were aligned with the TopHat 2.0.9 splice-aware short-read aligner, in conjunction with Bowtie 2.1.0. All alignments used the UCSC hg19 human reference assembly distributed with the Illumina iGenomes release of March 2013. To facilitate comparison, TCGA OV RNA-seq data were downloaded and processed with the same software versions, reference assembly, and reference annotations.
For each sample, various features of interest were extracted. One such feature set includes annotated lists of "discordant mates", where one read in a pair aligns to a gene of interest while its mate aligns to a different gene or even chromosome. A second set of features is, for each gene and sample of interest, a coverage histogram showing the number of reads covering each nucleotide. Coverage is reported separately for different classes of reads that might suggest a fusion, such as singletons (where a mate does not align) or reads whose mate is in a different gene.
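The per-class coverage histograms can be sketched in Python as follows. This is a simplified illustration, not the code used to produce the files. The `AlignedRead` record, the class labels, and the function names are hypothetical. Each read is treated as a single contiguous block, so spliced alignments are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignedRead:
    start: int                 # 0-based, inclusive
    end: int                   # 0-based, exclusive
    gene: str                  # gene the read aligns to
    mate_gene: Optional[str]   # None when the mate did not align


def read_class(read):
    """Assign a read to one of the classes reported separately."""
    if read.mate_gene is None:
        return "singleton"
    if read.mate_gene != read.gene:
        return "mate_in_other_gene"
    return "concordant"


def coverage_by_class(reads, gene, gene_start, gene_end):
    """Per-nucleotide coverage over [gene_start, gene_end), one histogram
    per read class, for reads aligned to the given gene."""
    length = gene_end - gene_start
    coverage = {}
    for read in reads:
        if read.gene != gene:
            continue
        hist = coverage.setdefault(read_class(read), [0] * length)
        # Clip the read to the gene interval before counting.
        for pos in range(max(read.start, gene_start), min(read.end, gene_end)):
            hist[pos - gene_start] += 1
    return coverage
```

A region where "mate_in_other_gene" or "singleton" coverage rises while concordant coverage drops is the kind of pattern that might suggest a fusion breakpoint.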