## Processing GTEx for tissue-specific gene expression

The Genotype-Tissue Expression project (GTEx) RNA-sequenced [1]:

> 1641 samples from 175 individuals representing 43 sites: 29 solid organ tissues, 11 brain subregions, whole blood, and two cell lines: Epstein-Barr virus–transformed lymphocytes (LCL) and cultured fibroblasts from skin.

The data are available online. Specifically, we are interested in the GTEx_Analysis_V4_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz file, which contains RPKM expression values for each sample. We would like to calculate a single expression value for each gene-tissue pair. Expression values should be comparable across tissues, not just within tissues.
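As a starting point, here is a minimal sketch of reading a GCT file with only the Python standard library. It assumes the standard GCT v1.2 layout (a version line, a dimensions line, then a header row of `Name`, `Description`, and one column per sample). The function names `read_gct` and `read_gct_gz` and the toy gene and sample IDs are made up for illustration.

```python
import gzip
import io


def read_gct(handle):
    """Parse a GCT v1.2 text stream into (sample_ids, {gene_id: [values]})."""
    handle.readline()  # version line, e.g. "#1.2"
    n_genes, n_samples = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    sample_ids = header[2:]  # columns after Name and Description
    rpkm = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.rstrip("\n").split("\t")
        rpkm[fields[0]] = [float(value) for value in fields[2:]]
    if len(sample_ids) != n_samples or len(rpkm) != n_genes:
        raise ValueError("GCT dimensions do not match the header")
    return sample_ids, rpkm


def read_gct_gz(path):
    """Same as read_gct, but for a gzip-compressed file on disk."""
    with gzip.open(path, "rt") as handle:
        return read_gct(handle)


# Toy input with hypothetical gene and sample identifiers.
toy = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "ENSG01\tGENEA\t0.0\t1.5\t3.0\n"
    "ENSG02\tGENEB\t10.0\t0.0\t2.5\n"
)
samples, rpkm = read_gct(io.StringIO(toy))
```

For the real download, `read_gct_gz` would be called with the path to the `.gct.gz` file. Holding the full matrix in plain Python lists may be slow, so a dataframe library is probably the better choice in practice.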

We will post our questions here. Advice appreciated.

Daniel Himmelstein Researcher

# Mapping GTEx sites to Uberon and CL

We are using Uberon [1] terms to identify anatomical structures and Cell Ontology (CL) [2] terms to identify cell types. Thus, we need to map GTEx sites to their corresponding ontology terms.

From the sample attribute documentation (GTEx_Data_V4_Annotations_SampleAttributesDS.txt), we identified 54 sites using the SMTSD attribute. I have mapped about half of the sites to Uberon. The remainder would benefit from a skilled anatomist or GTEx consortium member.
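A rough sketch of how the sites could be enumerated, assuming the annotation file is tab-delimited with `SAMPID` and `SMTSD` columns (the two attributes named in this thread). The sample IDs in the toy input are invented, and `sites_by_sample` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def sites_by_sample(handle):
    """Return {SAMPID: SMTSD} from a tab-delimited sample attribute stream."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["SAMPID"]: row["SMTSD"] for row in reader}


# Toy annotation table; the sample IDs are hypothetical.
toy_attributes = (
    "SAMPID\tSMTSD\n"
    "GTEX-AAAA-0001\tWhole Blood\n"
    "GTEX-AAAA-0002\tLung\n"
    "GTEX-BBBB-0001\tLung\n"
    "GTEX-BBBB-0002\tSkin - Sun Exposed (Lower leg)\n"
)
sample_to_site = sites_by_sample(io.StringIO(toy_attributes))

# Number of samples per distinct site string.
site_counts = Counter(sample_to_site.values())
```

The keys of `site_counts` are the distinct site strings that need an ontology mapping.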

Bounty: Add or correct our mappings using this spreadsheet and put your Thinklab username. Then leave a comment in this discussion, and we will rate its value $\geq 4 \times n$, where n is the number of mappings provided.

Some additional sample site information is available in Table S1 (p. 58) of the supplement.

Daniel Himmelstein Researcher

# Handing over GTEx processing responsibility to Bgee

We have decided to use Bgee for tissue-specific transcript presence, over-, and under-expression. Bgee doesn't currently include GTEx data but will soon.

Therefore, we are not going to proceed with GTEx data directly for this project. However, we did already process the data into a usable gene × site format (notebook, download). We converted genes to Entrez GeneIDs. The sites are still in GTEx strings rather than Uberon terms. Expression values are log-transformed. Check out the notebook for a visualization of tissue-specific transcript abundance distributions.
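For readers who do not want to open the notebook, the collapse into a gene × site table can be sketched as follows. This is not necessarily what the notebook does: the transform log2(1 + RPKM) and the use of the mean across samples are both assumptions made for illustration, and `gene_by_site` is a hypothetical name.

```python
import math
from collections import defaultdict


def gene_by_site(sample_ids, rpkm, sample_to_site):
    """Average log2(1 + RPKM) over the samples of each site, per gene."""
    # Group column indices by site; samples without a site are skipped.
    columns = defaultdict(list)
    for index, sample in enumerate(sample_ids):
        site = sample_to_site.get(sample)
        if site is not None:
            columns[site].append(index)

    table = {}
    for gene, values in rpkm.items():
        table[gene] = {
            site: sum(math.log2(1.0 + values[i]) for i in indices) / len(indices)
            for site, indices in columns.items()
        }
    return table


# Toy inputs with hypothetical identifiers.
sample_ids = ["S1", "S2", "S3"]
rpkm = {"G1": [0.0, 3.0, 7.0]}
sample_to_site = {"S1": "Lung", "S2": "Lung", "S3": "Liver"}

expression = gene_by_site(sample_ids, rpkm, sample_to_site)
```

Because the same transform and summary are applied to every site, the resulting values sit on one scale across tissues. Whether that is enough for cross-tissue comparability is a separate question.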

Bounty: we will keep the GTEx–Uberon mapping bounty going until June 25, 2015 because these mappings will help the Bgee team and eventually us. @chrismungall, do you want to add a comment here, so you can get rewarded for your 3 mappings?

I think the Bgee team will do a great job. Just a few general comments:

Most of the GTEx terms correspond to 'wild-type' structures as can be found in Uberon/CL. There are, however, two subclasses of skin: exposed and unexposed. We could add these as subclasses in Uberon, but this would be unusual. It would be better either to post-compose these or to have some kind of ancillary 'sample' ontology where this is composed.

For 'hippocampus', the safest option is to map to the broadest term, 'hippocampal formation', but if it can be shown that the GTEx sample excludes bits of the dentate gyrus, then the more specific 'Ammon's horn' can be used.

Finally, it's always best when ontologies are used prospectively rather than retrospectively; maybe future rounds of GTEx will follow the lead of FANTOM5 and ENCODE in doing this.

IMO, the "exposed/unexposed" state is an experimental factor, and should not be annotated using a new anatomical term, it should be an additional "column" in an annotation (using, e.g., EFO).

Daniel, we will discuss next week during our lab meeting the timescale to annotate GTEx data, but as you said, it should be fast. Our problem is that we requested access to the data, and are waiting for an answer.

We could start working on the mapping you have, but we'd rather go through the information for each sample, to check for normality, etc. This is how we usually proceed.
(Do they provide GEO or SRA identifiers for the samples, BTW?)

• Daniel Himmelstein: What does SRA stand for?

• Frederic Bastian: Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra). This is where we usually download the raw data from. But I think GTEx keeps them private because of their "on demand" data sharing policy...

Daniel Himmelstein Researcher

> do they provide GEO or SRA identifiers for the samples BTW?

@fbastian, I do not see any other sample identifiers than SAMPID (GTEx Public Sample ID) in the sample attributes documentation spreadsheet. The IDs are formatted like GTEX-N7MS-0007-SM-2D7W1 — not sure whether that corresponds with other databases.
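For cross-referencing, the ID can at least be split into its parts. The sketch below assumes that the first two hyphen-separated fields identify the donor (so `GTEX-N7MS` here). That is an assumption about the ID scheme rather than something stated in the documentation quoted above, and `donor_id` is a hypothetical helper.

```python
def donor_id(sampid):
    """Return the presumed donor prefix of a GTEx sample ID.

    Assumption: the first two hyphen-separated fields identify the donor.
    """
    parts = sampid.split("-")
    if len(parts) < 3 or parts[0] != "GTEX":
        raise ValueError(f"unexpected sample ID format: {sampid!r}")
    return "-".join(parts[:2])


example = donor_id("GTEX-N7MS-0007-SM-2D7W1")
```

Grouping samples by this prefix would give the number of individuals contributing to each site.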

> We could start working on the mapping you have, but we'd rather go through the information for each sample, to check for normality, etc.

@fbastian, do whatever is best for you! And let it be known that your painstaking and thorough integration efforts are appreciated.

> There are, however, two subclasses of skin: exposed and unexposed.

@chrismungall, by post-compose do you mean contacting GTEx and asking for more details on the skin sample sites? That seems the best to me as I assume the sample collectors had specific instructions. The skin sites are specified as suprapubic for sun unexposed and lower leg for sun exposed.

@dhimmel, thanks for the information. Did your lab formally request access to the data to get the actual annotations? (GTEx_Data_V4_Annotations_SampleAttributesDS.txt in your notebook)

Otherwise, Chris is speaking about ontology term post-composition, a way of creating a new ontology concept on-the-fly, that doesn't have any identifier or IRI ("anonymous class expression"), and that is made of the "composition" of several other terms. That would allow you to create on the fly a new concept for "exposed skin".

See for instance, in zebrafish ontology: https://zfin.org/action/ontology/post-composed-term-detail?superTermID=ZFA:0001117&subTermID=ZFA:0000155
There is no term "post-vent region somite" in the ontology, but the concept is represented by using the existing terms "post-vent region" and "somite".
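To make the idea concrete, a post-composed term can be held as a plain pair of existing term IDs with no identifier of its own. The sketch below pulls the two IDs out of the ZFIN URL above; the class and function names are made up for illustration, and nothing here reflects how ZFIN or Bgee store such terms internally.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse


@dataclass(frozen=True)
class PostComposedTerm:
    """An anonymous concept built from two existing ontology terms."""
    super_term_id: str
    sub_term_id: str


def from_zfin_url(url):
    """Build a PostComposedTerm from the query string of a ZFIN detail URL."""
    query = parse_qs(urlparse(url).query)
    return PostComposedTerm(query["superTermID"][0], query["subTermID"][0])


url = (
    "https://zfin.org/action/ontology/post-composed-term-detail"
    "?superTermID=ZFA:0001117&subTermID=ZFA:0000155"
)
term = from_zfin_url(url)
```

Because the dataclass is frozen, instances are hashable and can serve as dictionary keys for annotations, which is roughly what having no identifier or IRI of its own requires in practice.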

Daniel Himmelstein Researcher

@fbastian, GTEx_Data_V4_Annotations_SampleAttributesDS.txt is available from the GTEx download page which requires an account. However, this direct link circumvents the login page. How long has it been since you submitted the data access request?

Not very long, a week or so.

Just to anticipate, do they provide more detailed information somewhere else, like, e.g., http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM81022 ? (extraction protocols, detailed information about the anatomical structure, etc)

Daniel Himmelstein Researcher

> do they provide more detailed information somewhere else

No idea. I did email GTEx support with a link to this thread, so perhaps they'll provide some clarification.

FYI, our curator Anne Niknejad has started editing your mapping; she will also create the issues on the Uberon tracker to request new terms.

Daniel Himmelstein Researcher

On June 17th I emailed gtex-help@broadinstitute.org asking for a GTEx–Uberon mapping. Today, Tim Sullivan responded and attached this mapping file.

He didn't mention the methodology used, but you may want to crosscheck your work.

I've reproduced Tim's mapping below for quick reference:

| Tissue Site Detail | Uberon Code | Uberon Term |
|---|---|---|
| Artery - Aorta | 0001496 | ascending aorta |
| Artery - Coronary | 0001621 | coronary artery |
| Artery - Tibial | 0001323 | tibial nerve |
| Artery - Tibial | 0007610 | tibial artery |
| Brain - Amygdala | 0001876 | amygdala |
| Brain - Anterior cingulate cortex (BA24) | 0009835 | anterior cingulate cortex |
| Brain - Caudate (basal ganglia) | 0001873 | caudate nucleus |
| Brain - Cerebellar Hemisphere | 0002037 | cerebellum |
| Brain - Cerebellum | 0002037 | cerebellum |
| Brain - Cortex | 0001870 | frontal cortex |
| Brain - Frontal Cortex (BA9) | 0009834 | dorsolateral prefrontal cortex |
| Brain - Hippocampus | 0001954 | Ammon's horn |
| Brain - Hypothalamus | 0001898 | hypothalamus |
| Brain - Nucleus accumbens (basal ganglia) | 0001882 | nucleus accumbens |
| Brain - Putamen (basal ganglia) | 0001874 | putamen |
| Brain - Spinal cord (cervical c-1) | 0006469 | first cervical spinal cord segment |
| Brain - Substantia nigra | 0002038 | substantia nigra |
| Breast - Mammary Tissue | 0008367 | breast epithelium |
| Cells - EBV-transformed lymphocytes | EFO_0000572 | lymphoblast |
| Cells - Leukemia cell line (CML) | EFO_0002067 | K562 |
| Cells - Transformed fibroblasts | EFO_0000496 | fibroblast |
| Cervix - Ectocervix | 0012249 | ectocervix |
| Cervix - Endocervix | 0000458 | endocervix |
| Colon - Sigmoid | 0001159 | sigmoid colon |
| Colon - Transverse | 0001157 | transverse colon |
| Esophagus - Gastroesophageal Junction | 0004550 | gastroesophageal sphincter |
| Esophagus - Mucosa | 0006920 | esophagus squamous epithelium |
| Esophagus - Muscularis | 0004648 | esophagus muscularis mucosa |
| Fallopian Tube | 0003889 | fallopian tube |
| Heart - Atrial Appendage | 0006631 | right atrium auricular region |
| Heart - Left Ventricle | 0006566 | left ventricle myocardium |
| Kidney - Cortex | 0001225 | cortex of kidney |
| Liver | 0001114 | right lobe of liver |
| Lung | 0008952 | upper lobe of left lung |
| Minor Salivary Gland | 0006330 | anterior lingual gland |
| Muscle - Skeletal | 0011907 | gastrocnemius medialis |
| Nerve - Tibial | 0001323 | tibial nerve |
| Nerve - Tibial | 0007610 | tibial artery |
| Ovary | 0002119 | left ovary |
| Pancreas | 0001150 | body of pancreas |
| Pituitary | 0000007 | pituitary gland |
| Prostate | 0002367 | prostate gland |
| Skin - Not Sun Exposed (Suprapubic) | 0001416 | skin of abdomen |
| Skin - Sun Exposed (Lower leg) | 0001511 | skin of leg |
| Small Intestine - Terminal Ileum | 0001211 | Peyer's patch |
| Spleen | 0002106 | spleen |
| Stomach | 0000945 | stomach |
| Testis | 0000473 | testis |
| Thyroid | 0002046 | thyroid gland |
| Uterus | 0000995 | uterus |
| Vagina | 0000996 | vagina |
| Whole Blood | 0013756 | venous blood |
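A crosscheck against another mapping could look roughly like the sketch below. It uses a handful of rows copied from the mapping above. The `UBERON:` and `EFO:` CURIE prefixes are an assumed normalization, the `ours` dictionary is a made-up stand-in for a second mapping, and all function names are hypothetical.

```python
def to_curie(code):
    """Normalize a code from the mapping into a prefixed identifier."""
    if code.startswith("EFO_"):
        return code.replace("_", ":", 1)
    return "UBERON:" + code


def build_mapping(rows):
    """Collect {site: set of CURIEs} from (site, code, label) rows."""
    mapping = {}
    for site, code, _label in rows:
        mapping.setdefault(site, set()).add(to_curie(code))
    return mapping


def crosscheck(ours, theirs):
    """Return {site: (ours, theirs)} for every site where the two disagree."""
    report = {}
    for site in sorted(set(ours) | set(theirs)):
        a, b = ours.get(site, set()), theirs.get(site, set())
        if a != b:
            report[site] = (a, b)
    return report


# A few rows copied from the GTEx-provided mapping.
rows = [
    ("Artery - Tibial", "0001323", "tibial nerve"),
    ("Artery - Tibial", "0007610", "tibial artery"),
    ("Brain - Hippocampus", "0001954", "Ammon's horn"),
    ("Cells - EBV-transformed lymphocytes", "EFO_0000572", "lymphoblast"),
]
theirs = build_mapping(rows)

# Hypothetical second mapping to compare against.
ours = {
    "Artery - Tibial": {"UBERON:0007610"},
    "Brain - Hippocampus": {"UBERON:0001954"},
}

# Sites that received more than one term deserve a manual look.
ambiguous = sorted(site for site, terms in theirs.items() if len(terms) > 1)
differences = crosscheck(ours, theirs)
```

On these rows, `ambiguous` flags "Artery - Tibial", which is listed with both a nerve and an artery term, and `differences` lists the sites where the two mappings disagree or where one has no entry.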

Some seem slightly more specific than the label suggests. Sometimes the increased specificity is trivial (i.e., their ovary sample was from a left ovary), sometimes relevant (their representative skeletal muscle sample was from gastrocnemius medialis; the esophagus mucosa sample was taken from the epithelium rather than the lamina propria).

Actually, for some mappings we needed to request new terms in Uberon, e.g.: https://github.com/obophenotype/uberon/issues/725

Things will be slow until mid-August on our side.

To keep you posted: we were recently given access to the GTEx data, so we have started annotating/analyzing the data. We hope to have a new release of Bgee including these data in about 2 months.

• Daniel Himmelstein: Exciting, thanks for the update. On another note, I wasn't able to find any licensing information on the Bgee website, which technically means all rights reserved. We're trying to compile licenses for each resource we use. It would be great if you could add a license.

• Frederic Bastian: Yep, I wanted to reply to you after checking all our data sources. I think we are going to use a CC BY license, but I'll contact you when I can confirm that for sure.

Status: Declined
Cite this as
Daniel Himmelstein, Chris Mungall, Frederic Bastian (2015) Processing GTEx for tissue-specific gene expression. Thinklab. doi:10.15363/thinklab.d82