Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Tissue-specific gene expression resources


We would to know the expression level of each gene in as many tissues and cell types as possible. Humans only for now. What are the best resources and methods to go about this?

Here are some relevant resources we compiled (in order of preference):

  1. The Genotype-Tissue Expression project (GTEx) [1]
  2. Baseline Expression Atlas (BEA) [2]
  3. Human Protein Atlas (HPA) [3]
  4. GNF Gene Expression Atlas (BodyMap) [4]
  5. Human Proteome Map (HPM) [5]
  6. Gene Enrichment Profiler (GEP) [6]
  7. A global map of human gene expression (E-MTAB-62) [7]
  8. Unveiling RNA Sample Annotation (USRA) [8]
  9. RNA-Seq Atlas [9]

Additional resources found since initial post:

@marinasirota recommended GTEx because of the large number samples per tissue and the use of RNA-seq.

Bgee

On a GitHub issues discussion on mapping GTEx data to Uberon, Frederic Bastian suggested Bgee as a database for human tissue-specific expression measurements under normal conditions — exactly what we're looking for.

Bgee was designed for comparative genomics [1] and therefore maps sample sites to standard vocabularies (like Uberon) and contains data for many species (not needed now but may be in the future).

Genes are in Ensembl identifiers which we can easily convert to Entrez GeneIDs. Highly processed and relevant datasets appear to be available:

  • Presence/absence of expression
  • Over-/under-expression across anatomy or life stages

Specifically, we want to create three matrices with Entrez GeneID columns and Uberon/CL rows. We are interested in the adult development stage. The values should binary or continuous allowing us to pick an inclusion threshold later. The three desired matrices of tissue-specific expression are:

  1. Transcript presence in adult
  2. Transcript over-expression compared to other adult human sites
  3. Transcript under-expression compared to other adult human sites

Bgee does not yet include GTEx RNA-Seq data but will integrate this data in the coming months. Question for Bgee team: do you compute transcript abundance from raw data? I am not stoked about the Flux Capacitor software used by GTEx for reasons pointed out by others.

Bgee parameters

The Bgee human downloads are user-friendly and should be easy to process. We just have a few questions:

Transcript presence

For determining expression presence, we want to pick a single and broad developmental stage, such as adult. The propagation up the developmental ontology means we don't need to pick a stage that has been directly assayed by many studies. Post-juvenile adult stage (UBERON:0000113) is used for the differential expression analysis. Which stage should we choose?

Next, what evidence threshold should we require to consider a gene expressed I was thinking requiring call_quality == 'high quality' and expression in {'present', 'low ambiguity'}. We could also use a more complex metric on the complete file.

Differential expression

We would like to include two differential expression edges in our network: one for under-expression and one for over-expression. For the over-expression dataset, is call_quality == 'high quality' and differential_expression == 'over-expression' the right filter?

do you compute transcript abundance from raw data? I am not stoked about the Flux Capacitor software used by GTEx for reasons pointed out by others.

Yes, we do. We map reads to transcriptome using TopHat2; read counts extracted using HTseq; length of longest annotated transcript used.
(edit: woops, you said transcript abundance. We only provide expression data "per gene")

Currently, we do not use TMM normalization between samples for the RPKM values we provide, but we will in the next release. We also want to get away from RPKM values, and provide TPM values (for motivation, see [1]).

To generate differential expression calls, we use TMM normalization, then voom + limma. This has been shown to perform well (see [2]).

We will provide complete documentation by July.

The values should binary or continuous allowing us to pick an inclusion threshold later.

So have you noticed that we also provide RPKM values, not only the qualitative "calls"?

Post-juvenile adult stage (UBERON:0000113) is used for the differential expression analysis. Which stage should we choose?

Post-juvenile in human also includes 13-18 yo, so I don't know if you want to include those. HsapDv:0000087 "human adult stage" would be a "true" adult stage. But I think we don't use it for "differential expression across anatomy", we could correct that if you need it. It should be correctly used for "differential expression across development", if we have the data. And of course it is correctly used for the "presence/absence" calls.

Note that "differential expression" calls are not propagated, so you might still need to examine all stages for "differential expression across development". But for "presence/absence", you can safely rely on the propagation, and only retrieve results for your stage of interest.

You might want to use the "complete" file, you will have more data propagated. Also, if you use the "complete" file, you might decide to rely only on RNA-Seq data, and avoid the "ambiguity" states.

what evidence threshold should we require to consider a gene expressed I was thinking requiring call_quality == 'high quality' and expression in {'present', 'low ambiguity'}. We could also use a more complex metric on the complete file.

Presence low quality seems also to give good results, but here it is up to you to decide whether you want to be more on the false negative side, or more on the false positive side.

For the over-expression dataset, is call_quality == 'high quality' and differential_expression == 'over-expression' the right filter?

I would definitely use the "low quality" as well here. Because the overall call generated is based on a voting system weighted by p-values, so even if it is "low quality" because of conflicting analyses, the best p-value has won anyway.
Again, if you use the "complete" file, you might decide to rely only on RNA-Seq data, and avoid the "ambiguity" states.

Edit: oh, I didn't notice you where mentioning "transcript abundance". We only provide expression data "per gene", not "per transcript".

  • Daniel Himmelstein: My mistake regarding "transcript abundance" — we actually only care about gene abundance.

  • Frederic Bastian: @dhimmel: OK, it should fit your needs then. Let me know if you have any other question.

We have done some initial analyses on the Bgee data (notebook). We stuck with the simple files because:

  • we would like to outsource as much of the difficult decision-making as possible.
  • we want a method that is resilient to changing technologies and Bgee revisions — not a method that is only optimal for a specific Bgee release.

Here are our initial findings.

Gene presence

We chose the filter:

call_quality in {'high quality', 'low quality'} and expression in {'present', 'low ambiguity'}

We chose a permissive filter that is not limited to RNA-Seq. In the future, when Bgee contains more sources, we could increase these thresholds.

The developmental stage "25-44 year-old human stage" (HsapDv:0000090) has the most present genes. "Human adult stage" (HsapDv:0000087) is comparatively quite underpopulated with major organs like heart having no present genes (Figure 1). I was surprised because I thought adult would subsume 25-44 and the expression data would be propagated. @fbastian, any idea what is going on? Should we take 25-44 as our developmental context or should we collapse data across adult developmental stages. Is the differential-expression data lacking if it's only calculated on "post-juvenile adult stage"?

Compared to the distribution of expressed genes in the GNF Expression Atlas (Figure 3), Bgee had much more variation by tissue. This is expected since Bgee integrates diverse data of varied throughput and a is an acceptable sacrifice for broader input data.

Differential expression

We identified differential expression with call_quality in {'low quality', 'high quality'}. Figure 2 shows that we identified a large number of DE-genes across a variety of anatomical entities. Some tissues or cell types, such as leukocytes and testis, have over 10,000 under/over-expressed genes (almost every gene). Is this biologically meaningful or the result of data coverage or other experimental artifacts?

(on a side note, I didn't know IPython/Jupyter, I discovered it thanks to you and I fell in love with it, this is amazing; also, it's great to have other people to investigate our data, and check their quality)

I was surprised because I thought adult would subsume 25-44 and the expression data would be propagated

So, we propagate the data, but in the simple file we only display conditions that were actually observed in experimental data (no conditions appearing from propagation only). It means that we never had an experiment studying "heart" at stage "human adult stage". Which makes sense, because annotating experimental data with such a broad term basically mean "we know it was a heart from an adult, but no idea which age"; we often have more detailed information.

This is why the complete file would work better for you, as we also display conditions from "propagation only"; you should see lots of data for "heart" at "human adult stage". Of note, even in the complete file you get the "global" call generated by us, see column 7 and 8: http://bgee.org/?page=doc&action=call_files#single_expr_complete
So you wouldn't have to do the decision-making, even when using the complete file.

But your problem is interesting, maybe we should "relax" the filtering of conditions in the simple file, I will think about it.
This difference in filtering was explained in the documentation, should we clarify it?

Some tissues or cell types, such as leukocytes and testis, have over 10,000 under/over-expressed genes (almost every gene). Is this biologically meaningful or the result of data coverage or other experimental artifacts?

I think it is meaningful. First, it is not almost every gene, from the complete file (where you can also find genes never shown to have differential expression) you can see that we have about 20,000 genes tested for differential expression; so, in the majority of tissues you get a reasonable percentage of genes differentially expressed. Second, I'm not surprised by the outlier tissues you found, for instance we know that testis and brain have very specific expression. Third, you almost always found more genes under-expressed than over-expressed, and this is actually an interesting pattern, that we have observed by other methods.

I will discuss this with my colleagues to be sure they get to the same conclusion, and will get back to you. But I think everything looks good.

Completed Bgee analysis — version 0

Tissue-specific gene presence

we propagate the data, but in the simple file we only display conditions that were actually observed in experimental data (no conditions appearing from propagation only)

Okay I switched to the complete file for presence/absence. The updated Figure 1 makes much more sense now. We chose to go with "human adult stage" (HsapDv:0000087) as the developmental stage, which now has broad coverage across tissues, however with fewer present genes than "life cycle" as expected. We converted to entrez genes and ended up with a matrix of 18,997 genes (16,278 coding) × 666 anatomical entities (download). On average, 41.1% of genes were expressed in a given anatomical entity.

But your problem is interesting, maybe we should "relax" the filtering of conditions in the simple file, I will think about it. This difference in filtering was explained in the documentation, should we clarify it?

I misinterpreted the documentation by assuming that if a developmental stage had any annotations, propagated values would be provided for all anatomies of that stage. Instead, propagated values are only included for stage–anatomy combinations with annotations. Perhaps the documentation should make it more clear that many use cases will require the complete file due to this issue. Another consideration is that processing the complete file was consuming up to 80 GB of RAM at points — not an issue personally but could be for others.

Tissue-specific differential expression

We converted the differential expression to entrez genes and ended up with a matrix of 18,620 genes (16,184 coding) × 98 anatomical entities (notebook, download).

General Bgee feedback

@fbastian, in the presence download, for a given gene–stage–anatomy combination is there at most one row? For the anatomy-based differential expression download, for a given gene–anatomy combination is there at most one row? Better documentation of what uniquely defines an observation (row) would be helpful.

Since the downloaded zip files only contain a single file, I think gzip compression makes more sense.

Finally, I prefer lowercase column names with underscores rather than spaces. This naming convention avoids frustrating R munging and enables unquoted variable reference in R and python. Understandably, you may not want to change for compatibility issues.

Some tissues or cell types, such as leukocytes and testis, have over 10,000 under/over-expressed genes (almost every gene). Is this biologically meaningful or the result of data coverage or other experimental artifacts?

I discussed this with my colleagues as promised, they agree that everything looks fine. We just get to the conclusion that maybe we could use a FDR correction over all analyses (currently, p-values are corrected on a "per analysis" basis), to decrease the number of differentially expressed genes in the most studied organs.

matrix of 18,997 genes (16,278 coding) × 666 anatomical entities

Great, I just want to warn you that lots of these structures are not independent (e.g., "cerebellum" is not independent from "brain"). I don't know what type of analyses you plan, but this can sometimes be problematic. If you need it, I can provide you a script accepting a list of organs as argument, and returning only the most-precise "independent" organs.

in the presence download, for a given gene–stage–anatomy combination is there at most one row

Yes.

For the anatomy-based differential expression download, for a given gene–anatomy combination is there at most one row?

No, because an analysis could compare organ A, B, C at stage embryo, another one compare the organs A, B, C at stage adult. So, for a given gene, you would have two entries for organ A: one at stage embryo, another one at stage adult.
So, it is as for presence download, it is at most one row for a given gene–stage–anatomy combination.

I think gzip compression makes more sense.

Not for Windows users (yes, it exists :p)

Thank you for your feedback, we will take it into account, it is much appreciated. Notably we will remove the spaces in header, will update the documentation, and will change the filtering in simple files.

we could use a FDR correction over all analyses

I support the correction for multiple testing.

So, it is as for presence download, it is at most one row for a given gene–stage–anatomy combination.

I thought all rows in the "Over-/Under-expression across anatomy" download were for "post-juvenile adult stage", so row uniqueness depends only on gene and anatomy?

lots of these structures are not independent (e.g., "cerebellum" is not independent from "brain"). I don't know what type of analyses you plan, but this can sometimes be problematic.

We enforce uniqueness for the metanodes where we are predicting edges (compounds and diseases). We are not planning on eradicating term overlap for other metanodes such as GO Domains, Anatomy, and Symptom. The consequence of this duplicity is unknown and something that HNEP researchers should eventually confront.

Our method downweights paths through high-degree nodes, which reduces the impact of highly-redundant supernodes such as "anatomical entity", "anatomical structure", and "anatomical system". However, our implementation has not been optimized and may get bogged down by all these expression edges. Therefore we may consider removing anatomies that are too broad to be meaningful. We could also consider pruning for independence.

If you need it, I can provide you a script accepting a list of organs as argument, and returning only the most-precise "independent" organs.

Yes, I don't need it but want to see the script out of interest. Primarily, I am curious about how your algorithm, since I often encounter these problems.

Bgee discussion migrating

I started a designated discussion for Bgee. Please direct further Bgee attention there.

 
Status: Completed
Views
825
Topics
Referenced by
Cite this as
Daniel Himmelstein, Frederic Bastian (2015) Tissue-specific gene expression resources. Thinklab. doi:10.15363/thinklab.d81
License

Creative Commons License

Share