Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Workshop to analyze LINCS data for the Systems Pharmacology course at UCSF


Today, I'm teaching a workshop for the Systems Pharmacology course at UCSF. The course primarily consists of first year students in the Pharmaceutical Sciences and Pharmacogenomics graduate program.

The topic of my workshop is "Big data". Therefore, I thought a perfect activity would be to analyze the transcriptional perturbation data from LINCS L1000. And stars have aligned: first, we've just released version 2 of our consensus signatures; second, we recently noticed some counterintuitive occurrences in the genetic perturbation data.

Hence, I've designed a set of questions. Each pupil will be assigned a question. The pupils will then use R to attempt to answer the question. At the end of the three hour workshop, we will encourage pupils to post their findings as a comment on this discussion.

I'm hoping to teach my R best practices as well as introduce several packages for modern data science. We will strive for the following workflow in R (not every step is needed for each question):

  1. Read the appropriate file into a dataframe using readr. The readr::read_tsv() function should come in handy. Datasets are available on GitHub (readr should be able to read from the raw dataset URL).
  2. Tidy the dataframe using tidyr. The tidyr::spread() function will help convert the wide (matrix) format to a long format.
  3. Manipulate the dataframe using dplyr. Common operations here will be dplyr::filter and dplyr::mutate.
  4. Join dataframes using dplyr::inner_join() or dplyr::left_join().
  5. Answer the question, either by using dplyr::group_by() followed by dplyr::summarize() or by using ggplot2 to visualize the results.

Questions will follow!

Questions

Here are the 13 questions for the workshop. They all focus on understanding the transcriptional response to genetic perturbation.

  1. How many knockdowns significantly downregulated their target gene (expected)? How many knockdowns significantly upregulated their target gene (unexpected)?

  2. How many overexpressions significantly upregulated their target gene (expected)? How many overexpressions significantly downregulated their target gene (unexpected)?

  3. How many genes were never significantly dysregulated by any knockdown perturbation? Report this number for both the measured and inferred gene sets.

  4. How many genes were never significantly dysregulated by any knockdown perturbation? Report this number for both the measured and inferred gene sets. (same as 3 by accident)

  5. Which ten genes were most frequently significantly downregulated by gene knockdowns? How many knockdowns significantly downregulated these genes? How many knockdowns significantly upregulated these genes?

  6. Which ten genes were most frequently significantly upregulated by gene knockdowns? How many knockdowns significantly upregulated these genes? How many knockdowns significantly downregulated these genes?

  7. Which ten genes were most frequently significantly downregulated by gene overexpression? How many overexpressions significantly downregulated these genes? How many overexpressions significantly upregulated these genes?

  8. Which ten genes were most frequently significantly upregulated by gene overexpression? How many overexpressions significantly upregulated these genes? How many overexpressions significantly downregulated these genes?

  9. For knockdown perturbations, what is the correlation between number of significantly down and upregulated measured genes.

Dataset documentation will follow.

Datasets

The above questions can all be answered using the following three datasets.

Gene information genes.tsv

This dataset contains which genes have regulation scores. It also contains whether the gene's expression was directly measured or imputed. The raw dataset is available at:

https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv

Below is a preview:

entrez_gene_idstatussymboltype_of_genedescription
100imputedADAprotein-codingadenosine deaminase
1000imputedCDH2protein-codingcadherin 2, type 1, N-cadherin (neuronal)
10000imputedAKT3protein-codingv-akt murine thymoma viral oncogene homolog 3

Genes dysregulated by knockdowns dysreg-knockdown.tsv

This dataset contains significantly dysregulated genes due to knockdown perturbations. The raw dataset is available at:

https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv

Below is a preview:

perturbagenentrez_gene_idz_scoresymbolstatusdirectionnlog10_bonferroni_pval
2133-5.495ADMimputeddown3.596
2501-4.317ALDH7A1measureddown1.811
29915-5.579ARNT2measureddown4.626

Genes dysregulated by overexpressions dysreg-overexpression.tsv

This dataset contains significantly dysregulated genes due to overexpression perturbations. The raw dataset is available at:

https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv

Below is a preview:

perturbagenentrez_gene_idz_scoresymbolstatusdirectionnlog10_bonferroni_pval
2991-4.687CDC20measureddown2.567
2544384.551GFOD1measuredup2.282
259504.590RBP4imputedup1.541

Question 2

I answered:

How many overexpressions significantly upregulated their target gene (expected)? How many overexpressions significantly downregulated their target gene (unexpected)?

R code:

# workshop with dan himmelstein

# which genes have regulation scores. It also contains whether the gene's expression was directly measured or imputed

path <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
gene_df <- readr::read_tsv(path)

# significantly dysregulated genes due to knockdown perturbations

path2 <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
kd_df <- readr::read_tsv(path2)

# significantly dysregulated genes due to overexpression perturbations

path3 <- 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv'
oe_df <- readr::read_tsv(path3)

dat <- filter(oe_df, perturbagen == entrez_gene_id)

dat %>%
  dplyr::group_by(direction) %>%
  dplyr::summarize(
    count = n()
  )

R output:

Source: local data frame [2 x 2]

  direction count
      (chr) (int)
1      down     4
2        up   124
/

So, 4 genes were significantly downregulated after being overexpressed (unexpected) and 124 genes were significantly upregulated after being overexpressed (expected).

Thanks Dan!

For question 3,

How many genes were never significantly dysregulated by any knockdown perturbation? Report this number for both the measured and inferred gene sets.

The elegant dplyr solution (thanks Daniel) looks like:

#how many times are genes disregulated in all?
count_df = knockdown_df %>%
  dplyr::group_by(entrez_gene_id) %>%
  dplyr::summarise(count=n())

#join the table of counts with the full table of genes. the genes that were not present
#are automatically converted to missing data
full=gene_df %>% 
  dplyr::left_join(count_df)

#divide the missing data by imputed vs. measured
result = full %>% dplyr::filter(is.na(count)) %>% 
  dplyr::group_by(status) %>% 
  dplyr::summarise(count=n())

The solution: of all the genes, very few avoid disregulation!

    status count
     (chr) (int)
1  imputed    55
2 measured     1

Question 1

How many knockdowns significantly downregulated their target gene (expected)? How many knockdowns significantly upregulated their target gene (unexpected)?

Here is the Code :)

path = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
gene_kd = readr::read_tsv(path)

gene_kd %>%
  dplyr::filter(perturbagen == entrez_gene_id) %>%
  dplyr::group_by(direction) %>%
  dplyr::summarize(
    count = n()
  )

Output:

  direction count
      (chr) (int)
1      down   806
2        up     9

Conclusion: Of the knockdown genes, 806 significantly downregulated their gene (expected) while 9 upregulated their gene (unexpected)

Question 6

knock_down_path =    "https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/con    sensi/signif/dysreg-knockdown.tsv"
kd_genes = readr::read_tsv(knock_down_path)
kd_genes$direction = as.factor(kd_genes$direction)

kd_genes %>%
  group_by(symbol, direction) %>%
  dplyr::summarise(count=n()) %>%
  tidyr::spread(key = direction, value = count, fill = 0) %>%
  arrange(desc(up)) %>% top_n(n = 10, wt = desc(up))

Resulting Table

symboldownup
MCOLN121128
MAL0985
WIF10884
SERPINA30873
SATB10862
CES10849
XIST30764
CRIP10713
KLHL212602
COL11A10562
TF0527
ERAP20512
ABCC53501
AGR22478
CPVL1476

Notes

These are the top 10 most unregulated genes. These up-regulated genes do not appear to be down-regulated with any significant frequency

Question 5

Which ten genes were most frequently significantly downregulated by gene knockdowns? How many knockdowns significantly downregulated these genes? How many knockdowns significantly upregulated these genes?

Here's my code:

path="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv"
path2="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv"
path3="https://raw.githubusercontent.com/dhimmel/lincs/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv"
gene_df = readr::read_tsv(path)
kd_gene = readr::read_tsv(path2)
oexp_gene= readr::read_tsv(path3)
head(gene_df)
head(kd_gene)
View(kd_gene)
library(dplyr)
library(tidyr)

gene_df %>%
  dplyr::group_by(status) %>%
  dplyr::summarize(
    count=n()
  )

gene_df %>% 
  dplyr::mutate(kind='gene')

#which 10 genes were most frequently dowregulated by KDs

#first, find number of distinct genes downregulated by KDs (7411)
kd_gene$entrez_gene_id %>% 
  n_distinct()
#next, find number of pertubagens (4312)
kd_gene$perturbagen %>% 
  n_distinct()

#from the top 10 genes, how many times were they downregulated? 
#genes most frequently DOWNREGULATED by the KNOCKDOWNS


#filter to only downregulated KDs
downregulated_kds <- kd_gene %>% 
  filter(direction=="down")
#sort by count to downregulated KDs
downreg_kd_sorted <- downregulated_kds %>%
  dplyr::group_by(symbol) %>%
  dplyr::summarise(
    count=n()
  ) %>%
  dplyr::arrange(desc(count))
head(downreg_kd_sorted, 10)
View(downreg_kd_sorted)

#from the top 10 genes, how many times were they UPREGULATED? 
#genes most frequently UPREGULATED by the KNOCKDOWNS

#filter to only upregulated KDs
upregulated_kds <- kd_gene %>% 
  filter(direction=="up")
#sort by count to upregulated KDs
upreg_kd_sorted <- upregulated_kds %>%
  dplyr::group_by(symbol) %>%
  dplyr::summarise(
    count=n()
  ) %>%
  dplyr::arrange(desc(count))
head(upreg_kd_sorted, 10)
View(upreg_kd_sorted)


#How many knockdowns downregulated these genes? 195,786
#How many knockdowns upregulated these genes? 132,282
nrow(kd_gene)
kd_gene %>%
  dplyr::group_by(direction) %>%
  dplyr::summarize(
    count=n()
  )

upreg_kd_sorted<- upreg_kd_sorted %>% 
  dplyr::rename(up_count=count)
dim(upreg_kd_sorted)
View(upreg_kd_sorted)

downreg_kd_sorted <-downreg_kd_sorted %>%
  dplyr::rename(down_count=count)
dim(downreg_kd_sorted)

MYANSWER<- dplyr::full_join(downreg_kd_sorted, upreg_kd_sorted)

MYANSWER[is.na(MYANSWER)] = 0
MYANSWER

Here's my answer:

symboldown_countup_count
RPS4Y116370
CDC2014561
PCNA13600
NME111820
MIF10520
CSRP110311
STUB199610
TIMM99894
TYMS8810
GDF158660

Question 7, 8, and 9

#read in data 
path_genes = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
gene_df = readr::read_tsv(path_genes)

path_down = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
down_df = readr::read_tsv(path_down)

path_over = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-overexpression.tsv'
over_df = readr::read_tsv(path_over)

Question 7- Emmalyn Chen

q.7 = over_df %>% subset(z_score < 0) %>% group_by(entrez_gene_id) %>% summarize(count=n()) %>% arrange(-count)
q.7 = q.7[1:10,]

a. Which ten genes were most frequently significantly downregulated by gene overexpression?

entrez_gene_id count
            (int) (int)
1            6192   166
2             991   165
3            5111   165
4            5018   122
5             994   105
6           26520   102
7            1738    91
8            9133    86
9            7298    84
10          22827    83

b. How many overexpressions significantly downregulated these genes?

q.7.2 = over_df %>% subset(z_score < 0) %>% filter(entrez_gene_id %in% q.7$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

612 overexpressed genes

c. How many overexpressions significantly upregulated these genes?

q.7.3 = over_df %>% subset(z_score > 0) %>% filter(entrez_gene_id %in% q.7$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

4 overexpressed genes

Question 8 - Liz Levy

q.8 = over_df %>% subset(z_score > 0) %>% group_by(entrez_gene_id) %>% summarize(count=n()) %>% arrange(-count)
q.8 = q.8[1:10,]

a. Which ten genes were most frequently significantly upregulated by gene overexpression?

entrez_gene_id count
            (int) (int)
1           57192   180
2            5331   152
3           25966   140
4           23378   113
5            4118   104
6            9903    99
7           55008    98
8            1066    96
9            5971    94
10           7503    94

b. How many overexpressions significantly upregulated these genes?

q.8.2 = over_df %>% subset(z_score > 0) %>% filter(entrez_gene_id %in% q.8$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

792 overexpressed genes

c. How many overexpressions significantly downregulated these genes?

q.8.3 = over_df %>% subset(z_score < 0) %>% filter(entrez_gene_id %in% q.8$entrez_gene_id) %>% 
  group_by(perturbagen) %>% summarize(count = n())

14 overexpressed genes

Question 9 - Marjorie Imperial

For knockdown perturbations, what is the correlation between number of significantly down and upregulated measured genes.

q.9.down.reg = down_df %>% subset(z_score < 0) %>% group_by(perturbagen) %>% summarize(count.down.reg = n())
q.9.up.reg = down_df %>% subset(z_score > 0) %>% group_by(perturbagen) %>% summarize(count.up.reg = n())

joined_df = dplyr::full_join(q.9.down.reg, q.9.up.reg)
joined_df[is.na(joined_df)] = 0

Pearson correlation, R =0.9371317

cor(joined_df$count.down.reg, joined_df$count.up.reg)

Kendall correlation R = 0.732456

cor(joined_df$count.down.reg, joined_df$count.up.reg, method = 'kendall')

See Question 7 above.

Question 4

How many genes were never significantly dysregulated by any knockdown perturbation?

library(dplyr)
install.packages("tidyr")
library(readr)
library(ggplot2)

path = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/genes.tsv'
path_ko = 'https://github.com/dhimmel/lincs/raw/abcb12f942f93e3ee839e5e3593f930df2c56845/data/consensi/signif/dysreg-knockdown.tsv'
gene_df = readr::read_tsv(path)
gene_ko_df = readr::read_tsv(path_ko)

ghost_df = gene_df[! (gene_df$entrez_gene_id %in% gene_ko_df$entrez_gene_id), ]
nrow(ghost_df)

#for a list of these genes
ghost_df$symbol

The number of genes that were not sig dysregulated by knockdown perturbation (on main list of genes, but not on knockdown list of genes) = 56!

See question 8 above.

Closing remarks

Impressive work!

Each of the nine pupils in attendance answered their question. Most finished within two hours — despite several having little R experience — after an initial 30 minute tutorial. The workshop succeeded at introducing a broad range of topics: R, the hadleyverse, transcriptomics, LINCS L1000, markdown, Thinklab, and open science.

I enjoyed helping the pupils learn while they performed original and noteworthy analyses. And meanwhile, through the power of realtime open science on Thinklab, we're now coauthors on a citeable work [1].

The workshop built off of many developments in scientific education: specifically, solving problems [2] in contemporary research [3] while contributing to the scientific record [4].

Next, I'll review the answers to see what we have learned.

Workshop conclusions

Here's my analysis of the answers from today's workshop. Thanks again to the pupils for their hard work.

Do target genes of genetic perturbation respond in the expected direction?

Yes, we established this important control. Knockdown overwhelming downregulated (806 instances) rather than upregulated (9 instances) its target gene (Q1 by @jeffreykim). Overexpression overwhelming upregulated (124 instances) rather than downregulated (4 instances) its target gene (Q2 by @kathleenk).

Are the many genes that never respond to genetic perturbation?

No, we saw that almost all genes were dysregulated by at least one genetic perturbation. Only 0.7% of genes (56 out of 7,467) were never dysregulated by a knockdown (Q4 by @jasleensodhi). Only 1 of these genes was measured, while the remaining 55 were imputed (Q3 by @mishavysotskiy). This imbalance makes sense since imputed genes were subject to a more stringent significance threshold. The low number of never-dysregulated genes is a welcome result from a network perspective, where pervasive connectivity is important.

Which genes are most frequently dysregulated?

Next, we identified which genes were most frequently dysregulated due to a knockdown. RPS4Y1 was downregulated by 37.8% (1,637 out of 4,326) of knockdowns (Q5 by @juliacluceru). MCOLN1 was upregulated by 26.1% (1,128 out of 4,326) of knockdowns (Q6 by @beaunorgeot). The top-ten-most-frequently-downregulated-by-knockdown genes were rarely upregulated by knockdown. The same consistency in direction of dysregulation applied to the top-ten-most-frequently-upregulated genes as well.

Next, we identified which genes were most frequently dysregulated due to overexpression. RPS4Y1 was downregulated by 6.9% (166 out of 2,413) of overepressions (Q7 by @emmalynchen). MCOLN1 was upregulated by 7.5% (180 out of 2,413) of overepressions (Q8 by @elizabethlevy1). Interestingly, RPS4Y1 was the most downregulated gene by both knockdown and overexpression. Conversely, MCOLN1 was the most upregulated gene for both perturbation types.

The findings from Q5–8 fit with @larsjuhljensen's hypothesis that a general stress response may cause many genes to respond to any genetic perturbation in a consistent direction. Q5–8 also help address @caseygreene's question on which genes are driving the signals.

Does broad downregulation occur in tandem with broad upregulation?

Finally, there was a strong correlation between the number of downregulated and upregulated genes per knockdown (Q9 by @marjorieimperial). In other words, a perturbation which downregulates many genes will also likely upregulate many genes.

  • Lars Juhl Jensen: Reassuring to see that things behave the way I would expect. This should make it fairly easy to derive a scoring scheme that extracts only associations that are specifically associated with a small number of perturbations, as opposed to associated with any perturbation.

  • Lars Juhl Jensen: The part about broad downregulation occurring in tandem with broad upregulation is almost a given. Even if it is not the case biologically, this will be the case after most normalization methods.

 
Status: Completed
Views
183
Topics
Referenced by
Cite this as
Daniel Himmelstein, Kathleen Keough, Misha Vysotskiy, Jeffrey Kim, Beau Norgeot, Julia Cluceru, Marjorie Imperial, Emmalyn Chen, Jasleen Sodhi, Elizabeth Levy (2016) Workshop to analyze LINCS data for the Systems Pharmacology course at UCSF. Thinklab. doi:10.15363/thinklab.d181
License

Creative Commons License

Share