Sean's answers
I've downloaded Wiggins' data files to the same directory as my jupyter notebook: w07-data.1, w07-data.2, and w07-data.3.
I've also installed R, BioConductor, and edgeR on my machine, following the instructions in the pset.
We're supposed to write a function that takes the name of an input counts file as an argument, and returns the results of an edgeR analysis.
I'm going to go ahead and include a `do_tmm` option, to ask for an optional TMM normalization step. Wiggins didn't do this in his analysis script, and that's going to be the answer to part (4). Other than that, I'm just recapitulating his analysis steps -- just using Python to write an R script as a temporary file that I can run (in R) with `Rscript`, then using Python to parse the results file back in. Communication through files: it's not pretty, but it gets the job done.
My `groups` argument to my function is a little too chummy with my `infile` argument: to call the function, you have to know how many data columns are in the infile, and define the `groups` list accordingly, with a condition for each data column. This is fine for a rough&ready analysis -- especially since in this pset we've contrived a case where we don't know what the condition labels on Wiggins' input files are, because Wiggins lost them. If you were really writing something stable, it would be better to define the format of your infiles to include the condition information.
I'm also being sloppy about hardcoding the script file name `tmp.r` into my function. This would bite me eventually, in real use. If I run my function in two different scripts at the same time in the same directory, their files will clobber each other unpredictably. Better, I could provide a `prefix` argument to my function, and have it create all its temporary files so they start with that prefix; and I could have my main program make sure that it uses a unique prefix. This is something you have to worry about whenever you're communicating through files. Someone else might try to talk at the same time, and you have to make sure you have distinct channels.
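For instance (a sketch of the unique-prefix idea, not what my function below actually does), Python's `tempfile` module can hand you a guaranteed-unique file name:

```python
import tempfile
import os

# Sketch: get a unique script file name instead of hardcoding "tmp.r".
# mkstemp() creates the file with a name guaranteed unique in that directory,
# so two scripts running at once get distinct channels.
fd, scriptfile = tempfile.mkstemp(suffix=".r", dir=".")
os.close(fd)               # we only wanted the unique name; we'd reopen it ourselves to write the script
print(scriptfile)          # a unique name ending in .r, in the current directory

os.remove(scriptfile)      # clean up when done
```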
def run_edgeR(infile, groups, do_tmm, outfile):
    """
    Run an edgeR analysis on the data in file <infile>, assigning condition labels <groups> to its columns.
    Include a TMM normalization step in the analysis if <do_tmm> is True.
    Store edgeR's results table in <outfile>.
    Also return results as arrays for <genename>, <logFC>, <pvalue>, and <fdr>.

    Input:
        infile  = tab-delimited file containing C+1 columns; 1st column is gene name, followed by C columns of count data.
                  First row consists of column headers, not data.
        groups  = string of C condition labels, one per data column; example "1,1,1,2,2,2" for data where the first three
                  columns are from one condition, last three columns are from another. edgeR will treat identical
                  condition labels as replicates.
        do_tmm  = False to do Wiggins' analysis; True to do ours, including a TMM normalization step.
        outfile = name of file to send edgeR results table to

    Output:
        genename[0..N-1]: array of gene names, sorted by P value
        logFC[0..N-1]:    array of log_2 fold changes
        pvalue[0..N-1]:   P values
        FDR[0..N-1]:      false discovery rates at each rank
    and <outfile> contains the edgeR table.
    """
    # Create an R script to run edgeR, as a temporary file.
    #
    with open("tmp.r", "w") as f:                             # semi-danger: we've hardcoded the script file name "tmp.r"
        print('library(edgeR)', file=f)
        print('infile  <- "{}"'.format(infile), file=f)       # R: set some R variables for infile, groups, outfile
        print('groups  <- factor(c({}))'.format(groups), file=f)
        print('outfile <- "{}"'.format(outfile), file=f)
        print("x <- read.table(infile, sep='\t', row.names=1)", file=f)  # R: parse infile as tab-separated data; its '#gene' header line is skipped as an R comment
        print("y <- DGEList(counts=x, group=groups)", file=f) # edgeR: create edgeR's data object, a DGEList
        if do_tmm:
            print("y <- calcNormFactors(y)", file=f)          # edgeR: optional TMM normalization step
        print("y <- estimateDisp(y)", file=f)                 # edgeR: estimate the common dispersion
        print("et <- exactTest(y)", file=f)                   # edgeR: calculate P values, FDRs for differential expression
        print("tab <- topTags(et, nrow(x))", file=f)          # edgeR: create results table, sorted by P value
        print("write.table(tab, file=outfile)", file=f)       # R: write output table to <outfile>

    # Run the R script.
    # You can also do this with os.system() from the `os` module, or (better still) run() from Python's `subprocess` module.
    #
    !Rscript tmp.r

    # The results are now in <outfile>, so parse them back in.
    # If we were being careful, we'd check that the Rscript command actually succeeded, rather than assuming it!
    #
    with open(outfile) as f:
        f.readline()                           # skip the header line
        genename = []
        logFC    = []
        pvalue   = []
        fdr      = []
        for line in f:
            fields = line.split()
            genename.append(fields[0])         # gene names still have their double quotes; if I cared, I could strip them off
            logFC.append(float(fields[1]))
            pvalue.append(float(fields[3]))
            fdr.append(float(fields[4]))
    return genename, logFC, pvalue, fdr
"One of these files is not like the others"... we just need to know which of the three data files corresponds to the mutant samples. The other two are Wiggins' wild type samples. So if we make edgeR input files that consist of all possible pairs (1,2), (1,3), and (2,3), one of those combinations should show us essentially no significant differences (those are the two wild type data files), and the other two combinations will show us mutant vs. wt differences.
So first create three different input files the lazy (unix-y) way:
!join -t $'\t' w07-data.1 w07-data.2 > merged.12
!join -t $'\t' w07-data.1 w07-data.3 > merged.13
!join -t $'\t' w07-data.2 w07-data.3 > merged.23
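(Note that `join` expects its inputs to be sorted on the join field, which these files are, since the gene names are alphabetical. If you'd rather stay in Python, a sketch of the same merge, keyed on gene name so sort order doesn't matter:)

```python
# Sketch of the same merge in pure Python: read one file into a dict keyed by
# gene name, then append its data columns to each matching line of the other.
# Like join(1), genes present in only one file are dropped.
def merge_counts(file1, file2, outfile):
    right = {}
    with open(file2) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            right[fields[0]] = fields[1:]               # gene name -> its count columns
    with open(file1) as f1, open(outfile, "w") as out:
        for line in f1:
            fields = line.rstrip("\n").split("\t")
            if fields[0] in right:                      # keep only genes present in both files
                print("\t".join(fields + right[fields[0]]), file=out)

# merge_counts("w07-data.1", "w07-data.2", "merged.12")
```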
We can look and make sure that we've indeed created these three files in our current working directory, and that they each have six data columns now. For example:
! head -5 merged.12
#gene	sample1	sample2	sample3	sample4	sample5	sample6
anise	5761	1420	4412	1789	2514	1780
apricot	237	222	371	15530	33883	13203
artichoke	9304	7973	8404	10935	11200	7343
arugula	996	2523	1409	269717	66118	88089
Let's run the three edgeR analyses, using Wiggins' steps (i.e. with <do_tmm = False> for our function above).
If anything goes wrong here (and of course it does), the first thing to do is to make sure that you can run Rscript on the command line. Get out of jupyter notebook and try to run the edgeR script, using `Rscript tmp.r` (or `Rscript analyze_GL.r` using Wiggins' example). Once you get that working, you'll know what you need to fix in your function that's generating the `tmp.r` script.
This takes a little time (~20s or so):
groups = "1,1,1,2,2,2" # In each data file, we have three samples from one condition, followed by three from the other.
genes_12, logfc_12, pvalue_12, fdr_12 = run_edgeR("merged.12", groups, False, "out.12")
genes_13, logfc_13, pvalue_13, fdr_13 = run_edgeR("merged.13", groups, False, "out.13")
genes_23, logfc_23, pvalue_23, fdr_23 = run_edgeR("merged.23", groups, False, "out.23")
Loading required package: limma
Using classic mode.
Loading required package: limma
Using classic mode.
Loading required package: limma
Using classic mode.
Now we can do a pythonic (memory-piggy, horribly inefficient, but terse) way of counting how many FDRs are significant at, say, <0.1:
print("1 vs 2:", len( [ v for v in fdr_12 if v < 0.1]))
print("1 vs 3:", len( [ v for v in fdr_13 if v < 0.1]))
print("2 vs 3:", len( [ v for v in fdr_23 if v < 0.1]))
1 vs 2: 103
1 vs 3: 0
2 vs 3: 106
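(Speaking of memory-piggy: if we cared, `sum()` over a generator counts matches without materializing the whole list. A minor aside:)

```python
# Same count, without building an intermediate list: sum() of a generator of
# booleans, since True counts as 1 and False as 0.
def count_below(values, threshold):
    return sum(v < threshold for v in values)

print(count_below([0.01, 0.5, 0.09, 0.2], 0.1))   # 2
```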
OK, that's pretty unambiguous: we've identified the two wild type files, 1 and 3, because they have no significant differences between them at an FDR threshold of $<0.1$. So it's the samples in file 2 that are from the mutant.
Why did Wiggins report 2147 significantly differentially expressed genes at P$<0.05$, when it looks like there's only ~100 at FDR $<0.1$? He must've literally thresholded on the per-gene P value, forgetting that he needed to correct for the fact that he's testing about 20K genes. At a P value threshold of 0.05 per test, you expect 5% of your tests to be called positive just by chance: 20K * 5% = 1K.
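The FDR column that corrects for this comes (I believe) from the Benjamini-Hochberg procedure, which edgeR's topTags applies by default. A minimal sketch of that adjustment, to show how the ~20K tests get accounted for:

```python
# Minimal sketch of Benjamini-Hochberg adjusted P values ("FDR" in edgeR's
# table, I believe). For N P values sorted in ascending order,
#     q_i = min over j >= i of ( p_j * N / j ),  capped at 1.
def bh_fdr(pvalues_sorted):
    N = len(pvalues_sorted)
    q = [0.0] * N
    running_min = 1.0
    for i in range(N - 1, -1, -1):        # sweep from worst rank back to best
        running_min = min(running_min, pvalues_sorted[i] * N / (i + 1))
        q[i] = running_min
    return q

# With 20,000 tests, a per-gene P of 0.05 at rank 1,000 adjusts to
# 0.05 * 20000 / 1000 = 1.0 -- which is why thresholding raw P values misleads.
```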
Let's see what happens when we threshold at the per-gene P value of $<0.05$:
print("1 vs 2:", len( [ v for v in pvalue_12 if float(v) < 0.05]))
print("1 vs 3:", len( [ v for v in pvalue_13 if float(v) < 0.05]))
print("2 vs 3:", len( [ v for v in pvalue_23 if float(v) < 0.05]))
1 vs 2: 2147
1 vs 3: 986
2 vs 3: 2136
Indeed. Wiggins' result came from a comparison of file 1 (wildtype) versus file 2 (mutant).
Our conclusion, using the FDR (which corrects for multiple tests), is better. There are 103 differentially expressed genes between the samples in file 1 (wildtype) and the samples in file 2 (mutant), at FDR $<0.1$. This FDR threshold means that we expect about 0.1 (10%) of our "differentially expressed" gene calls to be false positives, and we need to keep that expected error rate in mind. We think there's really around 90 differentially expressed genes.
Look at how the 1 vs 3 comparison (wild type to wild type) gave us 986 genes with P-values $<0.05$, almost exactly the 5% that we expected. But in the other comparisons of mutant vs. wild type, it's not like we got 1000 false positives + ~90 true positives -- instead, there are about twice as many genes with $P<0.05$ as we expect, if we think there's only about 90 true differentially expressed genes.
A quick browse through the output files shows that there look to be two distinct sets of genes in the 1/2 comparison. There's a set that are very strongly upregulated in the mutant, with log2 fold changes > 5 or so (positive, because we have the samples in wildtype vs. mutant order in the `merged.12` file, and edgeR is telling us that the second group of samples, from the mutant, is upregulated relative to the first group, the wildtype). Then there's another set that are weakly downregulated, with log2 fold changes around -2.
You can uncomment the print statements below to see the actual lists, and see that there's indeed a pretty obvious breakpoint in the data.
This is an example where actually looking at the data (and maybe downsampling it) helps us see what's going on.
n = 0
for i in range(len(genes_12)):
if fdr_12[i] < 0.1 and logfc_12[i] > 4:
#print(genes_12[i], logfc_12[i], fdr_12[i])
n += 1
print("{} genes have logFC >4 and FDR < 0.1".format(n))
n = 0
for i in range(len(genes_12)):
if fdr_12[i] < 0.1 and logfc_12[i] <= 4:
#print(genes_12[i], logfc_12[i], fdr_12[i])
n += 1
print("{} other genes have FDR < 0.1".format(n))
50 genes have logFC >4 and FDR < 0.1
53 other genes have FDR < 0.1
(It's also a little suspicious that the first set all have cute vegetable/fruit names like "huckleberry", and the second set all have human-style gene names like "FAM179A", which may give you an artificial clue too, because of how the synthetic data for the pset were constructed...)
In class we talked about TMM normalization: the idea that if the expression level of some genes change per cell (in terms of their absolute transcription rate), the relative abundance of other mRNAs must change in the other direction. If the true expression change is large enough, for enough genes, the indirect effect on the relative abundance of other mRNAs can become statistically significant, if we don't do something to correct for the indirect effect.
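A toy numeric illustration of that indirect effect, with made-up numbers just to show the arithmetic:

```python
# Toy illustration of composition bias (hypothetical numbers). Suppose a cell
# has 100 genes each making 10 mRNAs (1000 total), and in the mutant one gene
# jumps to 1010 mRNAs (2000 total). The other 99 genes haven't changed in
# absolute terms -- but count data measure *relative* abundance, and their
# share of the mRNA pool has halved.
wt_total  = 100 * 10              # 1000 transcripts in wildtype
mut_total = 99 * 10 + 1010        # 2000 transcripts in mutant

wt_frac  = 10 / wt_total          # an unchanged gene's share in wildtype
mut_frac = 10 / mut_total         # its share in the mutant

print(wt_frac / mut_frac)         # 2.0: looks "2-fold down" without changing at all
```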
Looking at Wiggins' analysis script, he did indeed omit the TMM normalization step, and we have about 50 genes changing their expression by a large amount, so we should worry about it.
Let's include TMM normalization. We already wrote the analysis function to take the <do_tmm> option. We only need to re-analyze files 1 and 2 at this point, to compare wildtype and mutant. (We could aggregate all six wildtype samples in our analysis, which would be even better, but three replicates per condition will do here.)
genes_tmm_12, logfc_tmm_12, pvalue_tmm_12, fdr_tmm_12 = run_edgeR("merged.12", groups, True, "out_tmm.12")
Loading required package: limma
Using classic mode.
print("1 vs 2:", len( [ v for v in fdr_tmm_12 if v < 0.1]))
1 vs 2: 60
Yes, TMM normalization made a big difference: it cut our calls from 103 to 60, by compensating for the indirect effects on relative abundance.
At FDR $<0.1$ we're still expecting about 6 false positives in our 60 calls. What if we increase the stringency to a point where we don't expect any false positives?
print("1 vs 2:", len( [ v for v in fdr_tmm_12 if v < 0.0001]))
1 vs 2: 50
With 0.01% expected false positives, we still call 50 differentially expressed genes.
If you go and do the same analysis on the 2 vs 3 comparison, you'll see the same set of 50 differentially expressed genes at FDR < 0.0001.
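A quick way to check that claim is to intersect the two significant-gene sets. A sketch (here `genes_tmm_23` and `fdr_tmm_23` are hypothetical -- they'd come from a TMM re-run on `merged.23` that I haven't shown):

```python
# Sketch: collect the significant genes from a comparison as a set, so we can
# intersect calls from different comparisons. genes_tmm_23 / fdr_tmm_23 are
# hypothetical outputs of run_edgeR("merged.23", groups, True, "out_tmm.23").
def sig_set(genes, fdrs, threshold=0.0001):
    return {g for g, v in zip(genes, fdrs) if v < threshold}

# calls_12 = sig_set(genes_tmm_12, fdr_tmm_12)
# calls_23 = sig_set(genes_tmm_23, fdr_tmm_23)
# print(len(calls_12), len(calls_23), len(calls_12 & calls_23))
```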
For good reason: 50 differentially expressed genes is the right answer -- that's how the synthetic data were made. The true log2 fold changes are in this file, w07-secrets.