#!/usr/bin/env python
# coding: utf-8

# # Introduction to RNA sequencing and data display in Python

# ## This module defines differential expression and teaches the code to make some common displays for RNA seq data including heat map, cluster map, and violin plot. 

# It's hard to believe, but a cell in a liver has the exact same set of DNA instructions as a neuron.  The only difference is in how those genes are expressed. Liver contains liver specific cells with entire programs of development and metabolism that are based on their identity as liver cells and vice versa for neurons. 
# 
# RNA-seq lets us measure this "activity identification" by measuring the abundance of the RNA transcript as a proxy to indicate activity in a cellular process. mRNA samples are isolated and sequenced, creating short fragments of DNA that line up more with one gene or another in a cDNA library that serves as a reference point. More actively expressed genes associate in greater numbers with the cDNA library and fluoresce more brightly to a camera.  The brightness or dimness is translated to numbers that become the matrix of data we analyze: differential gene expression.
# 
# 

# ![RNA-seq.png](attachment:RNA-seq.png)

# The data we will play with today is for demonstration purposes only. It has ten different experimental parameters. In our lab, we collect tissue around the vocal fold at different stages of mouse development below, so you can think of each ROW in our heatmap is a different stage of development in a mouse embryo below. "PO" means postnatal, the mouse is born!
# 
# (Image from: Badea A, Johnson GA. Magnetic resonance microscopy. Anal Cell Pathol (Amst). 2012;35(4):205-27. doi: 10.3233/ACP-2011-0050. PMID: 22142643; PMCID: PMC3508709.)

# ![Mid-sagittal-slices-from-volume-datasets-of-developing-mouse-at-stages-E105-E185.png](attachment:Mid-sagittal-slices-from-volume-datasets-of-developing-mouse-at-stages-E105-E185.png)

# In[126]:


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# Next we'll pull up the sequence data from our repository.  This is a tiny sample set of 99 genes, so we can store the data on Github, which is a place where computer programmers use and share programming code. In real life, scientists use special servers to hold the RNA sequencing data, which contains a little over 25,000 genes for each experimental parameter!  At the University of Wisconsin-Madison, many of those servers are in the basement of the Morgridge Institute for Research and/or the Discovery Center.

# In[127]:


df = pd.read_csv('https://raw.githubusercontent.com/kwendt3/Fourier/main/1geneexpress.csv')


# Let's take a look at our file, which shows a MATRIX of data, the framework computers need to look at data in flux.  Our matrix values are variance values that are determined by a baseline in the mouse genome.  At the top, we see our 99 sample genes. Each column represents a different gene, one of 99 for this study, and row represents a different location in the tissue, or perhaps a different timepoint that tissue was collected, one of ten in this study. The command "df.head" is a nice way to just check to make sure our dataframe looks good, and to tell us just how many rows and columns we have in case we need to code around it. Looking great!  Remember, we named our data frame "df" in the line above, but we can call it whatever we want. 

# In[128]:


df.head(99)


# Looking at our data, it's hard to tell visually what's going on.  It just looks like a bunch of NUMBERS. To observe changes in the data over time, scientists convert the numbers to colors that correspond with raising or lowering levels.  This is called a heatmap.  In the command below, we'll generate a heatmap of the entire matrix of data. So higher numbers are beige, and lower numbers are black. 

# In[129]:


ax = sns.heatmap(df, annot=True, annot_kws = {'size':1})
ax.tick_params(labelsize=14)
ax.figure.set_size_inches((12, 10))


# Now that we have a heatmap of the entire matrix, let's take a look at what's really going on in each column of data, since each column represents gene expression.  This is called normalization, because it NORMALIZES variations by a group that you pick out.  In our case, the "group" is/are the gene(s) of interest.  You can change the color scale of your heatmap to whatever you want using the cmap command.  Here are some common choices: https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html.  Now is a good time too to label our Y axis by the day that we collected our data.

# In[130]:


y_axis_labels = ["E10.5","E11.5","E12.5","E13.5","E14.5","E15.5","E16.5","E17.5","E18.5","P0"] # labels for y-axis
df_norm_col=(df-df.mean())/df.std()
plt.figure(figsize = (16,5))
sns.heatmap(df_norm_col, annot=True, annot_kws = {'size':1}, cmap='cividis', yticklabels=y_axis_labels)


# After obtaining lists of genes that were differentially expressed across the ten conditions. we used these genes as input for clustering to define the prevalent patterns of gene expression. A heat map provides a way to visually assess the results of clustering on the data, enabling the investigator and reader to observe trends of expression for genes across populations, treatment conditions, or time points

# In[131]:


x_axis_labels = ["E10.5","E11.5","E12.5","E13.5","E14.5","E15.5","E16.5","E17.5","E18.5","P0"] # labels for x-axis
g = sns.clustermap(df_norm_col, xticklabels=y_axis_labels, cmap='cividis')


# In[132]:


df.head(99)


# So, we're going to want to see how differential gene expression is distributed for our ten experimental parameters (0-9) for particular genes (gene_1-gene_99).  To do that, we need our data to be transposed so that the experimental parameters are in the columns, where Seaborn likes things.  Let's introduce the transposition as df2. 

# In[133]:


df2 = pd.read_csv('https://raw.githubusercontent.com/kwendt3/Fourier/main/geneexpress_experiment.csv')


# In[134]:


df2.head(10)


# Notice above that now each gene 1-10 is saved as a row 0-9.

# In[135]:


df2 = df2.melt(var_name='Development', value_name='gene_distribution')
print (df2)


# In[136]:


ax = sns.violinplot(x="Development", y="gene_distribution", data=df2)