In order to demonstrate the use of my script nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py
, found here, I'll use it to plot nucleotide difference imbalance for the span of coordinates from the Saccharomyces cerevisiae S288C reference sequence that is similar to Figure 8, panel B of Morrill et al 2016(PMID: 27026700).
This will also demonstrate in the 'Preparation' section use of my script get_chromosomal_coordinates_as_FASTA.py
, found here, to use YeastMine to retrieve a sequence for a specified region of a S. cerevisiae chromosome.
Full reference for the plot style/approach:
References for sequence data and YeastMine/Intermine interface used in demonstration:
If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!.
Some tips:
The script needs a sequence file in FASTA format to take as input to scan and then plot the related traits across the sequence. Here we will retrieve a sequence of a span of the yeast chromosomes IV and use that as the input for the script demonstration below.
To do that we'll use my script get_chromosomal_coordinates_as_FASTA.py
, found here, to use YeastMine to retrieve a sequence for a specified region of a S. cerevisiae chromosome.
# Get the script
!curl -OL https://raw.githubusercontent.com/fomightez/yeastmine/master/get_chromosomal_coordinates_as_FASTA.py
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 16021 100 16021 0 0 151k 0 --:--:-- --:--:-- --:--:-- 151k
Get a sequence to work with.
%run get_chromosomal_coordinates_as_FASTA.py IV 200000-220000
Sequence on Watson strand specified...retrieving sequence from chromosome IV...making FASTA formatted entry with retrieved sequence... File of genomic sequence saved as 'S288C_chrIV_200000-220000_genomic.fsa'. Finished.
Now we have a sequence to use to plot.
There is also a dependency for the plotting script that isn't currently installed by default in this environment. (Whether that is already installed could change at a later point and then there is still no harm running this still, and it will just say something to the effect that it already satisfied.)
#install a necessary dependency
!pip install pyfaidx
Collecting pyfaidx Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz Requirement already satisfied: six in /srv/conda/lib/python3.7/site-packages (from pyfaidx) (1.12.0) Requirement already satisfied: setuptools>=0.7 in /srv/conda/lib/python3.7/site-packages (from pyfaidx) (40.8.0) Building wheels for collected packages: pyfaidx Building wheel for pyfaidx (setup.py) ... done Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b Successfully built pyfaidx Installing collected packages: pyfaidx Successfully installed pyfaidx-0.5.5.2
This script gets a sequence from a sequence file in FASTA format. It can be either a single sequence or more. You provide an indentifier to specify which sequence in the multiFASTA file to mine. In fact you always need to provide something for the indentifier parameter when calling this script or the main function of it, but that text can be nonseniscal if there is only one sequence in the sequence file. It disregards anything provided if there is only one.
The only other thing necessary is providing start and end positions to specify the subsequence. Positions are to be specified in typical position terms where the first residue is numbered one.
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/plot_nt_imbalance/nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 20587 100 20587 0 0 148k 0 --:--:-- --:--:-- --:--:-- 148k
!python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py -h
usage: nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py [-h] [-bl BLOCK_SIZE] [-ov OVERLAP_SIZE] [-svg] SEQUENCE_FILE CHR START_POS nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2 016.py makes a plot figure like Figure 8, panel B of Morrill et al 2016 (PMID: 27026700), see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4882425/figure/F8/. **** Script by Wayne Decatur (fomightez @ github) *** positional arguments: SEQUENCE_FILE Name of sequence file to use as input. Must be FASTA format. Can be a multi-FASTA file, i.e., multiple sequences in FASTA format in one file. CHR Designation of chromosome / contig/ scaffold / region corresponding to that in sequence file. To be used for plot label. No spaces. Feel free to prefix with a species indicator as well . START_POS Coordinate position that corresponds to first position of specified along chromosome. To be used for label of x-axis. No spaces. optional arguments: -h, --help show this help message and exit -bl BLOCK_SIZE, --block_size BLOCK_SIZE OPTIONAL: Use the `--block_size` flag followed by an interger to provide a value to use as the span size (window of basepairs) to analyze instead of the default of '150'. -ov OVERLAP_SIZE, --overlap_size OVERLAP_SIZE OPTIONAL: Use the `--overlap_size` flag followed by an integer to specify the amount of overlap to use between the analysis windows instead of the default of '50'. -svg, --save_vg add this flag to save as vector graphics (**RECOMMENDED FOR PUBLICATION***) instead of default png. Not default or saved alongside default because file size can get large due to the large number of points.
Next, we'll use that to plot the sequence data we obtained during 'Preparation'.
This will run it as if it was on the command line. (Note if this was indeed on the command line, the command would be the same with the exception of !
at the start. The exclamation point is a Jupyter/IPython specific signal to run the command following it as if it was on the command line.)
In addition to calling the script, we provide it the name of the sequence file to use and a couple of items to use in labels. The items provided cannot have spaces, and you'll note that those are actually present in the sequence name that happened to be made. However, the script itself doesn't harvest that information so that the script is more general.
!python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png
Because we are in a Jupyter environment we can use code to view that generated image here.
from IPython.display import Image
Image("S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png")
That process above was meant to better mirror how you'd run it on the command line. If actually were using this script in a Jupyter notebook, you could simply run the command in the next cell with the Jupyter magics %run
command and the resulting image will get shown in the notebook without need for additional code like in the last cell. (Note that it make take rerunning again this cell in order for it display the plot.)
Like so:
%run nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png
The size of the window of basepairs analyzed and the amount of overlap of those windows can be changed from the default settings by putting values after the signaling flags of --block_size
and --overlap_size
, respectively. For example:
%run nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000 --block_size 80 --overlap_size 30
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png
Remember, you may need to run that cell above twice for the result to show up.
You'll note subtle differences from the earlier plots indicating the different settings were used. One particular place to look is the very first section of the plot on the left side.
Because we are in a Jupyter environment, we could have used code to extract the details used in the last two required arguments of the command call, specifically chrIV
and 200000
, from the file name. I'll illustrate that for this example because it could be handy if processing multiple regions.
#Use fnmatch to make a list of files ending in `.fsa` and the work through them splicing out useful bits to make plot command
import os
import sys
import fnmatch
name_part_to_match = ".fsa"
associated_fns= []
for file in os.listdir('.'):
if fnmatch.fnmatch(file, '*'+name_part_to_match):
associated_fns.append(file)
sys.stderr.write("Files ending in `{}` identified.\nReading and plotting each...".format(name_part_to_match))
# Interate over list to process name and fire off commands
for fn in associated_fns:
chr_name = fn.split("_")[1]
start_num = fn.split("_")[2].rsplit("-")[0]
!python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py {fn} {chr_name} {start_num}
sys.stderr.write(f"Processed assembled command:\n 'python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py {fn} {chr_name} {start_num}' ")
sys.stderr.write("\nEnd of list of files ending in `{}`reached.\nDone.".format(name_part_to_match))
Files ending in `.fsa` identified. Reading and plotting each...
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png
Processed assembled command: 'python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000' End of list of files ending in `.fsa`reached. Done.
Although you may wish to further simplify things by importing the main function of this script and provide the details as Python objects that way. This is illustrated at the end of the next section about importing the main function.
The only option other not already illustrated is adding --save_vg
(or the abbreviated form -svg
) to the commands to produce vector graphics that are very useful for further modification by Inkscape or Adobe Illustrator. The bitmapped 'PNG'-style graphics or more universally viewable and so are used here for convenience; however, they don't scale as well as vector graphics.
This will demonstrate importing the main function into a notebook.
# Get the script if the above section wasn't run. Does nothing if it was.
import os
file_needed = "nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py"
if not os.path.isfile(file_needed):
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/plot_nt_imbalance/nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py
Let's make it so the equivalent to the example command worked out above and shown again on the line below runs and the plot is returned directly into the notebook environment as a figure object to display in this notebook.
python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000
At the end of this section we'll show how the returned notebook can be further modified via the notebook interface when doing this.
First we need to import the main function from the script file.
from nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016 import nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016
Now to try using that to do what was initially done with the sequence data above. The main difference besides the arrangement of the items is that we should specify return_fig=True
.
%matplotlib inline
out = nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016("S288C_chrIV_200000-220000_genomic.fsa", "chrIV", 200000, return_fig=True);
Plot figure object returned.
Note that if we simply had left off return_fig=True
, it would have been a perfectly valid statement. It would have been interpreted as set to False
and an image file would be produced just like when calling from the command line.
However, returning the figure as a Python object affords us the opportunity to further modify it within the Jupyter environment, even after the function generated it via the code in the python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py
script. This will be demonstrate below at the end of this section; however before getting to that novel use of the main function and the Jupyter environment, let's revisit the example above where we scripted capaturing of the elements needed in the call to the script from the input file name. If we stay within all Python it becomes slightly easier to do because we don't need to use the brackets to feed python variables into the shell commands.
The next cell recapitulates the approach illustrate at the end of the basic usage on the command line section for how'd you do that in the notebook after importing the main function.
#Use fnmatch to make a list of files ending in `.fsa` and the work through them splicing out useful bits to make plot command
import os
import sys
import fnmatch
name_part_to_match = ".fsa"
associated_fns= []
for file in os.listdir('.'):
if fnmatch.fnmatch(file, '*'+name_part_to_match):
associated_fns.append(file)
sys.stderr.write("Files ending in `{}` identified.\nReading and plotting each...".format(name_part_to_match))
# Interate over list to process name and fire off commands
for fn in associated_fns:
chr_name = fn.split("_")[1]
start_num = int(fn.split("_")[2].rsplit("-")[0])
nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016(fn, chr_name, start_num, return_fig=True)
sys.stderr.write("\nEnd of list of files ending in `{}`reached.\nDone.".format(name_part_to_match))
Files ending in `.fsa` identified. Reading and plotting each...Plot figure object returned. End of list of files ending in `.fsa`reached. Done.
The only difference besides the line calling the function was that we needed to cast start_num
to integer.
The size of the window of basepairs analyzed and the degree of overlap of those windows can be changed when the function is used via import as well, you just need to specify the chunk_size
and overlap_specified
when calling the function. For example:
%matplotlib inline
nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016("S288C_chrIV_200000-220000_genomic.fsa", "CHANGED SETTINGS EXAMPLE", 200000, chunk_size = 80, overlap_specified=30, return_fig=True);
Plot figure object returned.
I mentioned earlier that the figure object returned by the command can be modified subsequently. This allows customization of the image using Python. This notebook will end by illustrating that with the addition of a title, adjustment of the font sizes, and addition of annotation.
Note that running from the command line, we generated an image file that we could modify using image software. (Note: you probably be best to add the --save_vg
in that case as that will make an infinitely scalabe vector graphics file.) However, by using the main function via import and getting back a plot figure object, it can be further modifed still via Python plot commands.
Let's demonstrate by adding a title, adjusting the font size, and adding an annotation.
Just to make sure everything is on the same page we'll run commands like run earlier, this time as a block.
from nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016 import nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016
out2 = nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016("S288C_chrIV_200000-220000_genomic.fsa", "chrIV", 200000, return_fig=True)
Plot figure object returned.
It doesn't matter if we specify to import the main function again; if it was already, that command will be ignored. (However, I added it to make it easier to possibly just to this section after running the preparation section.)
Note we make assign the plot figure produced as out2
to in order to distinguish it from earlier. Let's modify out2
.
First will add a title.
# add a title.
out2.set_title("Plot title goes here")
Text(0.5, 1.0, 'Plot title goes here')
Hmmm. No plot but there is some indication something happened. If we had added ;
, we could suppress that and then subsequently view the result.
out2.set_title("Plot title goes here");
So when suppressed because of including ;
so there is no last real command to show output for in the output cell, we get no indication anything happened. To view the plot we need to call the figure attribute of the object.
out2.figure
Thus, we see it worked.
Let's make some of the fonts larger as further customization.
out2.set_title("Plot title goes here", fontsize = 30) # or `out2.set_title(fontsize = 30);`
out2.legend(fontsize = 14);
You could view the plot figure with out2.figure
again to make sure it changed; however, let's assume for now it did and make one more change. Let's add an annotation. And then view the result.
import matplotlib.patches as patches
#define settings
patch_left_x = 205360
patch_width = 597 #length of CTD coding to match Corden 2013.
patch_bottom_y = -59.5
patch_height = 104.0
offset_to_center_better = 74
CTDmidpoint = (patch_left_x + (patch_left_x +patch_width))/2 + offset_to_center_better
bottom_region_plot = -28
# apply annotations
out2.add_patch(patches.Rectangle(
(patch_left_x, patch_bottom_y), patch_width,patch_height,fill=True,facecolor=(0.48,0.48,0.48,0.2),
edgecolor=(0.3,0.3,0.3,0.03),lw=0,clip_on=False))
t = out2.text(CTDmidpoint, bottom_region_plot, "CTD Repeat Region", ha="center", va="center", color=(0.28,0.28,0.28),
rotation=90,size=11)
out2.figure
You can display the original output from prior with the following command, if it is in memory, and note that it doesn't have these changes.
out.figure # ORIGINAL FROM EARLIER
Or you could further modify by doing things such as stretch it out horizontally like a tapestry with out2.figure.set_size_inches((58, 10))
(that is from https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/figure.py and
seemed to be the trick I could find nowhere uncluttered to allow to adjust size of seaborn
lineplots after intialy produced; doesn't seem to require forward=True
in Jupyter
if call again with out2.figure
subsequently). That will be left as an exercise for the reader.
You'll want to save the new version of the plot as an image file, like so:
out2.figure.savefig("the new plot file.png") # or if you want it with less whitespace around it,
# try `out2.figure.savefig("the new plot file.png", bbox_inches='tight')`
Bear in mind you'd use something like out2.figure.savefig("the new plot file.svg")
if you wanted it as vector graphics which would be better for scaling.
Save the image file file to your local machine.
Enjoy!