Demo of script to plot nt imbalance for sequence span

In order to demonstrate the use of my script nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py, found here, I'll use it to plot nucleotide difference imbalance for the span of coordinates from the Saccharomyces cerevisiae S288C reference sequence that is similar to Figure 8, panel B of Morrill et al 2016(PMID: 27026700).
This will also demonstrate in the 'Preparation' section use of my script get_chromosomal_coordinates_as_FASTA.py, found here, to use YeastMine to retrieve a sequence for a specified region of a S. cerevisiae chromosome.

Full reference for the plot style/approach:

DNA Instability Maintains the Repeat Length of the Yeast RNA Polymerase II C-terminal Domain. Morrill SA, Exner AE, Babokhov M, Reinfeld BI, Fuchs SM. J Biol Chem. 2016 May 27;291(22):11540-50. doi: 10.1074/jbc.M115.696252. Epub 2016 Mar 29.(PMID: 27026700)

References for sequence data and YeastMine/Intermine interface used in demonstration:

The reference genome sequence of Saccharomyces cerevisiae: then and now. Engel SR, Dietrich FS, Fisk DG, Binkley G, Balakrishnan R, Costanzo MC, Dwight SS, Hitz BC, Karra K, Nash RS, Weng S, Wong ED, Lloyd P, Skrzypek MS, Miyasato SR, Simison M, Cherry JM. G3 (Bethesda). 2014 Mar 20;4(3):389-98. doi: 10.1534/g3.113.008995.(PMID: 24374639)

Saccharomyces Genome Database: the genomics resource of budding yeast. Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS, Simison M, Weng S, Wong ED. Nucleic Acids Res. 2012 Jan;40(Database issue):D700-5. doi: 10.1093/nar/gkr1029. Epub 2011 Nov 21. (PMID: 22110037)

YeastMine--an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit. Balakrishnan R, Park J, Karra K, Hitz BC, Binkley G, Hong EL, Sullivan J, Micklem G, Cherry JM. Database (Oxford). 2012 Mar 20;2012:bar062. doi: 10.1093/database/bar062. Print 2012.(PMID: 22434830)

InterMine: extensive web services for modern biology. Kalderimis A, Lyne R, Butano D, Contrino S, Lyne M, Heimbach J, Hu F, Smith R, Stěpán R, Sullivan J, Micklem G. Nucleic Acids Res. 2014 Jul;42(Web Server issue):W468-72. doi: 10.1093/nar/gku301. Epub 2014 Apr 21.(PMID: 24753429)


If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!.

Some tips:

  • Code cells have boxes around them. When you hover over them a icon appears.
  • To run a code cell either click the icon, or click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running a * appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.


Preparation

The script needs a sequence file in FASTA format to take as input to scan and then plot the related traits across the sequence. Here we will retrieve a sequence of a span of the yeast chromosomes IV and use that as the input for the script demonstration below.
To do that we'll use my script get_chromosomal_coordinates_as_FASTA.py, found here, to use YeastMine to retrieve a sequence for a specified region of a S. cerevisiae chromosome.

In [1]:
# Get the script
!curl -OL https://raw.githubusercontent.com/fomightez/yeastmine/master/get_chromosomal_coordinates_as_FASTA.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16021  100 16021    0     0   151k      0 --:--:-- --:--:-- --:--:--  151k

Get a sequence to work with.

In [2]:
%run get_chromosomal_coordinates_as_FASTA.py IV 200000-220000
Sequence on Watson strand specified...retrieving sequence from chromosome IV...making FASTA formatted entry with retrieved sequence...

File of genomic sequence saved as 'S288C_chrIV_200000-220000_genomic.fsa'.
Finished.

Now we have a sequence to use to plot.

There is also a dependency for the plotting script that isn't currently installed by default in this environment. (Whether that is already installed could change at a later point and then there is still no harm running this still, and it will just say something to the effect that it already satisfied.)

In [3]:
#install a necessary dependency
!pip install pyfaidx
Collecting pyfaidx
  Downloading https://files.pythonhosted.org/packages/75/a5/7e2569527b3849ea28d79b4f70d7cf46a47d36459bc59e0efa4e10e8c8b2/pyfaidx-0.5.5.2.tar.gz
Requirement already satisfied: six in /srv/conda/lib/python3.7/site-packages (from pyfaidx) (1.12.0)
Requirement already satisfied: setuptools>=0.7 in /srv/conda/lib/python3.7/site-packages (from pyfaidx) (40.8.0)
Building wheels for collected packages: pyfaidx
  Building wheel for pyfaidx (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/54/a2/b4/e242e58d23b2808e191b214067880faa46cd2341f363886e0b
Successfully built pyfaidx
Installing collected packages: pyfaidx
Successfully installed pyfaidx-0.5.5.2

Basic use of the plotting script

This script gets a sequence from a sequence file in FASTA format. It can be either a single sequence or more. You provide an indentifier to specify which sequence in the multiFASTA file to mine. In fact you always need to provide something for the indentifier parameter when calling this script or the main function of it, but that text can be nonseniscal if there is only one sequence in the sequence file. It disregards anything provided if there is only one.
The only other thing necessary is providing start and end positions to specify the subsequence. Positions are to be specified in typical position terms where the first residue is numbered one.

In [4]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/plot_nt_imbalance/nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20587  100 20587    0     0   148k      0 --:--:-- --:--:-- --:--:--  148k

Display USAGE block

In [5]:
!python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py -h
usage:     nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py
       [-h] [-bl BLOCK_SIZE] [-ov OVERLAP_SIZE] [-svg]
       SEQUENCE_FILE CHR START_POS

nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2
016.py makes a plot figure like Figure 8, panel B of Morrill et al 2016 (PMID:
27026700), see
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4882425/figure/F8/. **** Script
by Wayne Decatur (fomightez @ github) ***

positional arguments:
  SEQUENCE_FILE         Name of sequence file to use as input. Must be FASTA
                        format. Can be a multi-FASTA file, i.e., multiple
                        sequences in FASTA format in one file.
  CHR                   Designation of chromosome / contig/ scaffold / region
                        corresponding to that in sequence file. To be used for
                        plot label. No spaces. Feel free to prefix with a
                        species indicator as well .
  START_POS             Coordinate position that corresponds to first position
                        of specified along chromosome. To be used for label of
                        x-axis. No spaces.

optional arguments:
  -h, --help            show this help message and exit
  -bl BLOCK_SIZE, --block_size BLOCK_SIZE
                        OPTIONAL: Use the `--block_size` flag followed by an
                        interger to provide a value to use as the span size
                        (window of basepairs) to analyze instead of the
                        default of '150'.
  -ov OVERLAP_SIZE, --overlap_size OVERLAP_SIZE
                        OPTIONAL: Use the `--overlap_size` flag followed by an
                        integer to specify the amount of overlap to use
                        between the analysis windows instead of the default of
                        '50'.
  -svg, --save_vg       add this flag to save as vector graphics
                        (**RECOMMENDED FOR PUBLICATION***) instead of default
                        png. Not default or saved alongside default because
                        file size can get large due to the large number of
                        points.

Next, we'll use that to plot the sequence data we obtained during 'Preparation'.

Plotting example (illustrating command line-based usage)

This will run it as if it was on the command line. (Note if this was indeed on the command line, the command would be the same with the exception of ! at the start. The exclamation point is a Jupyter/IPython specific signal to run the command following it as if it was on the command line.)

In addition to calling the script, we provide it the name of the sequence file to use and a couple of items to use in labels. The items provided cannot have spaces, and you'll note that those are actually present in the sequence name that happened to be made. However, the script itself doesn't harvest that information so that the script is more general.

In [6]:
!python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png

Because we are in a Jupyter environment we can use code to view that generated image here.

In [7]:
from IPython.display import Image
Image("S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png")
Out[7]:

That process above was meant to better mirror how you'd run it on the command line. If actually were using this script in a Jupyter notebook, you could simply run the command in the next cell with the Jupyter magics %run command and the resulting image will get shown in the notebook without need for additional code like in the last cell. (Note that it make take rerunning again this cell in order for it display the plot.)
Like so:

In [10]:
%run nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png

The size of the window of basepairs analyzed and the amount of overlap of those windows can be changed from the default settings by putting values after the signaling flags of --block_size and --overlap_size, respectively. For example:

In [9]:
%run nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000 --block_size 80 --overlap_size 30
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png

Remember, you may need to run that cell above twice for the result to show up.

You'll note subtle differences from the earlier plots indicating the different settings were used. One particular place to look is the very first section of the plot on the left side.

Automating argument formation but still ultimately using the command line-style call to the script

Because we are in a Jupyter environment, we could have used code to extract the details used in the last two required arguments of the command call, specifically chrIV and 200000, from the file name. I'll illustrate that for this example because it could be handy if processing multiple regions.

In [11]:
#Use fnmatch to make a list of files ending in `.fsa` and the work through them splicing out useful bits to make plot command
import os
import sys
import fnmatch
name_part_to_match = ".fsa"
associated_fns= []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*'+name_part_to_match):
        associated_fns.append(file)
sys.stderr.write("Files ending in `{}` identified.\nReading and plotting each...".format(name_part_to_match))
# Interate over list to process name and fire off commands
for fn in associated_fns:
    chr_name = fn.split("_")[1]
    start_num = fn.split("_")[2].rsplit("-")[0]
    !python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py {fn} {chr_name} {start_num}
    sys.stderr.write(f"Processed assembled command:\n 'python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py {fn} {chr_name} {start_num}' ")
sys.stderr.write("\nEnd of list of files ending in `{}`reached.\nDone.".format(name_part_to_match))
Files ending in `.fsa` identified.
Reading and plotting each...
Plot image saved to: S288C_chrIV_200000-220000_genomic_nt_imblance_plot.png
Processed assembled command:
 'python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000' 
End of list of files ending in `.fsa`reached.
Done.

Although you may wish to further simplify things by importing the main function of this script and provide the details as Python objects that way. This is illustrated at the end of the next section about importing the main function.

The only option other not already illustrated is adding --save_vg (or the abbreviated form -svg) to the commands to produce vector graphics that are very useful for further modification by Inkscape or Adobe Illustrator. The bitmapped 'PNG'-style graphics or more universally viewable and so are used here for convenience; however, they don't scale as well as vector graphics.

Use script in a Jupyter notebook

This will demonstrate importing the main function into a notebook.

In [12]:
# Get the script if the above section wasn't run. Does nothing if it was.
import os
file_needed = "nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py"
if not os.path.isfile(file_needed):
    !curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/plot_nt_imbalance/nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py

Let's make it so the equivalent to the example command worked out above and shown again on the line below runs and the plot is returned directly into the notebook environment as a figure object to display in this notebook.

python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py S288C_chrIV_200000-220000_genomic.fsa chrIV 200000

At the end of this section we'll show how the returned notebook can be further modified via the notebook interface when doing this.

First we need to import the main function from the script file.

In [13]:
from nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016 import nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016

Now to try using that to do what was initially done with the sequence data above. The main difference besides the arrangement of the items is that we should specify return_fig=True.

In [14]:
%matplotlib inline
out = nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016("S288C_chrIV_200000-220000_genomic.fsa", "chrIV", 200000, return_fig=True);
Plot figure object returned.

Note that if we simply had left off return_fig=True, it would have been a perfectly valid statement. It would have been interpreted as set to False and an image file would be produced just like when calling from the command line.

However, returning the figure as a Python object affords us the opportunity to further modify it within the Jupyter environment, even after the function generated it via the code in the python nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016.py script. This will be demonstrate below at the end of this section; however before getting to that novel use of the main function and the Jupyter environment, let's revisit the example above where we scripted capaturing of the elements needed in the call to the script from the input file name. If we stay within all Python it becomes slightly easier to do because we don't need to use the brackets to feed python variables into the shell commands.

Automating argument formation entirely via Python

The next cell recapitulates the approach illustrate at the end of the basic usage on the command line section for how'd you do that in the notebook after importing the main function.

In [15]:
#Use fnmatch to make a list of files ending in `.fsa` and the work through them splicing out useful bits to make plot command
import os
import sys
import fnmatch
name_part_to_match = ".fsa"
associated_fns= []
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*'+name_part_to_match):
        associated_fns.append(file)
sys.stderr.write("Files ending in `{}` identified.\nReading and plotting each...".format(name_part_to_match))
# Interate over list to process name and fire off commands
for fn in associated_fns:
    chr_name = fn.split("_")[1]
    start_num = int(fn.split("_")[2].rsplit("-")[0])
    nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016(fn, chr_name, start_num, return_fig=True)
sys.stderr.write("\nEnd of list of files ending in `{}`reached.\nDone.".format(name_part_to_match))
Files ending in `.fsa` identified.
Reading and plotting each...Plot figure object returned.
End of list of files ending in `.fsa`reached.
Done.

The only difference besides the line calling the function was that we needed to cast start_num to integer.

The size of the window of basepairs analyzed and the degree of overlap of those windows can be changed when the function is used via import as well, you just need to specify the chunk_size and overlap_specified when calling the function. For example:

In [16]:
%matplotlib inline
nucleotide_difference_imbalance_plot_stylized_like_Figure_8_of_Morrill_et_al_2016("S288C_chrIV_200000-220000_genomic.fsa", "CHANGED SETTINGS EXAMPLE", 200000, chunk_size = 80, overlap_specified=30, return_fig=True);
Plot figure object returned.