Notebook

Examples of running the script¶

There are several options to run roughly_score_relationships_to_subject_seq_pairwise_premsa.py or utilize the core function.

This notebook will demonstrate a few ways of using this script:

Running as script file
Running core function of the script after import

Notably, it won't cover running the script after pasting or loading it into a cell. An approach to that is illustrated here and here for different scripts, but those should serve as a good guides combined with what is shown here for using the main function in a notebook with import. There are a copuple of ways to get the script into the cell, namely pasting it in or loading it from github, that are covered there.

(If you are having any problems at all doing any of this because of Python or needed dependencies, such as Bioython, this notebook was developed in the enviromenment launchable by pressing Launch binder badge here. You could always launch that environment and upload this notebook there and things should work.)

Running as script file¶

Similar to how one would run a script from the command line.

Upload the script to the directory where you want to run it. Or upload it to a running Jupyter environment.

(For the sake of this demonstration, I am going to use curl to get the file from github and upload it to the 'local' environment. You of course can use whatever download and upload steps you'd like, such as using a browser and your system's graphical user interface, to place the script in the directory. 'local' is in parentheses because if running this in a Jupyter interface via the Binder system, 'local' would be inside the running enviroment.)

In [1]:

!curl -O  https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/roughly_score_relationships_to_subject_seq_pairwise_premsa.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20938  100 20938    0     0  68649      0 --:--:-- --:--:-- --:--:-- 68875

That command would work on the command line without the exclamation point. The use of the exclamation point signals here to not treat it as Python code and instead target the command to the available command line shell.

THEN AFTER UPLOADED...
If running on the command line then you would enter:

python roughly_score_relationships_to_subject_seq_pairwise_premsa.py my_sequences.fa

Or something similar to that depending on your Python environment and source of data.

Similarly you can do that in the Jupyter environment using either either !python before the script name or using the %run magic command.
The %run magic command is demonstrated in the next cell. If you are in an active Jupyter environment, to run it click on the next cell and type shift-enter or click run on the toolbar above the notebook.

In [2]:

%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py --help

usage: roughly_score_relationships_to_subject_seq_pairwise_premsa.py
       [-h] [-bl BLOCK_LEN] [-dfo DF_OUTPUT] SEQS_FILE

roughly_score_relationships_to_subject_seq_pairwise_premsa.py Takes a file of
multiple sequences in FASTA format and aligns each of them in turn to the
first sequence. However, if the sequence happens to be moderate- or large-
sized (> 5 kb), by default it only samples part of the sequence due to memory
limitations. It scores the alignments and produces a dataframe ranking the
sequences from most similar to most different relative the first one in the
supplied file. The dataframe is saved as a tabular text file when used on the
command line. Optionally, it can also return that dataframe for use inside a
Jupyter notebook. **** Script by Wayne Decatur (fomightez @ github) ***

positional arguments:
  SEQS_FILE             Name of file of sequences (all FASTA-formatted) to
                        compare to the first in that file .

optional arguments:
  -h, --help            show this help message and exit
  -bl BLOCK_LEN, --block_len BLOCK_LEN
                        **FOR ADVANCED USE.*** Allows for setting the sequence
                        blocks compared for long sequences. The default of
                        9141 was worked out for uisng in a session launched
                        form MyBinder.org 2018. You can free to make larger if
                        you have more computational resources. For example,
                        `-bl 15000`; however, the failing condition is just a
                        silent(hanging) state.
  -dfo DF_OUTPUT, --df_output DF_OUTPUT
                        OPTIONAL: Set file name for saving tabular text (tab-
                        delimited) derived from the produced dataframe. If
                        none provided, 'ranked_seqs_df.tsv' will be used. To
                        force no table to be saved, enter `-dfo no_table`
                        without quotes or ticks as output file (ATYPICAL).

The USAGE shown as the output from in the above cell due to running with the script with the --help/-h flag.

For the rest of this section we are going to obtain real data and use that to provide actual arguments to call the script as the USAGE outlines.

First, the next call will get some sequence data and concatenate it into one file. The first file in the resulting file is the one the others will be compared against.

In [3]:

#make a list of the strain designations
yue_et_al_strains = ["S288C","DBVPG6044","DBVPG6765","SK1","Y12",
                     "YPS128","UWOPS034614","CBS432","N44","YPS138",
                     "UFRJ50816","UWOPS919171"]
# Get set of sequences and edit description line to be cleaner
for s in yue_et_al_strains:
    !curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/{s}.mt.genome.fa.gz
    !gunzip -f {s}.mt.genome.fa.gz
    !sed -i "1s/.*/>{s}/" {s}.mt.genome.fa
# concatenate the FASTA-formatted sequences into one FASTA file
seq_files = [s+".mt.genome.fa" for s in yue_et_al_strains]
!cat {" ".join(seq_files)} > seq_files.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   1745      0 --:--:-- --:--:-- --:--:--  1745
100 22109  100 22109    0     0  70862      0 --:--:-- --:--:-- --:--:-- 70862
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   1745      0 --:--:-- --:--:-- --:--:--  1745
100 20363  100 20363    0     0  83114      0 --:--:-- --:--:-- --:--:-- 83114
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2342      0 --:--:-- --:--:-- --:--:--  2342
100 20829  100 20829    0     0  89012      0 --:--:-- --:--:-- --:--:-- 89012
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2342      0 --:--:-- --:--:-- --:--:--  2342
100 21373  100 21373    0     0  98493      0 --:--:-- --:--:-- --:--:-- 98493
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2342      0 --:--:-- --:--:-- --:--:--  2342
100 21190  100 21190    0     0  96757      0 --:--:-- --:--:-- --:--:-- 96757
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   1390      0 --:--:-- --:--:-- --:--:--  1390
100 19885  100 19885    0     0  73376      0 --:--:-- --:--:-- --:--:--  404k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   1745      0 --:--:-- --:--:-- --:--:--  1745
100 18809  100 18809    0     0  72621      0 --:--:-- --:--:-- --:--:-- 72621
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2342      0 --:--:-- --:--:-- --:--:--  2342
100 18692  100 18692    0     0  62306      0 --:--:-- --:--:-- --:--:-- 62306
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   1328      0 --:--:-- --:--:-- --:--:--  1328
100 18399  100 18399    0     0  63226      0 --:--:-- --:--:-- --:--:-- 63226
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   3016      0 --:--:-- --:--:-- --:--:--  3016
100 18509  100 18509    0     0  69322      0 --:--:-- --:--:-- --:--:-- 69322
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   1745      0 --:--:-- --:--:-- --:--:--  1745
100 18886  100 18886    0     0  77720      0 --:--:-- --:--:-- --:--:-- 77720
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   178  100   178    0     0   2342      0 --:--:-- --:--:-- --:--:--  2342
100 17877  100 17877    0     0  62947      0 --:--:-- --:--:-- --:--:-- 62947

With some data retrieved, you are ready to run the script. Running the next cell will compare each sequence after the first one in the file in turn to the first sequence and score the similarity.

In [4]:

%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py seq_files.fa

Sequences read in...
Longest sequence in input detected as 85793.
...calculating scores of pairwise alignments...
Anticipated memory issues with long sequence and
so only block of 9141 bps from the start compared.
...
Super long sequence detected: Comparing a 2nd block back from the
sequence 'end' as well and combining scores.
...summarizing scores...
Results converted to a dataframe...

A table of the data has been saved as a text file (tab-delimited).
DATA is stored as ==> 'ranked_seqs_df.tsv'

When run from the command line, which is essentially what the %run command does here, the output is saved as a file with the results as tabular text automatically. Open ranked_seqs_df.tsv or what you opted to name it and review the results. Those with the highest score, being most similar, will be towards the top.

(Note that DBVPG676 comes out as more different from S228C than expected here. This probably is just due to a sampling issue as the 'start' in these sequences happens to be the most different part as shown in the Dot Matrix View plot generated from BLAST comparision of these two sequences. When I have been careful to adjust the sequence so the 'start' sites are what the SGD reference uses, it comes out as the most similar to S288C. This actually highlights the fact that not having the memory to handle sampling all the sequence in large sequences can end up being misleading and to only consider this as a quick-an-dirty approach to get information about which sequences are similar and which ones are more different. Follow this up by an multiple sequence alignment ASAP.)

Running the script with import of the main function¶

This is similar to 'Running core function of the script after loading into a cell', but here we take advantage of Python's import statement to do what was previously handled by pasting or loading code into a cell and running it.

First insure the script is available where you are running. Running the next command will do that here.

In [5]:

!curl -O  https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/roughly_score_relationships_to_subject_seq_pairwise_premsa.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20938  100 20938    0     0  61946      0 --:--:-- --:--:-- --:--:-- 61764

Then import the main function of the script to the notebook's active computational environment via an import statement.

In [1]:

from roughly_score_relationships_to_subject_seq_pairwise_premsa import roughly_score_relationships_to_subject_seq_pairwise_premsa

(As written above the command to do that looks a bit redundant; however, the first from part of the command below actually is referencing the find_sequence_element_occurrences_in_sequence script, but it doesn't need the .py extension because the import only deals with such files.)

With the main function imported, it is now available to be run. We'll use the data retrieved above. (Run the third cell in the botebook if you haven't already before running the next cell.)

This time, by default, though the results will also be returned as a dataframe that can be viewed directly in the notebook and subsequently utilized.

In [2]:

df = roughly_score_relationships_to_subject_seq_pairwise_premsa("seq_files.fa")
df

Sequences read in...
Longest sequence in input detected as 85793.
...calculating scores of pairwise alignments...
Anticipated memory issues with long sequence and
so only block of 9141 bps from the start compared.
...
Super long sequence detected: Comparing a 2nd block back from the
sequence 'end' as well and combining scores.
...summarizing scores...
Results converted to a dataframe...

A table of the data has been saved as a text file (tab-delimited).
DATA is stored as ==> 'ranked_seqs_df.tsv'

Returning a dataframe with the information as well.

Out[2]:

	id	score_vs_S288C
3	Y12	14129.5
4	YPS128	12898.5
2	SK1	12345.0
5	UWOPS034614	12197.5
6	CBS432	10283.0
0	DBVPG6044	9973.0
10	UWOPS919171	9835.0
9	UFRJ50816	9771.0
8	YPS138	8994.5
7	N44	8834.5
1	DBVPG6765	8527.0

In [3]:

df.describe()

Out[3]:

	score_vs_S288C
count	11.000000
mean	10708.045455
std	1869.949872
min	8527.000000
25%	9382.750000
50%	9973.000000
75%	12271.250000
max	14129.500000

Troubleshooting¶

I have found on limited computational resources that this script will just stop working silently and hang. I have made an effort to set this to work on Jupyter sessions launched from MyBinder.org. For example, I suggest going here, pressing launch binder and uploading this notebook there to run it actively. It will most likely not hang unless they have lowered the limits or things are overly busy.

If you find it isn't working, I suggest you insure it is the memory issue by lowering the block_len (a.k.a length of the sequence block to examine) option to something tiny. The data it produces won't be useful but it should tell you if that is why your script is working or not.

How to do that depends if you are using it in an Jupyter notebook cell or at what is equivalent to the command line. This section will demonstrate the two approaches.

First on the command line, you can set the block_len like so:

In [4]:

%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py seq_files.fa --block_len 200

Sequences read in...
Longest sequence in input detected as 85793.
...calculating scores of pairwise alignments...
Anticipated memory issues with long sequence and
so only block of 200 bps from the start compared.
...
Super long sequence detected: Comparing a 2nd block back from the
sequence 'end' as well and combining scores.
...summarizing scores...
Results converted to a dataframe...

A table of the data has been saved as a text file (tab-delimited).
DATA is stored as ==> 'ranked_seqs_df.tsv'

That should run instantly and generate a file of results.

Here is how to set the block_len when working inside the notebook.

In [5]:

from roughly_score_relationships_to_subject_seq_pairwise_premsa import roughly_score_relationships_to_subject_seq_pairwise_premsa
df = roughly_score_relationships_to_subject_seq_pairwise_premsa("seq_files.fa", block_len=200)
df

Sequences read in...
Longest sequence in input detected as 85793.
...calculating scores of pairwise alignments...
Anticipated memory issues with long sequence and
so only block of 200 bps from the start compared.
...
Super long sequence detected: Comparing a 2nd block back from the
sequence 'end' as well and combining scores.
...summarizing scores...
Results converted to a dataframe...

A table of the data has been saved as a text file (tab-delimited).
DATA is stored as ==> 'ranked_seqs_df.tsv'

Returning a dataframe with the information as well.

Out[5]:

	id	score_vs_S288C
4	YPS128	360.5
5	UWOPS034614	356.0
3	Y12	351.5
0	DBVPG6044	334.5
2	SK1	334.5
8	YPS138	324.0
9	UFRJ50816	324.0
10	UWOPS919171	316.0
1	DBVPG6765	315.0
6	CBS432	313.0
7	N44	310.0

Remember, the data this produces won't be informative. It is just to make sure something more fundamental isn't causing you a problem. If you get it working with a tiny value then you can attempt to use more reasonable sizes.