There are several options to run roughly_score_relationships_to_subject_seq_pairwise_premsa.py
or utilize the core function.
This notebook will demonstrate a few ways of using this script:
Notably, it won't cover running the script after pasting or loading it into a cell. An approach to that is illustrated here and here for different scripts, but those should serve as a good guides combined with what is shown here for using the main function in a notebook with import. There are a copuple of ways to get the script into the cell, namely pasting it in or loading it from github, that are covered there.
(If you are having any problems at all doing any of this because of Python or needed dependencies, such as Bioython, this notebook was developed in the enviromenment launchable by pressing Launch binder
badge here. You could always launch that environment and upload this notebook there and things should work.)
Similar to how one would run a script from the command line.
Upload the script to the directory where you want to run it. Or upload it to a running Jupyter environment.
(For the sake of this demonstration, I am going to use curl
to get the file from github and upload it to the 'local' environment. You of course can use whatever download and upload steps you'd like, such as using a browser and your system's graphical user interface, to place the script in the directory. 'local' is in parentheses because if running this in a Jupyter interface via the Binder system, 'local' would be inside the running enviroment.)
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/roughly_score_relationships_to_subject_seq_pairwise_premsa.py
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 20938 100 20938 0 0 68649 0 --:--:-- --:--:-- --:--:-- 68875
That command would work on the command line without the exclamation point. The use of the exclamation point signals here to not treat it as Python code and instead target the command to the available command line shell.
THEN AFTER UPLOADED...
If running on the command line then you would enter:
python roughly_score_relationships_to_subject_seq_pairwise_premsa.py my_sequences.fa
Or something similar to that depending on your Python environment and source of data.
Similarly you can do that in the Jupyter environment using either either !python
before the script name or using the %run
magic command.
The %run
magic command is demonstrated in the next cell. If you are in an active Jupyter environment, to run it click on the next cell and type shift-enter
or click run on the toolbar above the notebook.
%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py --help
usage: roughly_score_relationships_to_subject_seq_pairwise_premsa.py [-h] [-bl BLOCK_LEN] [-dfo DF_OUTPUT] SEQS_FILE roughly_score_relationships_to_subject_seq_pairwise_premsa.py Takes a file of multiple sequences in FASTA format and aligns each of them in turn to the first sequence. However, if the sequence happens to be moderate- or large- sized (> 5 kb), by default it only samples part of the sequence due to memory limitations. It scores the alignments and produces a dataframe ranking the sequences from most similar to most different relative the first one in the supplied file. The dataframe is saved as a tabular text file when used on the command line. Optionally, it can also return that dataframe for use inside a Jupyter notebook. **** Script by Wayne Decatur (fomightez @ github) *** positional arguments: SEQS_FILE Name of file of sequences (all FASTA-formatted) to compare to the first in that file . optional arguments: -h, --help show this help message and exit -bl BLOCK_LEN, --block_len BLOCK_LEN **FOR ADVANCED USE.*** Allows for setting the sequence blocks compared for long sequences. The default of 9141 was worked out for uisng in a session launched form MyBinder.org 2018. You can free to make larger if you have more computational resources. For example, `-bl 15000`; however, the failing condition is just a silent(hanging) state. -dfo DF_OUTPUT, --df_output DF_OUTPUT OPTIONAL: Set file name for saving tabular text (tab- delimited) derived from the produced dataframe. If none provided, 'ranked_seqs_df.tsv' will be used. To force no table to be saved, enter `-dfo no_table` without quotes or ticks as output file (ATYPICAL).
The USAGE
shown as the output from in the above cell due to running with the script with the --help/-h
flag.
For the rest of this section we are going to obtain real data and use that to provide actual arguments to call the script as the USAGE
outlines.
First, the next call will get some sequence data and concatenate it into one file. The first file in the resulting file is the one the others will be compared against.
#make a list of the strain designations
yue_et_al_strains = ["S288C","DBVPG6044","DBVPG6765","SK1","Y12",
"YPS128","UWOPS034614","CBS432","N44","YPS138",
"UFRJ50816","UWOPS919171"]
# Get set of sequences and edit description line to be cleaner
for s in yue_et_al_strains:
!curl -LO http://yjx1217.github.io/Yeast_PacBio_2016/data/Mitochondrial_Genome/{s}.mt.genome.fa.gz
!gunzip -f {s}.mt.genome.fa.gz
!sed -i "1s/.*/>{s}/" {s}.mt.genome.fa
# concatenate the FASTA-formatted sequences into one FASTA file
seq_files = [s+".mt.genome.fa" for s in yue_et_al_strains]
!cat {" ".join(seq_files)} > seq_files.fa
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745 100 22109 100 22109 0 0 70862 0 --:--:-- --:--:-- --:--:-- 70862 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745 100 20363 100 20363 0 0 83114 0 --:--:-- --:--:-- --:--:-- 83114 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342 100 20829 100 20829 0 0 89012 0 --:--:-- --:--:-- --:--:-- 89012 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342 100 21373 100 21373 0 0 98493 0 --:--:-- --:--:-- --:--:-- 98493 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342 100 21190 100 21190 0 0 96757 0 --:--:-- --:--:-- --:--:-- 96757 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 1390 0 --:--:-- --:--:-- --:--:-- 1390 100 19885 100 19885 0 0 73376 0 --:--:-- --:--:-- --:--:-- 404k % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745 100 18809 100 18809 0 0 72621 0 --:--:-- --:--:-- --:--:-- 72621 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342 100 18692 100 18692 0 0 62306 0 --:--:-- --:--:-- --:--:-- 62306 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 1328 0 --:--:-- --:--:-- --:--:-- 1328 100 18399 100 18399 0 0 63226 0 --:--:-- --:--:-- --:--:-- 63226 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 3016 0 --:--:-- --:--:-- --:--:-- 3016 100 18509 100 18509 0 0 69322 0 --:--:-- --:--:-- --:--:-- 69322 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 1745 0 --:--:-- --:--:-- --:--:-- 1745 100 18886 100 18886 0 0 77720 0 --:--:-- --:--:-- --:--:-- 77720 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 178 100 178 0 0 2342 0 --:--:-- --:--:-- --:--:-- 2342 100 17877 100 17877 0 0 62947 0 --:--:-- --:--:-- --:--:-- 62947
With some data retrieved, you are ready to run the script. Running the next cell will compare each sequence after the first one in the file in turn to the first sequence and score the similarity.
%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py seq_files.fa
Sequences read in... Longest sequence in input detected as 85793. ...calculating scores of pairwise alignments... Anticipated memory issues with long sequence and so only block of 9141 bps from the start compared. ... Super long sequence detected: Comparing a 2nd block back from the sequence 'end' as well and combining scores. ...summarizing scores... Results converted to a dataframe... A table of the data has been saved as a text file (tab-delimited). DATA is stored as ==> 'ranked_seqs_df.tsv'
When run from the command line, which is essentially what the %run
command does here, the output is saved as a file with the results as tabular text automatically.
Open ranked_seqs_df.tsv
or what you opted to name it and review the results. Those with the highest score, being most similar, will be towards the top.
(Note that DBVPG676
comes out as more different from S228C
than expected here. This probably is just due to a sampling issue as the 'start' in these sequences happens to be the most different part as shown in the Dot Matrix View plot generated from BLAST comparision of these two sequences. When I have been careful to adjust the sequence so the 'start' sites are what the SGD reference uses, it comes out as the most similar to S288C
. This actually highlights the fact that not having the memory to handle sampling all the sequence in large sequences can end up being misleading and to only consider this as a quick-an-dirty approach to get information about which sequences are similar and which ones are more different. Follow this up by an multiple sequence alignment ASAP.)
This is similar to 'Running core function of the script after loading into a cell', but here we take advantage of Python's import statement to do what was previously handled by pasting or loading code into a cell and running it.
First insure the script is available where you are running. Running the next command will do that here.
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/alignment-utilities/roughly_score_relationships_to_subject_seq_pairwise_premsa.py
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 20938 100 20938 0 0 61946 0 --:--:-- --:--:-- --:--:-- 61764
Then import the main function of the script to the notebook's active computational environment via an import statement.
from roughly_score_relationships_to_subject_seq_pairwise_premsa import roughly_score_relationships_to_subject_seq_pairwise_premsa
(As written above the command to do that looks a bit redundant; however, the first from
part of the command below actually is referencing the find_sequence_element_occurrences_in_sequence
script, but it doesn't need the .py
extension because the import
only deals with such files.)
With the main function imported, it is now available to be run. We'll use the data retrieved above. (Run the third cell in the botebook if you haven't already before running the next cell.)
This time, by default, though the results will also be returned as a dataframe that can be viewed directly in the notebook and subsequently utilized.
df = roughly_score_relationships_to_subject_seq_pairwise_premsa("seq_files.fa")
df
Sequences read in... Longest sequence in input detected as 85793. ...calculating scores of pairwise alignments... Anticipated memory issues with long sequence and so only block of 9141 bps from the start compared. ... Super long sequence detected: Comparing a 2nd block back from the sequence 'end' as well and combining scores. ...summarizing scores... Results converted to a dataframe... A table of the data has been saved as a text file (tab-delimited). DATA is stored as ==> 'ranked_seqs_df.tsv' Returning a dataframe with the information as well.
id | score_vs_S288C | |
---|---|---|
3 | Y12 | 14129.5 |
4 | YPS128 | 12898.5 |
2 | SK1 | 12345.0 |
5 | UWOPS034614 | 12197.5 |
6 | CBS432 | 10283.0 |
0 | DBVPG6044 | 9973.0 |
10 | UWOPS919171 | 9835.0 |
9 | UFRJ50816 | 9771.0 |
8 | YPS138 | 8994.5 |
7 | N44 | 8834.5 |
1 | DBVPG6765 | 8527.0 |
df.describe()
score_vs_S288C | |
---|---|
count | 11.000000 |
mean | 10708.045455 |
std | 1869.949872 |
min | 8527.000000 |
25% | 9382.750000 |
50% | 9973.000000 |
75% | 12271.250000 |
max | 14129.500000 |
I have found on limited computational resources that this script will just stop working silently and hang. I have made an effort to set this to work on Jupyter sessions launched from MyBinder.org. For example, I suggest going here, pressing launch binder
and uploading this notebook there to run it actively. It will most likely not hang unless they have lowered the limits or things are overly busy.
If you find it isn't working, I suggest you insure it is the memory issue by lowering the block_len
(a.k.a length of the sequence block to examine
) option to something tiny. The data it produces won't be useful but it should tell you if that is why your script is working or not.
How to do that depends if you are using it in an Jupyter notebook cell or at what is equivalent to the command line. This section will demonstrate the two approaches.
First on the command line, you can set the block_len
like so:
%run roughly_score_relationships_to_subject_seq_pairwise_premsa.py seq_files.fa --block_len 200
Sequences read in... Longest sequence in input detected as 85793. ...calculating scores of pairwise alignments... Anticipated memory issues with long sequence and so only block of 200 bps from the start compared. ... Super long sequence detected: Comparing a 2nd block back from the sequence 'end' as well and combining scores. ...summarizing scores... Results converted to a dataframe... A table of the data has been saved as a text file (tab-delimited). DATA is stored as ==> 'ranked_seqs_df.tsv'
That should run instantly and generate a file of results.
Here is how to set the block_len
when working inside the notebook.
from roughly_score_relationships_to_subject_seq_pairwise_premsa import roughly_score_relationships_to_subject_seq_pairwise_premsa
df = roughly_score_relationships_to_subject_seq_pairwise_premsa("seq_files.fa", block_len=200)
df
Sequences read in... Longest sequence in input detected as 85793. ...calculating scores of pairwise alignments... Anticipated memory issues with long sequence and so only block of 200 bps from the start compared. ... Super long sequence detected: Comparing a 2nd block back from the sequence 'end' as well and combining scores. ...summarizing scores... Results converted to a dataframe... A table of the data has been saved as a text file (tab-delimited). DATA is stored as ==> 'ranked_seqs_df.tsv' Returning a dataframe with the information as well.
id | score_vs_S288C | |
---|---|---|
4 | YPS128 | 360.5 |
5 | UWOPS034614 | 356.0 |
3 | Y12 | 351.5 |
0 | DBVPG6044 | 334.5 |
2 | SK1 | 334.5 |
8 | YPS138 | 324.0 |
9 | UFRJ50816 | 324.0 |
10 | UWOPS919171 | 316.0 |
1 | DBVPG6765 | 315.0 |
6 | CBS432 | 313.0 |
7 | N44 | 310.0 |
Remember, the data this produces won't be informative. It is just to make sure something more fundamental isn't causing you a problem. If you get it working with a tiny value then you can attempt to use more reasonable sizes.