Examples of running the script

There are several options to run find_sequence_element_occurrences_in_sequence or utilize the core function.

This notebook will demonstrate a few ways of using this script:

Notably, it won't cover running the script after upload. An approach to that is illustrated here for a different script, but that should serve as a good guide combined with getting the USAGE info by running it with the --help/-h flag.

(If you are having any problems at all doing any of this because of Python or needed dependencies, such as Bioython, this notebook was developed in the enviromenment launchable by pressing Launch binder badge here. You could always launch that environment and upload this notebook there and things should work.)

It will also demonstrate a more complex use of the core function to process multiple searches for sequence elements in a manner reminiscent of a mini-pipeline.

Running as script file

Similar to how one would run a script from the command line.

Upload the script to the directory where you want to run it. Or upload it to a running Jupyter environment.

A similar approach to the latter is illustrated here for a different script but that should serve as a good guide combined with getting the USAGE info by running it with the --help/-h flag.


Running core function of the script after loading into a cell

It can be pasted into a cell or loaded from github. Those will be demonstrated in this section of the notebook.

First part of this section will cover pasting script into a cell.
In the next cell is the script (although it might not be the most up-to-date version, and so it would be best to get and paste the most-up-to-date version from Github or use the %loadapproach to fetch it from Github directly into a cell, as illustrated here)

In [1]:
%pip install regex
%pip install openpyxl
Requirement already satisfied: regex in /srv/conda/envs/notebook/lib/python3.7/site-packages (2021.11.10)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: openpyxl in /srv/conda/envs/notebook/lib/python3.7/site-packages (3.0.9)
Requirement already satisfied: et-xmlfile in /srv/conda/envs/notebook/lib/python3.7/site-packages (from openpyxl) (1.1.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
#!/usr/bin/env python
# find_sequence_element_occurrences_in_sequence.py
__author__ = "Wayne Decatur" #fomightez on GitHub
__license__ = "MIT"
__version__ = "0.1.0"


# find_sequence_element_occurrences_in_sequence.py by Wayne Decatur
# ver 0.1
#
#*******************************************************************************
# Verified compatible with both Python 2.7 and Python 3.6; written initially in 
# Python 3.
#
# PURPOSE: Takes a short sequence represented as a string and a FASTA-formatted
# sequence (either a file or URL to a file) and makes an accounting of the 
# occurrences of that short sequence element on both strands of the main 
# sequence. Exact matches only are counted as occurrences.
# The script is case-insensitive, meaning the case of either the element or 
# FASTA sequence provided do not matter and matches will be noted anyway.
#
# As written, the script expects and only processes one FASTA-formatted 
# sequence. If your, FASTA file has more than one sequence entry within it and 
# the one you want to have scanned is not the first, copy and paste it to a new 
# file and use that as the sequence to scan.
#
# Written to run from command line or pasted/loaded inside a Jupyter notebook 
# cell or imported. 
#
#
#
# This script based on work and musings developed in 
# `Resources for writing code to do GC cluster accounting.md`
#
# The script at https://www.biostars.org/p/209383/ (steve's answer) served as 
# the backbone for adding convenience and user-friendly features.
#
#
#
# Dependencies beyond the mostly standard libraries/modules:
# - Biopython
#
#
# VERSION HISTORY:
# v.0.1. basic working version
#
# To do:
# - 
#
#
#
# TO RUN:
# Examples,
# Enter on the command line of your terminal, the line
#-----------------------------------
# python find_sequence_element_occurrences_in_sequence.py GAATTC sequences.fasta

#-OR-
# python find_sequence_element_occurrences_in_sequence.py GAATTC sequences.fasta URL
#-----------------------------------
# Issue `python find_sequence_element_occurrences_in_sequence.py -h` for details.
# 
#
# To use this after pasting or loading into a cell in a Jupyter notebook, in
# the next cell define critical variables and then call the main function 
# similar to below:
# source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
# element = "GAATTC"
# id_of_seq_element = "ele1" #Set to `None` without quotes or backticks to have defined automatically
# find_sequence_element_occurrences_in_sequence(source, element, id_of_seq_element, return_dataframe = True)
#
# Something similar would need to be done if importing the script into another.
# (`id_of_seq_scanned_hardcoded` can be assigned in a cell before calling the 
# function as well.)
#
# 
#
'''
CURRENT ACTUAL CODE FOR RUNNING/TESTING IN A NOTEBOOK WHEN LOADED OR PASTED IN 
ANOTHER CELL:
source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
element = "GAATTC"
id_of_seq_element = "EcoRI" #Set to `None` without quotes or backticks to have defined automatically
df = find_sequence_element_occurrences_in_sequence(source, element, id_of_seq_element, return_dataframe = True)


'''
#
#
#*******************************************************************************
#







#*******************************************************************************
##################################
#  USER ADJUSTABLE VALUES        #

##################################
#
output_file_name_prefix = "seq_account"
id_of_seq_scanned_hardcoded = None # replace `None` with what you want to use,
# with flanking quotes if something appropriate is not being extracted from the
# provided filepath/filename or URL to save as an indicator of the file scanned
# in the output file name. 



limit_of_name = 17 # number of bases of the sequence element to limit to using 
# if the sequence element sequence is used to make the name for the output file





#
#*******************************************************************************
#**********************END USER ADJUSTABLE VARIABLES****************************


















#*******************************************************************************
#*******************************************************************************
###DO NOT EDIT BELOW HERE - ENTER VALUES ABOVE###

import sys
import os
import regex
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq # for reverse complement




###---------------------------HELPER FUNCTIONS---------------------------------###


def get_seq_element_representation(id_of_seq_element):
    '''
    Takes `id_of_seq_element` and returns a string represenation for using
    in the output file name and the dataframe.
    '''
    if id_of_seq_element:
        elem_id = id_of_seq_element
    elif len(element) > limit_of_name:
        elem_id = element[:limit_of_name]+"..."
    else:
        elem_id = element
    return elem_id



def generate_output_file_name(id_of_seq_element,id_of_seq_scanned):
    '''
    Takes a file name as an argument and returns string for the name of the
    output file. The generated name is based on a prefix that can be adjusted
    under the section ' USER ADJUSTABLE VALUES ', plus the provided text in 
    the function call.
    If there is no `id_of_seq_element` specified, the sequence element will be
    used, limited to the first number of bases specified in `limit_of_name`.

    Specific examples
    =================
    Calling function with
        ("elem1","chrmt")
    returns
        "seq_account_elem1_chrmt.tsv"

    Calling function with
        (None,"chrmt")
    returns
        "seq_account_GAATTC_chrmt.tsv"
    if `GAATTC` happened to be the provided sequence element.
    '''
    elem_id = get_seq_element_representation(id_of_seq_element)
    if elem_id.endswith("...") and (id_of_seq_element == None):
        # Because in order to use `get_seq_element_representation()` in all the 
        # places something similar is needed, I have it adding `...` if the
        # sequence of the element exceeds the limit size. Since I don't want
        # that in the file name I am going to remove if it seems I had added it,
        # and I will only have possibly added it if `id_of_seq_element` is None.
        # This extra check is mitigate chances I'll remove `...` in unlikely
        # possibility user wanted to include it in id for sequence element. 
        elem_id = elem_id[:-3] + "-" #add hyphen to indicate there is more 
    return "{prefix}_{elem_id}_{seqid}.tsv".format(
        prefix=output_file_name_prefix,elem_id=elem_id,
        seqid=id_of_seq_scanned)


def extract_id_of_seq_scanned(source):
    '''
    Take something like:
    https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa

    -or- 

    /local/directory1/directory2/chrmt.fa

    -or-

    chrmt.fa

    And return:
    chrmt
    '''
    if "/" in source:
        last_bit = source.split("/")[-1]
    else:
        last_bit = source
    if '.' in last_bit:
        main_part_of_name, file_extension = os.path.splitext(
        last_bit) #from http://stackoverflow.com/questions/541390/extracting-extension-from-filename-in-python
        return main_part_of_name
    else:
        return last_bit


def get_seq_from_URL(url):
    '''
    takes a URL and gets the sequence
    '''
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO

    # Getting html originally for just Python 3, adapted from 
    # https://stackoverflow.com/a/17510727/8508004 and then updated from to 
    # handle Python 2 and 3 according to same link.
    try:
        # For Python 3.0 and later
        from urllib.request import urlopen
    except ImportError:
        # Fall back to Python 2's urllib2
        from urllib2 import urlopen
    html = urlopen(url)
    fasta_iterator = SeqIO.parse(StringIO(
        html.read().decode(encoding='UTF-8')), "fasta")
    # Use of `next()` on next line to get first FASTA -formatted sequence is 
    # based on http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
    # I think difference from `SeqIO.read()` in this approach is that it won't
    # give an error if more than one entry is in the html.
    # I found I needed `StringIO()` or got issues with trying to handle long file name.
    record = next(fasta_iterator)
    return record.seq

def get_fasta_seq(source):
    '''
    Takes a source URL or filepath/ file name and gets sequence if it is in
    FASTA format.
    It won't return anything if what is provided isn't FASTA format because
    Biopython handles both trying to extract FASTA from URL and from file.
    See https://stackoverflow.com/a/44294079/8508004.
    Placing it in a function it easy to then check and provide some feedback.
    '''
    if source.lower().startswith("http"):
        return get_seq_from_URL(source)
    else:
        # Read sequence, treating source as a filepath.
        # Use of `with` on next line based on http://biopython.org/wiki/SeqIO , 
        # under "Sequence Input". Otherwise, backbone based on 
        # https://www.biostars.org/p/209383/, and fact `rU` mode depecated.
        with open(source, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                # print(record.seq) # for debugging
                return record.seq


def search_strand(pattern, sequence_to_scan, strand=1):
    '''
    take a sequence pattern (element) and find occurrences of that on the 
    provided, larger 5'-->3' sequence.
    Assumes strand is first unless provided.

    Tracks the start and end points of each occurrence, returning a list of
    that information where each element is a tuple of the start and end points
    along with the strand.

    Works with overlapped sequences because now 
    "regex.findall and regex.finditer support an ‘overlapped’ flag which 
    permits overlapped matches."
    , see https://pypi.python.org/pypi/regex/2018.02.21

    based on https://www.biostars.org/p/209383/ (specifically steve's answer)
    '''
    occurrences = []
    for match in regex.finditer(
        pattern.upper(), str(sequence_to_scan.upper()),overlapped=True):
        if strand == 1:
            start_pos = match.start() + 1
            end_pos = match.end() + 1
        else:
            start_pos = (len(sequence_to_scan) - match.start() ) + 1
            end_pos = (len(sequence_to_scan) - match.end() ) + 1
        # print (start_pos, '\t', end_pos, '\t',strand) # for debugging
        occurrences.append((start_pos, end_pos,strand))
    return occurrences

###--------------------------END OF HELPER FUNCTIONS---------------------------###
###--------------------------END OF HELPER FUNCTIONS---------------------------###




#*******************************************************************************
###------------------------'main' function of script---------------------------##
def find_sequence_element_occurrences_in_sequence(
    source, element, id_of_seq_element, return_dataframe = False):
    '''
    Main function of script. Scan a sequence (source) and report on occurrences 
    of a sub-sequence element in that sequence.

    Returns None
    Unless `return_dataframe = True`, and then it returns a dataframe of 
    accounting as well. That option being meant when using this script in a cell
    in a Jupyter notebook or importing it into another script.
    '''
    # get the fasta_seq to scan
    fasta_seq = get_fasta_seq(source)


    assert fasta_seq, (
    "The provided source of the FASTA-formatted sequence seems invalid. Is it "
    "FASTA format?")

    # With the approach in this next block, I can expose `id_of_seq_scanned` to 
    # setting for advanced use without it being required and without need to be 
    # passed into the function.
    if id_of_seq_scanned_hardcoded:
        id_of_seq_scanned = id_of_seq_scanned_hardcoded
    else:
        id_of_seq_scanned = extract_id_of_seq_scanned(source)
        


    #assert that element cannot be longer than fasta_seq
    assert len(element) < len(fasta_seq), (
    "the FASTA sequence has to be longer than the provided sequence element.\n"
    "The provided FASTA sequence, {0}, is {1} bases;\nthe provided sequence "
    "element '{2}' is {3} bases.".format(
        id_of_seq_scanned,len(fasta.seq),element,len(element)))

    occurrences = search_strand(element, fasta_seq) # first strand
    occurrences += search_strand(
        element, fasta_seq.reverse_complement(), strand=-1) #2nd strand
    if occurrences:
        # make it into a dataframe since provides convenient options for 
        # handling
        df = pd.DataFrame(occurrences, columns=['start_pos','end_pos','strand'])
        # add some useful information to the dataframe
        df["seq. element"] = get_seq_element_representation(id_of_seq_element)
        df = df[['seq. element','start_pos','end_pos','strand']] # 'seq.element'
        # column will show on far right otherwise

        #print(updated_sites_df)#originally for debugging during development,added..
        # Document the full set of data collected in the terminal or 
        # Jupyter notebook display in some manner. 
        # Using `df.to_string()` because more universal than `print(df)` 
        # or Jupyter's `display(df)`.
        sys.stderr.write( "\nFor documenting purposes, the following lists the "
            "collected occurences:\n")
        #with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        #    display(df)
        sys.stderr.write(df.to_string())

        # write to tab-delimited file
        output_file_name = generate_output_file_name(
            id_of_seq_element,id_of_seq_scanned )
        df.to_csv(output_file_name, sep='\t',index = False)
        sys.stderr.write( "\nThe information on the {0} occurrences of '{1}' "
                "has\nbeen saved as a file named"
                " '{2}'.".format(len(df),element,output_file_name))

        if return_dataframe:
            sys.stderr.write( "\n\nReturning a dataframe with the information "
                "as well.")
            return df
    else:
        sys.stderr.write( "\nNo occurrences of '{0}' "
                "found in the provided sequence.".format(element))
        if return_dataframe:
            sys.stderr.write( "\n\nNo data to return in a dataframe and so "
                "returning `None`.")
            return None


###--------------------------END OF MAIN FUNCTION----------------------------###
###--------------------------END OF MAIN FUNCTION----------------------------###


















#*******************************************************************************
###------------------------'main' section of script---------------------------##

def main():
    """ Main entry point of the script """
    # placing actual main action in a 'helper' script so can call that easily 
    # with a distinguishing name in Jupyter notebooks, where `main()` may get
    # assigned multiple times depending how many scripts imported/pasted in.
    find_sequence_element_occurrences_in_sequence(
        source, element, id_of_seq_element)

        

if __name__ == "__main__" and '__file__' in globals():
    """ This is executed when run from the command line """
    # Code with just `if __name__ == "__main__":` alone will be run if pasted
    # into a notebook. The addition of ` and '__file__' in globals()` is based
    # on https://stackoverflow.com/a/22923872/8508004
    # See also https://stackoverflow.com/a/22424821/8508004 for an option to 
    # provide arguments when prototyping a full script in the notebook.
    ###-----------------for parsing command line arguments-----------------------###
    import argparse
    parser = argparse.ArgumentParser(prog='find_sequence_element_occurrences_in_sequence.py',
        description="find_sequence_element_occurrences_in_sequence.py takes \
        a short sequence represented as a string and a source FASTA-formatted \
        sequence (either a file or URL to a file) and makes an accounting of \
        the occurrences of that short sequence element (pattern) on both \
        strands of the main sequence. Exact matches only are counted as \
        occurrences. Matching is case-insensitive.    \
        **** Script by Wayne Decatur   \
        (fomightez @ github) ***")

    parser.add_argument("element", help="Sequence of element to search for. \
        For example, to search for an EcoRI site, provided `GAATTC`, without \
        any quotes or backticks, in the call to the script.\
        ", metavar="ELEMENT")
    parser.add_argument("source", help="Filepath or URL of FASTA-formatted \
        sequence to scan for occurrences of the sequence element. \
        ", metavar="SOURCE")
    parser.add_argument('-id', '--id_of_seq_element', action='store', type=str, 
        help="**OPTIONAL**Identifier \
        to reference sequence element. If none is provided, the sequence of \
        the user provided element will be used, limited to first {} bases if \
        exceeds that length.".format(limit_of_name))





    #I would also like trigger help to display if no arguments provided because 
    # need at least one for url
    if len(sys.argv)==1:    #from http://stackoverflow.com/questions/4042452/display-help-message-with-python-argparse-when-script-is-called-without-any-argu
        parser.print_help()
        sys.exit(1)
    args = parser.parse_args()
    element = args.element
    source= args.source
    id_of_seq_element= args.id_of_seq_element



    main()

#*******************************************************************************
###-***********************END MAIN PORTION OF SCRIPT***********************-###
#*******************************************************************************
In [3]:
source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
element = "GAATTC"
id_of_seq_element = "EcoRI" #Set to `None` without quotes or backticks to have defined automatically
df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)
For documenting purposes, the following lists the collected occurences:
   seq. element  start_pos  end_pos  strand
0         EcoRI      16724    16730       1
1         EcoRI      17671    17677       1
2         EcoRI      17760    17766       1
3         EcoRI      26412    26418       1
4         EcoRI      26639    26645       1
5         EcoRI      29183    29189       1
6         EcoRI      40987    40993       1
7         EcoRI      44706    44712       1
8         EcoRI      80961    80967       1
9         EcoRI      81838    81844       1
10        EcoRI      81844    81838      -1
11        EcoRI      80967    80961      -1
12        EcoRI      44712    44706      -1
13        EcoRI      40993    40987      -1
14        EcoRI      29189    29183      -1
15        EcoRI      26645    26639      -1
16        EcoRI      26418    26412      -1
17        EcoRI      17766    17760      -1
18        EcoRI      17677    17671      -1
19        EcoRI      16730    16724      -1
The information on the 20 occurrences of 'GAATTC' has
been saved as a file named 'seq_account_EcoRI_chrmt.tsv'.

Returning a dataframe with the information as well.
In [4]:
df
Out[4]:
seq. element start_pos end_pos strand
0 EcoRI 16724 16730 1
1 EcoRI 17671 17677 1
2 EcoRI 17760 17766 1
3 EcoRI 26412 26418 1
4 EcoRI 26639 26645 1
5 EcoRI 29183 29189 1
6 EcoRI 40987 40993 1
7 EcoRI 44706 44712 1
8 EcoRI 80961 80967 1
9 EcoRI 81838 81844 1
10 EcoRI 81844 81838 -1
11 EcoRI 80967 80961 -1
12 EcoRI 44712 44706 -1
13 EcoRI 40993 40987 -1
14 EcoRI 29189 29183 -1
15 EcoRI 26645 26639 -1
16 EcoRI 26418 26412 -1
17 EcoRI 17766 17760 -1
18 EcoRI 17677 17671 -1
19 EcoRI 16730 16724 -1
In [5]:
source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
element = "GAATTC"
id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically
df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)
For documenting purposes, the following lists the collected occurences:
   seq. element  start_pos  end_pos  strand
0        GAATTC      16724    16730       1
1        GAATTC      17671    17677       1
2        GAATTC      17760    17766       1
3        GAATTC      26412    26418       1
4        GAATTC      26639    26645       1
5        GAATTC      29183    29189       1
6        GAATTC      40987    40993       1
7        GAATTC      44706    44712       1
8        GAATTC      80961    80967       1
9        GAATTC      81838    81844       1
10       GAATTC      81844    81838      -1
11       GAATTC      80967    80961      -1
12       GAATTC      44712    44706      -1
13       GAATTC      40993    40987      -1
14       GAATTC      29189    29183      -1
15       GAATTC      26645    26639      -1
16       GAATTC      26418    26412      -1
17       GAATTC      17766    17760      -1
18       GAATTC      17677    17671      -1
19       GAATTC      16730    16724      -1
The information on the 20 occurrences of 'GAATTC' has
been saved as a file named 'seq_account_GAATTC_chrmt.tsv'.

Returning a dataframe with the information as well.
In [6]:
source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
element = "TTCATACTTTTTATTAATATAATATAATATAATATTATTAATACTTTCT"
id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically
df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)
For documenting purposes, the following lists the collected occurences:
           seq. element  start_pos  end_pos  strand
0  TTCATACTTTTTATTAA...       2705     2754       1
The information on the 1 occurrences of 'TTCATACTTTTTATTAATATAATATAATATAATATTATTAATACTTTCT' has
been saved as a file named 'seq_account_TTCATACTTTTTATTAA-_chrmt.tsv'.

Returning a dataframe with the information as well.
In [7]:
df
Out[7]:
seq. element start_pos end_pos strand
0 TTCATACTTTTTATTAA... 2705 2754 1
In [8]:
source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
element = "TTCATACTTTTTATTAA"
id_of_seq_element = None #Set to `None` without quotes or backticks to have defined automatically
df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)
For documenting purposes, the following lists the collected occurences:
        seq. element  start_pos  end_pos  strand
0  TTCATACTTTTTATTAA       2705     2722       1
The information on the 1 occurrences of 'TTCATACTTTTTATTAA' has
been saved as a file named 'seq_account_TTCATACTTTTTATTAA_chrmt.tsv'.

Returning a dataframe with the information as well.
In [9]:
df
Out[9]:
seq. element start_pos end_pos strand
0 TTCATACTTTTTATTAA 2705 2722 1

For loading the script into a cell from Github, see the 'Skipping pasting by loading into a cell' section of here to understand the paradigm and then call the core function similar to the cells just above this cell.


Running the script with import of the main function

This is similar to the last section,'Running core function of the script after loading into a cell', but here we take advantage of Python's import statement to do what we did by pasting or loading code into a cell and running it.

First insure the script is available where you are running. Running the next command will do that here.

In [10]:
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/FindSequence/find_sequence_element_occurrences_in_sequence.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18500  100 18500    0     0   240k      0 --:--:-- --:--:-- --:--:--  240k

Then import the main function of the script to the notebook's active computational environment via an import statement.

In [11]:
from find_sequence_element_occurrences_in_sequence import find_sequence_element_occurrences_in_sequence

(As written above the command to do that looks a bit redundant; however, the first from part of the command below actually is referencing the find_sequence_element_occurrences_in_sequence script, but it doesn't need the .py extension because the import only deals with such files.)

With the main function imported, it is now available to be run. As with the above method to run the script, the variables involved need to be defined before calling the function.

In [12]:
source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
element = "GAATTC"
id_of_seq_element = "EcoRI" #Set to `None` without quotes or backticks to have defined automatically
df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)
For documenting purposes, the following lists the collected occurences:
   seq. element  start_pos  end_pos  strand
0         EcoRI      16724    16730       1
1         EcoRI      17671    17677       1
2         EcoRI      17760    17766       1
3         EcoRI      26412    26418       1
4         EcoRI      26639    26645       1
5         EcoRI      29183    29189       1
6         EcoRI      40987    40993       1
7         EcoRI      44706    44712       1
8         EcoRI      80961    80967       1
9         EcoRI      81838    81844       1
10        EcoRI      81844    81838      -1
11        EcoRI      80967    80961      -1
12        EcoRI      44712    44706      -1
13        EcoRI      40993    40987      -1
14        EcoRI      29189    29183      -1
15        EcoRI      26645    26639      -1
16        EcoRI      26418    26412      -1
17        EcoRI      17766    17760      -1
18        EcoRI      17677    17671      -1
19        EcoRI      16730    16724      -1
The information on the 20 occurrences of 'GAATTC' has
been saved as a file named 'seq_account_EcoRI_chrmt.tsv'.

Returning a dataframe with the information as well.

Complex use case example: process multiple searches for sequence elements

In [13]:
!curl -O https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-015-1664-4/MediaObjects/12864_2015_1664_MOESM9_ESM.xlsx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9726  100  9726    0     0  94427      0 --:--:-- --:--:-- --:--:-- 94427
In [14]:
!pip install xlrd
Requirement already satisfied: xlrd in /srv/conda/envs/notebook/lib/python3.7/site-packages (2.0.1)
In [15]:
import pandas as pd
GC_df = pd.read_excel('12864_2015_1664_MOESM9_ESM.xlsx')
GC_df.rename(columns={"Consensus Sequence 5' -> 3'":'Consensus'}, inplace=True) # I was having difficulty using bracket notation and so this makes easier
GC_df
Out[15]:
Class Consensus
0 M1 TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG
1 M1' TTCCCGCTTCGCGGGAACCCCGTAAGGAG
2 M2 TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG
3 M2' TCCCCGCCCCGGCGGGGACCCCGAAGGAG
4 M2'' TCCGGCCGAAGGAG
5 M3 TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAA...
6 M4 TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG
7 G CCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT
8 V CCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG
In [16]:
for row in GC_df.itertuples():
    #print(row) # for debugging
    source = "https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa"
    element = row.Consensus
    id_of_seq_element = row.Class
    df = find_sequence_element_occurrences_in_sequence(source,element,id_of_seq_element,return_dataframe = True)
    #print(df)
For documenting purposes, the following lists the collected occurences:
   seq. element  start_pos  end_pos  strand
0            M1       2097     2137       1
1            M1       3062     3102       1
2            M1       8238     8278       1
3            M1       9034     9074       1
4            M1      11049    11089       1
5            M1      11508    11548       1
6            M1      34289    34329       1
7            M1      34937    34977       1
8            M1      46317    46357       1
9            M1      48104    48144       1
10           M1      50113    50153       1
11           M1      53829    53869       1
12           M1      62864    62904       1
13           M1      66829    66869       1
14           M1      69446    69486       1
15           M1      76597    76637       1
16           M1      80505    80545       1
17           M1      81406    81446       1
18           M1      83056    83096       1
19           M1      75661    75621      -1
20           M1      71254    71214      -1
21           M1      57070    57030      -1
22           M1      52372    52332      -1
23           M1      47515    47475      -1
24           M1       9530     9490      -1
25           M1       2792     2752      -1
26           M1        125       85      -1
The information on the 27 occurrences of 'TTCCGGGGCCCGGCCACGGGAGCCGGAACCCCGAAAGGAG' has
been saved as a file named 'seq_account_M1_chrmt.tsv'.

Returning a dataframe with the information as well.
No occurrences of 'TTCCCGCTTCGCGGGAACCCCGTAAGGAG' found in the provided sequence.

No data to return in a dataframe and so returning `None`.
For documenting purposes, the following lists the collected occurences:
  seq. element  start_pos  end_pos  strand
0           M2      65101    65067      -1
The information on the 1 occurrences of 'TCCGGCCCGCCCCGCGGGGCGGACCCCGAAGGAG' has
been saved as a file named 'seq_account_M2_chrmt.tsv'.

Returning a dataframe with the information as well.
For documenting purposes, the following lists the collected occurences:
  seq. element  start_pos  end_pos  strand
0          M2'       2855     2884       1
1          M2'      34171    34200       1
2          M2'      73206    73235       1
3          M2'      68689    68660      -1
4          M2'      56701    56672      -1
5          M2'       4183     4154      -1
The information on the 6 occurrences of 'TCCCCGCCCCGGCGGGGACCCCGAAGGAG' has
been saved as a file named 'seq_account_M2'_chrmt.tsv'.

Returning a dataframe with the information as well.
For documenting purposes, the following lists the collected occurences:
  seq. element  start_pos  end_pos  strand
0         M2''      57860    57874       1
1         M2''      84762    84748      -1
2         M2''      72792    72778      -1
3         M2''      65067    65053      -1
4         M2''       8668     8654      -1
The information on the 5 occurrences of 'TCCGGCCGAAGGAG' has
been saved as a file named 'seq_account_M2''_chrmt.tsv'.

Returning a dataframe with the information as well.
No occurrences of 'TGAGGGACCCCTCCCTATACTATGGGAGGGGGACCGAACCCTTTAAAGAAGAG' found in the provided sequence.

No data to return in a dataframe and so returning `None`.
No occurrences of 'TGAACACCTTTATTTAATTATAAAGGTGTGAACCAATCCGCAAGGCAAG' found in the provided sequence.

No data to return in a dataframe and so returning `None`.
For documenting purposes, the following lists the collected occurences:
  seq. element  start_pos  end_pos  strand
0            G      30185    30220       1
1            G      44891    44926       1
2            G      12816    12781      -1
3            G       4348     4313      -1
The information on the 4 occurrences of 'CCCGGTTTCTTACGAAACCGGGACCTCGGAGACGT' has
been saved as a file named 'seq_account_G_chrmt.tsv'.

Returning a dataframe with the information as well.
For documenting purposes, the following lists the collected occurences:
  seq. element  start_pos  end_pos  strand
0            V      47719    47765       1
1            V      49089    49135       1
The information on the 2 occurrences of 'CCCCGCGGGCGCCAATCCGGTTGTTCACCGGATTGGTCCCGCGGGG' has
been saved as a file named 'seq_account_V_chrmt.tsv'.

Returning a dataframe with the information as well.