
Examples of running the script

There are several options to run UCSC_chrom_sizes_2_circos_karyotype.py or utilize the core function.
This notebook will demonstrate the following:

Running as script file
Running core function of the script after loading into a cell via one of two alternative routes
Running core function of the script after import

(If you have any problems with any of this, note that this notebook was developed in the environment that can be launched via the associated 'launch binder' badge. You can always launch that environment, upload this notebook there, and things should work.)

Running as script file

This is similar to how one would run a script from the command line. (Aspects of that are reviewed in this section, too.)

Upload the script to the directory where you want to run it. Or upload it to a running Jupyter environment.

(For the sake of this demonstration, I am going to use curl to get the file from GitHub and place it in the 'local' environment. You can of course use whatever download and upload steps you'd like, such as using a browser and your system's graphical user interface, to place the script in the directory. 'local' is in quotes because if running this in a Jupyter interface via the Binder system, 'local' would be inside the running environment.)

In [1]:

!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/circos-utilities/UCSC_chrom_sizes_2_circos_karyotype.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16358  100 16358    0     0  91898      0 --:--:-- --:--:-- --:--:-- 91385

That command would work on the command line without the exclamation point. Here, the exclamation point tells Jupyter not to treat the line as Python code and instead to pass the command to the available command-line shell.

Then, after the script is uploaded...
If running on the command line, you would enter:

python UCSC_chrom_sizes_2_circos_karyotype.py http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes

Or something similar to that depending on your Python environment and source of data. (See more about that here.)

Similarly, you can do that in the Jupyter environment using either !python before the script name or the %run magic command.
The %run magic command is demonstrated in the next cell. If you are in an active Jupyter environment, to run it click on the next cell and press shift-enter or click run on the toolbar above the notebook.

In [2]:

%run UCSC_chrom_sizes_2_circos_karyotype.py --help

usage: UCSC_chrom_sizes_2_circos_karyotype.py [-h] [-sc SPECIES_CODE]
                                              URL [OUTPUT_FILE]

UCSC_chrom_sizes_2_circos_karyotype.py takes a URL for a UCSC chrom.sizes file
and makes a karyotype.tab file. **** Script by Wayne Decatur (fomightez @
github) ***

positional arguments:
  URL                   URL of chrom.sizes file at UCSC.
  OUTPUT_FILE           **OPTIONAL**Name of file for storing the karyotype. If
                        none is provided, the karyotype will be stored as
                        'karyotype.tab'.

optional arguments:
  -h, --help            show this help message and exit
  -sc SPECIES_CODE, --species_code SPECIES_CODE
                        **OPTIONAL**Identifier to use in front of chromosome
                        names. An attempt will be made to extract one if
                        nothing is provided & that is why it's optional.
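
Outside IPython, where the ! prefix and %run are not available, a rough plain-Python counterpart is to launch the script in a child process. The following is a minimal sketch under that assumption, not something the script itself provides; the helper name `run_script` is hypothetical, and the commented usage line assumes the script file sits in the current working directory.

```python
import subprocess
import sys


def run_script(script_path, *args):
    """Run a Python script in a child process and return the CompletedProcess.

    `run_script` is a hypothetical helper used only for illustration.
    """
    # sys.executable is the interpreter running this code, so the child
    # process uses the same Python environment.
    command = [sys.executable, script_path] + list(args)
    return subprocess.run(command, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)


# Example use, once the script has been downloaded:
# result = run_script("UCSC_chrom_sizes_2_circos_karyotype.py", "--help")
# print(result.stdout)
```

The captured text is available as `result.stdout` and `result.stderr`; note that the script writes its status messages to stderr.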

The next cell is an example in which actual arguments are provided, as outlined in the usage message printed by the cell above when the script was run with the --help/-h flag. See here for help with coming up with your own parameters to pass to the script.

In [3]:

%run UCSC_chrom_sizes_2_circos_karyotype.py http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes dog_karyotype.tab --species_code dog

The following species code will be used in the ID column in the
produced karyotype file: 'dog'.

The karyotype file for 41 chromosomes has been saved as a file named 'dog_karyotype.tab'.
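
To check the result, the saved file can be read back. Below is a minimal sketch, assuming the seven tab-separated columns that the script's comments give as example output (`chr`, `-`, ID, label, start, end, color); the helper name `read_karyotype` is hypothetical and not part of the script.

```python
def read_karyotype(path):
    """Return a list of (chromosome ID, label, length) tuples from a karyotype file.

    `read_karyotype` is a hypothetical helper used only for illustration.
    """
    records = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 7:
                continue  # skip blank or unexpected lines
            # Columns: chr, -, ID, label, start, end, color
            records.append((fields[2], fields[3], int(fields[5])))
    return records


# Example use, once `dog_karyotype.tab` exists in the working directory:
# for chrom_id, label, length in read_karyotype("dog_karyotype.tab")[:3]:
#     print(chrom_id, label, length)
```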

Running core function of the script after loading into a cell

The script can be pasted into a cell or loaded from GitHub. Both routes are demonstrated in this section of the notebook.

The first part of this section covers pasting the script into a cell.
In the next cell is the script (although it might not be the most up-to-date version, so it would be best to get and paste the most up-to-date version from GitHub or use the %load approach to fetch it directly into the cell, as discussed below).

In [4]:

#!/usr/bin/env python
# UCSC_chrom_sizes_2_circos_karyotype.py
__author__ = "Wayne Decatur" #fomightez on GitHub
__license__ = "MIT"
__version__ = "0.2.0"


# UCSC_chrom_sizes_2_circos_karyotype.py by Wayne Decatur
# ver 0.2
#
#*******************************************************************************
# Verified compatible with both Python 2.7 and Python 3.6; written initially in 
# Python 3.
#
# PURPOSE: Takes a URL for a UCSC `chrom.sizes` file and makes a `karyotype.tab` 
# file from it for use with Circos.
# Note: to determine the URL, google `YOUR_ORGANISM genome UCSC chrom.sizes`, 
# where you replace `YOUR_ORGANISM` with your organism name and then
# adapt the path you see in the best match to be something similar to 
# "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes"
# -or-
# "http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes"
# 
# IMPORTANTLY, this script is intended for organisms without cytogenetic bands, 
# such as dog, cow, yeast, etc.
# Acquiring the cytogenetic bands information is described at 
# http://circos.ca/tutorials/lessons/ideograms/karyotypes/ , about halfway down 
# the page where it says, "obtain the karyotype structure from...". 
# Unfortunately, it seems the output directed to by those instructions is not
# directly useful in Circos(?). Fortunately, though as described at 
# http://circos.ca/documentation/tutorials/quick_start/hello_world/ 
# ,"Circos ships with several predefined karyotype files for common sequence 
# assemblies: human, mouse, rat, and drosophila. These files are located in 
# data/karyotype within the Circos distribution."
#
# Written to run from command line or pasted/loaded inside a Jupyter notebook 
# cell. 
#
#
#
# This script based on work and musings developed in 
# `Trying to convert k75.Umap.bedGraph to bigwig file that works at SGD jbrowse.md` 
# (specifically use of chrom.sizes) and 
# `Resources in regards to plotting information on presence or absence of signal on circular chromosome circos.md` 
# (where was describing issues with getting karyotype) and 
# http://circos.ca/tutorials/course/handouts/session-4.pdf (that shows first 
# part of Saccharomyces cerevisiae karyotype on page 6).
#
# Example input from 
# http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes:
'''
chrIV   1531933
chrXV   1091291
chrVII  1090940
chrXII  1078177
chrXVI  948066
chrXIII 924431
chrII   813184
chrXIV  784333
chrX    745751
chrXI   666816
chrV    576874
chrVIII 562643
chrIX   439888
chrIII  316620
chrVI   270161
chrI    230218
chrM    85779
'''

#
#Example output (tab-separated):
'''
chr -   Sc-chrIV    chrIV   0   1531933 black
chr -   Sc-chrXV    chrXV   0   1091291 black
chr -   Sc-chrVII   chrVII  0   1090940 black
chr -   Sc-chrXII   chrXII  0   1078177 black
chr -   Sc-chrXVI   chrXVI  0   948066  black
chr -   Sc-chrXIII  chrXIII 0   924431  black
chr -   Sc-chrII    chrII   0   813184  black
chr -   Sc-chrXIV   chrXIV  0   784333  black
chr -   Sc-chrX chrX    0   745751  black
chr -   Sc-chrXI    chrXI   0   666816  black
chr -   Sc-chrV chrV    0   576874  black
chr -   Sc-chrVIII  chrVIII 0   562643  black
chr -   Sc-chrIX    chrIX   0   439888  black
chr -   Sc-chrIII   chrIII  0   316620  black
chr -   Sc-chrVI    chrVI   0   270161  black
chr -   Sc-chrI chrI    0   230218  black
chr -   Sc-chrM chrM    0   85779   black
'''

#
#
# Dependencies beyond the mostly standard libraries/modules:
#
#
#
# VERSION HISTORY:
# v.0.1. basic working version
# v.0.2. removed references to `http://hgdownload-test.cse.ucsc.edu/..` because 
# seems UCSC has removed the `-test` part so that it is now
# `https://hgdownload.cse.ucsc.edu/...`
#
# To do:
# - probably would be nice to add automated handling of ordering by increasing 
# chromosome number. (I've used detection of roman numerals before, see 
# `plot_expression_across_chromosomes.py`.) Because it would need to be able to 
# store and sort, probably putting the chromosomes and lengths in a dataframe 
# instead would be a good route. Then could write a function to iterrows and 
# write the output lines.
# - possible to do: automate making ones for ones with cytogenetic bands, or is
# there not enough aside from the ones included?
#
#
#
#
# TO RUN:
# Examples,
# Enter on the command line of your terminal, the line
#-----------------------------------
# python UCSC_chrom_sizes_2_circos_karyotype.py http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes

#-OR-
# python UCSC_chrom_sizes_2_circos_karyotype.py http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes dog_karyotype.tab --species_code dog
#-----------------------------------
# Issue `python UCSC_chrom_sizes_2_circos_karyotype.py -h` for details.
# 
#
# To use this after pasting or loading into a cell in a Jupyter notebook, in
# the next cell define the URL and then call the main function similar to below:
# url = "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes"
# UCSC_chrom_sizes_2_circos_karyotype(url=url)
#
#(`species_code` can be passed in the call as well, and `output_file_name` can
# be assigned in a cell before calling the function.)
#
# Note that `url` is actually not needed if you are using the yeast one because 
# that specific one is hardcoded in script as default.
# In fact due to fact I hardcoded in defaults, just `main()` will indeed work 
# for yeast.
#
# 
#
'''
CURRENT ACTUAL CODE FOR RUNNING/TESTING IN A NOTEBOOK WHEN LOADED OR PASTED IN 
ANOTHER CELL:
UCSC_chrom_sizes_2_circos_karyotype()

-OR, just-

main()

'''
#
#
#*******************************************************************************
#





#*******************************************************************************
##################################
#  USER ADJUSTABLE VALUES        #

##################################
#
## default URL
url = "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes" 
output_file_name = "karyotype.tab"

species_code = None # replace `None` with what you want to use,
# with flanking quotes if something appropriate is not being extracted from the
# provided URL to be used as the species code.





#
#*******************************************************************************
#**********************END USER ADJUSTABLE VARIABLES****************************


















#*******************************************************************************
#*******************************************************************************
###DO NOT EDIT BELOW HERE - ENTER VALUES ABOVE###

import sys
import os




###---------------------------HELPER FUNCTIONS---------------------------------###


def make_and_save_karyotype(chromosomes_and_length, species_code):
    '''
    Takes a dictionary of chromosome identifiers and length and makes a karyotype
    file with that information.

    Result will look like this at start of output file:
    chr - Sc-chrIV chrIV 0 1531933 black
    chr - Sc-chrXV chrXV 0 1091291 black
    ...

    Function returns None.
    '''
    # prepare output file for saving so it will be open and ready
    with open(output_file_name, 'w') as output_file:
        for indx,(chrom,length) in enumerate(chromosomes_and_length.items()):
            next_line = ("chr\t-\t{species_code}-{chrom}\t{chrom}\t0"
                "\t{length}\tblack".format(
                species_code=species_code,chrom=chrom, length=length))
            if indx < (len(chromosomes_and_length)-1):
                next_line += "\n" # don't add new line character to last line
            # Send the built line to output
            output_file.write(next_line)
    sys.stderr.write( "\n\nThe karyotype file for {} chromosomes has been saved "
            "as a file named"
            " '{}'.".format(len(chromosomes_and_length),output_file_name))


def extract_species_code_fromUCSC_URL(url):
    '''
    Take something like:
    https://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes

    And return:
    sacCer

    Note:
    I decided to use `''.join([i for i in s if not i.isdigit()])`, where s is a
    provided string, to toss digits.
    '''
    species_code = url.split("goldenPath")[1].split("/")[1]
    return ''.join([i for i in species_code if not i.isdigit()]) # remove digits
###--------------------------END OF HELPER FUNCTIONS---------------------------###
###--------------------------END OF HELPER FUNCTIONS---------------------------###

#*******************************************************************************
###------------------------'main' function of script---------------------------##
# This switch below about `species_code` added here so that above the 
# user can see they can edit it above 'END USER ADJUSTABLE VARIABLES' to make it 
# a string, but I want it to default to `False` when not set to make checking 
# status easier.
if species_code is None or species_code == "None":
    species_code = False

def UCSC_chrom_sizes_2_circos_karyotype(url=url, species_code=species_code):
    '''
    Main function of script. Will use url to get `chrom.sizes` file from UCSC 
    and use that to make a karyotype file for use in Circos.
    Saves the file as tab-separated values with the extension `.tab`, by
    default, to be consistent with what Circos ecosystem seems to use.

    Default url is the yeast one if calling function from inside Jupyter or
    IPython.

    Optionally a string can be provided in the call to the function to be used
    as species in place of the one extracted automatically. Example: 
    `species_code = "doggie"`

    Returns: None
    '''
    # Get data from URL.
    chromosomes_and_length = {}
    # Getting html originally for just Python 3, adapted from 
    # https://stackoverflow.com/a/17510727/8508004 and then updated to 
    # handle Python 2 and 3 according to same link.
    try:
        # For Python 3.0 and later
        from urllib.request import urlopen
    except ImportError:
        # Fall back to Python 2's urllib2
        from urllib2 import urlopen
    html = urlopen(url)
    for line in html.read().splitlines():
        #chromosome, chr_len, *_ = line.strip().split()
        # that elegant unpack above is based on 
        # https://stackoverflow.com/questions/11371204/unpack-the-first-two-elements-in-list-tuple
        # , but it won't work in Python 2. From same place, one that works in 2:
        chromosome, chr_len = line.strip().split()[:2]
        chromosomes_and_length[chromosome.decode(
            encoding='UTF-8')] = chr_len.decode(encoding='UTF-8')



    # Parse the URL for a genus/species-type identifier. (If one not provided.)
    # Note part of keeping the URL separate is so that I can parse out from it the
    # first part of the genus-species identifier. Here in development version that
    # is `sacCer3`, for yeast Saccharomyces cerevisiae. Parsing
    # because of advice [here](http://circos.ca/documentation/tutorials/ideograms/karyotypes/), 
    # "Even when working with only one species, prefixing the chromosome with a 
    # species code is highly recommended - this will greatly help in creating 
    # more transparent configuration and data files."
    if species_code:
        species_code = species_code
        sys.stderr.write( "\nThe following "
                "species code will be used in the ID column "
                "in the\nproduced karyotype file: '{}'.".format(species_code))
    else:
        species_code = extract_species_code_fromUCSC_URL(url)
        if species_code == "sacCer":
            species_code = "Sc" # CUSTOMIZING; I'd prefer to use this for yeast.
        sys.stderr.write( "\nBased on the provided URL, the following "
                "species code will be used in the\nID column "
                "in the karyotype file: '{}'.\n"
                "If that is not suitable, you can re-run the script and "
                "provide one when calling\nthe script using the "
                "`--species_code` flag. Alternatively, edit "
                "the produced file with find/replace.".format(species_code))
    # With the approach in that above block, I can expose `species_code` to 
    # setting for advanced use without it being required and without need to be 
    # passed into the function.




    # Now use the data to make a karyotype file as described at 
    # http://circos.ca/documentation/tutorials/ideograms/karyotypes/ and like 
    # on page 6 of http://circos.ca/tutorials/course/handouts/session-4.pdf
    make_and_save_karyotype(chromosomes_and_length, species_code)
###--------------------------END OF MAIN FUNCTION----------------------------###
###--------------------------END OF MAIN FUNCTION----------------------------###














#*******************************************************************************
###------------------------'main' section of script---------------------------##

def main():
    """ Main entry point of the script """
    # placing actual main action in a 'helper' function so can call that easily 
    # with a distinguishing name in Jupyter notebooks, where `main()` may get
    # assigned multiple times depending how many scripts imported/pasted in.
    UCSC_chrom_sizes_2_circos_karyotype(url,species_code)
        







if __name__ == "__main__" and '__file__' in globals():
    """ This is executed when run from the command line """
    # Code with just `if __name__ == "__main__":` alone will be run if pasted
    # into a notebook. The addition of ` and '__file__' in globals()` is based
    # on https://stackoverflow.com/a/22923872/8508004
    # See also https://stackoverflow.com/a/22424821/8508004 for an option to 
    # provide arguments when prototyping a full script in the notebook.
    ###-----------------for parsing command line arguments-----------------------###
    import argparse
    parser = argparse.ArgumentParser(prog='UCSC_chrom_sizes_2_circos_karyotype.py',
        description="UCSC_chrom_sizes_2_circos_karyotype.py takes a URL for a \
        UCSC chrom.sizes file and makes a karyotype.tab file.      \
        **** Script by Wayne Decatur   \
        (fomightez @ github) ***")

    parser.add_argument("URL", help="URL of chrom.sizes file at UCSC. \
        ", metavar="URL")
    parser.add_argument('-sc', '--species_code', action='store', type=str, 
        default= species_code, help="**OPTIONAL**Identifier \
        to use in front of chromosome names. An attempt will be made to extract \
        one if nothing is provided & that is why it's optional.")
    parser.add_argument("output", nargs='?', help="**OPTIONAL**Name of file \
        for storing the karyotype. If none is provided, the karyotype will be \
        stored as '"+output_file_name+"'.", 
        default=output_file_name , metavar="OUTPUT_FILE")

    # See
    # https://stackoverflow.com/questions/4480075/argparse-optional-positional-arguments 
    # and 
    # https://docs.python.org/2/library/argparse.html#nargs for use of `nargs='?'` 
    # to make output file name optional. Note that the square brackets
    # shown in the usage out signify optional according to 
    # https://stackoverflow.com/questions/4480075/argparse-optional-positional-arguments#comment40460395_4480202
    # , but because placed under positional I added clarifying text to help 
    # description.
    # IF MODIFYING THIS SCRIPT FOR USE ELSEWHERE AND DON'T NEED/WANT THE OUTPUT 
    # FILE TO BE OPTIONAL, remove `nargs` (& default?) BUT KEEP WHERE NOT
    # USING `argparse.FileType` AND USING `with open` AS CONSIDERED MORE PYTHONIC.



    # I would also like to trigger help to display if no arguments provided because 
    # need at least one for url
    if len(sys.argv)==1:    #from http://stackoverflow.com/questions/4042452/display-help-message-with-python-argparse-when-script-is-called-without-any-argu
        parser.print_help()
        sys.exit(1)
    args = parser.parse_args()
    url= args.URL
    output_file_name = args.output
    species_code = args.species_code


    main()

#*******************************************************************************
###-***********************END MAIN PORTION OF SCRIPT***********************-###
#*******************************************************************************

Now we have options for calling the core function of the script. The next two cells demonstrate that.

In [5]:

UCSC_chrom_sizes_2_circos_karyotype()

Based on the provided URL, the following species code will be used in the
ID column in the karyotype file: 'Sc'.
If that is not suitable, you can re-run the script and provide one when calling
the script using the `--species_code` flag. Alternatively, edit the produced file with find/replace.

The karyotype file for 17 chromosomes has been saved as a file named 'karyotype.tab'.
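
The transformation behind that output can be tried without a network connection. The following is a minimal sketch rather than the script's own code: `chrom_sizes_to_karyotype_lines` is a hypothetical helper that applies the same line format used in `make_and_save_karyotype` to `chrom.sizes` text that is already in hand.

```python
def chrom_sizes_to_karyotype_lines(chrom_sizes_text, species_code):
    """Convert chrom.sizes text into a list of Circos karyotype lines.

    Hypothetical helper for illustration; it mirrors the format string in
    the script but does no downloading or file writing.
    """
    lines = []
    for raw_line in chrom_sizes_text.splitlines():
        fields = raw_line.split()
        if len(fields) < 2:
            continue  # ignore blank or incomplete lines
        chrom, length = fields[:2]
        lines.append(
            "chr\t-\t{code}-{chrom}\t{chrom}\t0\t{length}\tblack".format(
                code=species_code, chrom=chrom, length=length))
    return lines


# First two lines of the yeast example input shown in the script's comments.
sample = "chrIV\t1531933\nchrXV\t1091291\n"
for line in chrom_sizes_to_karyotype_lines(sample, "Sc"):
    print(line)
```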

To provide your own settings, set the variables the script will use before calling it, like so:

In [6]:

url="http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes"
output_file_name = "dog_karyotype.tab"
species_code_hardcoded = "dog"
UCSC_chrom_sizes_2_circos_karyotype()

Based on the provided URL, the following species code will be used in the
ID column in the karyotype file: 'Sc'.
If that is not suitable, you can re-run the script and provide one when calling
the script using the `--species_code` flag. Alternatively, edit the produced file with find/replace.

The karyotype file for 17 chromosomes has been saved as a file named 'dog_karyotype.tab'.
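
Note that the output above still reports 'Sc' and 17 chromosomes, which are the yeast values. That is consistent with how Python handles default arguments: the defaults in `def UCSC_chrom_sizes_2_circos_karyotype(url=url, species_code=species_code)` are evaluated once, when the function is defined, so reassigning `url` afterwards does not change them, and `species_code_hardcoded` is not a name the function reads. Only `output_file_name`, which is looked up as a global at call time, takes effect. The small stand-in below illustrates the behavior; `describe` and the placeholder strings are hypothetical and not part of the script.

```python
url = "yeast_url"
output_file_name = "karyotype.tab"


def describe(url=url):
    """Stand-in for the core function: report which values a call would use."""
    # The default for `url` was captured when `def` ran; `output_file_name`
    # is read from the module's globals each time the function is called.
    return url, output_file_name


# Reassign both names after the function has been defined.
url = "dog_url"
output_file_name = "dog_karyotype.tab"

print(describe())         # the default `url` is still the value captured earlier
print(describe(url=url))  # passing the value explicitly uses the new one
```

Under that reading, passing the values in the call itself, for example `UCSC_chrom_sizes_2_circos_karyotype(url=url, species_code="dog")`, should apply the dog settings.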

Skipping pasting by loading into a cell

Next, to load the script into a cell directly from GitHub, you need the URL of the raw script; then you can use the %load magic command in a cell, like:

%load https://raw.githubusercontent.com/fomightez/sequencework/master/circos-utilities/UCSC_chrom_sizes_2_circos_karyotype.py

(Note it is possible to use that method to get a specific version of the script, see my comment here, and indeed that may be the best option if you are looking for reproducibility.)

Actually doing that will result in a cell that looks like the following because after the contents are loaded, the %load command is commented out and the contents of the cell are the script:

In [7]:

# %load https://raw.githubusercontent.com/fomightez/sequencework/master/circos-utilities/UCSC_chrom_sizes_2_circos_karyotype.py
#!/usr/bin/env python
# UCSC_chrom_sizes_2_circos_karyotype.py
__author__ = "Wayne Decatur" #fomightez on GitHub
__license__ = "MIT"
__version__ = "0.2.0"


# UCSC_chrom_sizes_2_circos_karyotype.py by Wayne Decatur
# ver 0.2
#
#*******************************************************************************
# Verified compatible with both Python 2.7 and Python 3.6; written initially in 
# Python 3.
#
# PURPOSE: Takes a URL for a UCSC `chrom.sizes` file and makes a `karyotype.tab` 
# file from it for use with Circos.
# Note: to determine the URL, google `YOUR_ORGANISM genome UCSC chrom.sizes`, 
# where you replace `YOUR_ORGANISM` with your organism name and then
# adapt the path you see in the best match to be something similar to 
# "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes"
# -or-
# "http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes"
# 
# IMPORTANTLY, this script is intended for organisms without cytogenetic bands, 
# such as dog, cow, yeast, etc.
# Acquiring the cytogenetic bands information is described at 
# http://circos.ca/tutorials/lessons/ideograms/karyotypes/ , about halfway down 
# the page where it says, "obtain the karyotype structure from...". 
# Unfortunately, it seems the output directed to by those instructions is not
# directly useful in Circos(?). Fortunately, though as described at 
# http://circos.ca/documentation/tutorials/quick_start/hello_world/ 
# ,"Circos ships with several predefined karyotype files for common sequence 
# assemblies: human, mouse, rat, and drosophila. These files are located in 
# data/karyotype within the Circos distribution."
#
# Written to run from command line or pasted/loaded inside a Jupyter notebook 
# cell. 
#
#
#
# This script based on work and musings developed in 
# `Trying to convert k75.Umap.bedGraph to bigwig file that works at SGD jbrowse.md` 
# (specifically use of chrom.sizes) and 
# `Resources in regards to plotting information on presence or absence of signal on circular chromosome circos.md` 
# (where was describing issues with getting karyotype) and 
# http://circos.ca/tutorials/course/handouts/session-4.pdf (that shows first 
# part of Saccharomyces cerevisiae karyotype on page 6).
#
# Example input from 
# http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes:
'''
chrIV   1531933
chrXV   1091291
chrVII  1090940
chrXII  1078177
chrXVI  948066
chrXIII 924431
chrII   813184
chrXIV  784333
chrX    745751
chrXI   666816
chrV    576874
chrVIII 562643
chrIX   439888
chrIII  316620
chrVI   270161
chrI    230218
chrM    85779
'''

#
#Example output (tab-separated):
'''
chr -   Sc-chrIV    chrIV   0   1531933 black
chr -   Sc-chrXV    chrXV   0   1091291 black
chr -   Sc-chrVII   chrVII  0   1090940 black
chr -   Sc-chrXII   chrXII  0   1078177 black
chr -   Sc-chrXVI   chrXVI  0   948066  black
chr -   Sc-chrXIII  chrXIII 0   924431  black
chr -   Sc-chrII    chrII   0   813184  black
chr -   Sc-chrXIV   chrXIV  0   784333  black
chr -   Sc-chrX chrX    0   745751  black
chr -   Sc-chrXI    chrXI   0   666816  black
chr -   Sc-chrV chrV    0   576874  black
chr -   Sc-chrVIII  chrVIII 0   562643  black
chr -   Sc-chrIX    chrIX   0   439888  black
chr -   Sc-chrIII   chrIII  0   316620  black
chr -   Sc-chrVI    chrVI   0   270161  black
chr -   Sc-chrI chrI    0   230218  black
chr -   Sc-chrM chrM    0   85779   black
'''

#
#
# Dependencies beyond the mostly standard libraries/modules:
#
#
#
# VERSION HISTORY:
# v.0.1. basic working version
# v.0.2. removed references to `http://hgdownload-test.cse.ucsc.edu/..` because 
# seems UCSC has removed the `-test` part so that it is now
# `https://hgdownload.cse.ucsc.edu/...`
#
# To do:
# - probably would be nice to add automated handling of ordering by increasing 
# chromosome number. (I've used detection of roman numerals before, see 
# `plot_expression_across_chromosomes.py`.) Because it would need to be able to 
# store and sort, probably putting the chromosomes and lengths in a dataframe 
# instead would be a good route. Then could write a function to iterrows and 
# write the output lines.
# - possible to do: automate making ones for ones with cytogenetic bands, or is
# there not enough aside from the ones included?
#
#
#
#
# TO RUN:
# Examples,
# Enter on the command line of your terminal, the line
#-----------------------------------
# python UCSC_chrom_sizes_2_circos_karyotype.py http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes

#-OR-
# python UCSC_chrom_sizes_2_circos_karyotype.py http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes dog_karyotype.tab --species_code dog
#-----------------------------------
# Issue `python UCSC_chrom_sizes_2_circos_karyotype.py -h` for details.
# 
#
# To use this after pasting or loading into a cell in a Jupyter notebook, in
# the next cell define the URL and then call the main function similar to below:
# url = "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes"
# UCSC_chrom_sizes_2_circos_karyotype(url=url)
#
#(`species_code` can be passed in the call as well, and `output_file_name` can
# be assigned in a cell before calling the function.)
#
# Note that `url` is actually not needed if you are using the yeast one because 
# that specific one is hardcoded in script as default.
# In fact due to fact I hardcoded in defaults, just `main()` will indeed work 
# for yeast.
#
# 
#
'''
CURRENT ACTUAL CODE FOR RUNNING/TESTING IN A NOTEBOOK WHEN LOADED OR PASTED IN 
ANOTHER CELL:
UCSC_chrom_sizes_2_circos_karyotype()

-OR, just-

main()

'''
#
#
#*******************************************************************************
#





#*******************************************************************************
##################################
#  USER ADJUSTABLE VALUES        #

##################################
#
## default URL
url = "http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes" 
output_file_name = "karyotype.tab"

species_code = None # replace `None` with what you want to use,
# with flanking quotes if something appropriate is not being extracted from the
# provided URL to be used as the species code.





#
#*******************************************************************************
#**********************END USER ADJUSTABLE VARIABLES****************************


















#*******************************************************************************
#*******************************************************************************
###DO NOT EDIT BELOW HERE - ENTER VALUES ABOVE###

import sys
import os




###---------------------------HELPER FUNCTIONS---------------------------------###


def make_and_save_karyotype(chromosomes_and_length, species_code):
    '''
    Takes a dictionary of chromosome identifiers and length and makes a karyotype
    file with that information.

    Result will look like this at start of output file:
    chr - Sc-chrIV chrIV 0 1531933 black
    chr - Sc-chrXV chrXV 0 1091291 black
    ...

    Function returns None.
    '''
    # prepare output file for saving so it will be open and ready
    with open(output_file_name, 'w') as output_file:
        for indx,(chrom,length) in enumerate(chromosomes_and_length.items()):
            next_line = ("chr\t-\t{species_code}-{chrom}\t{chrom}\t0"
                "\t{length}\tblack".format(
                species_code=species_code,chrom=chrom, length=length))
            if indx < (len(chromosomes_and_length)-1):
                next_line += "\n" # don't add new line character to last line
            # Send the built line to output
            output_file.write(next_line)
    sys.stderr.write( "\n\nThe karyotype file for {} chromosomes has been saved "
            "as a file named"
            " '{}'.".format(len(chromosomes_and_length),output_file_name))


def extract_species_code_fromUCSC_URL(url):
    '''
    Take something like:
    https://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes

    And return:
    sacCer

    Note:
    I decided to use `''.join([i for i in s if not i.isdigit()])`, where s is a
    provided string, to toss digits.
    '''
    species_code = url.split("goldenPath")[1].split("/")[1]
    return ''.join([i for i in species_code if not i.isdigit()]) # remove digits
###--------------------------END OF HELPER FUNCTIONS---------------------------###
###--------------------------END OF HELPER FUNCTIONS---------------------------###

#*******************************************************************************
###------------------------'main' function of script---------------------------##
# This switch below about `species_code` added here so that above the 
# user can see they can edit it above 'END USER ADJUSTABLE VARIABLES' to make it 
# a string, but I want it to default to `False` when not set to make checking 
# status easier.
if species_code is None or species_code == "None":
    species_code = False

def UCSC_chrom_sizes_2_circos_karyotype(url=url, species_code=species_code):
    '''
    Main function of script. Will use url to get `chrom.sizes` file from UCSC 
    and use that to make a karyotype file for use in Circos.
    Saves the file as tab-separated values with the extension `.tab`, by
    default, to be consistent with what Circos ecosystem seems to use.

    Default url is the yeast one if calling function from inside Jupyter or
    IPython.

    Optionally a string can be provided in the call to the function to be used
    as species in place of the one extracted automatically. Example: 
    `species_code = "doggie"`

    Returns: None
    '''
    # Get data from URL.
    chromosomes_and_length = {}
    # Getting html originally for just Python 3, adapted from 
    # https://stackoverflow.com/a/17510727/8508004 and then updated to 
    # handle Python 2 and 3 according to same link.
    try:
        # For Python 3.0 and later
        from urllib.request import urlopen
    except ImportError:
        # Fall back to Python 2's urllib2
        from urllib2 import urlopen
    html = urlopen(url)
    for line in html.read().splitlines():
        #chromosome, chr_len, *_ = line.strip().split()
        # that elegant unpack above is based on 
        # https://stackoverflow.com/questions/11371204/unpack-the-first-two-elements-in-list-tuple
        # , but it won't work in Python 2. From same place, one that works in 2:
        chromosome, chr_len = line.strip().split()[:2]
        chromosomes_and_length[chromosome.decode(
            encoding='UTF-8')] = chr_len.decode(encoding='UTF-8')



    # Parse the URL for a genus/species -type identifier. (If one not provided.)
    # Note part of keeping URL separate is so that I parse it to parse out from URL 
    # first part of genus-species identifier. Here in development version that is 
    # `sacCer3`, for yeast Saccharomyces cerevisiae. Parsing
    # because of advice [here](http://circos.ca/documentation/tutorials/ideograms/karyotypes/), 
    # "Even when working with only one species, prefixing the chromosome with a 
    # species code is highly recommended - this will greatly help in creating 
    # more transparent configuration and data files."
    if species_code:
        species_code = species_code
        sys.stderr.write( "\nThe following "
                "species code will be used in the ID column "
                "in the\nproduced karyotype file: '{}'.".format(species_code))
    else:
        species_code = extract_species_code_fromUCSC_URL(url)
        if species_code == "sacCer":
            species_code = "Sc" # CUSTOMIZING; I'd prefer to use this for yeast.
        sys.stderr.write( "\nBased on the provided URL, the following "
                "species code will be used in the\nID column "
                "in the karyotype file: '{}'.\n"
                "If that is not suitable, you can re-run the script and "
                "provide one when calling\nthe script using the "
                "`--species_code` flag. Alternatively, edit "
                "the produced file with find/replace.".format(species_code))
    # With the approach in that above block, I can expose `species_code` to 
    # setting for advanced use without it being required and without need to be 
    # passed into the function.




    # Now use the data to make a karyotype file as described at 
    # http://circos.ca/documentation/tutorials/ideograms/karyotypes/ and like 
    # on page 6 of http://circos.ca/tutorials/course/handouts/session-4.pdf
    make_and_save_karyotype(chromosomes_and_length, species_code)
###--------------------------END OF MAIN FUNCTION----------------------------###
###--------------------------END OF MAIN FUNCTION----------------------------###














#*******************************************************************************
###------------------------'main' section of script---------------------------##

def main():
    """ Main entry point of the script """
    # placing actual main action in a 'helper' script so can call that easily 
    # with a distinguishing name in Jupyter notebooks, where `main()` may get
    # assigned multiple times depending how many scripts imported/pasted in.
    UCSC_chrom_sizes_2_circos_karyotype(url,species_code)
        







if __name__ == "__main__" and '__file__' in globals():
    """ This is executed when run from the command line """
    # Code with just `if __name__ == "__main__":` alone will be run if pasted
    # into a notebook. The addition of ` and '__file__' in globals()` is based
    # on https://stackoverflow.com/a/22923872/8508004
    # See also https://stackoverflow.com/a/22424821/8508004 for an option to 
    # provide arguments when prototyping a full script in the notebook.
    ###-----------------for parsing command line arguments-----------------------###
    import argparse
    parser = argparse.ArgumentParser(prog='UCSC_chrom_sizes_2_circos_karyotype.py',
        description="UCSC_chrom_sizes_2_circos_karyotype.py takes a URL for a \
        UCSC chrom.sizes file and makes a karyotype.tab file.      \
        **** Script by Wayne Decatur   \
        (fomightez @ github) ***")

    parser.add_argument("URL", help="URL of chrom.sizes file at UCSC. \
        ", metavar="URL")
    parser.add_argument('-sc', '--species_code', action='store', type=str, 
        default= species_code_hardcoded, help="**OPTIONAL**Identifier \
        to use in front of chromosome names. An attempt will be made to extract \
        one if nothing is provided & that is why it's optional.")
    parser.add_argument("output", nargs='?', help="**OPTIONAL**Name of file \
        for storing the karyotype. If none is provided, the karyotype will be \
        stored as '"+output_file_name+"'.", 
        default=output_file_name , metavar="OUTPUT_FILE")

    # See
    # https://stackoverflow.com/questions/4480075/argparse-optional-positional-arguments 
    # and 
    # https://docs.python.org/2/library/argparse.html#nargs for use of `nargs='?'` 
    # to make output file name optional. Note that the square brackets
    # shown in the usage out signify optional according to 
    # https://stackoverflow.com/questions/4480075/argparse-optional-positional-arguments#comment40460395_4480202
    # , but because placed under positional I added clarifying text to help 
    # description.
    # IF MODIFYING THIS SCRIPT FOR USE ELSEWHERE AND DON'T NEED/WANT THE OUTPUT 
    # FILE TO BE OPTIONAL, remove `nargs` (& default?) BUT KEEP WHERE NOT
    # USING `argparse.FileType` AND USING `with open` AS CONSIDERED MORE PYTHONIC.



    # I would also like to trigger help to display if no arguments provided because 
    # need at least one for url
    if len(sys.argv)==1:    #from http://stackoverflow.com/questions/4042452/display-help-message-with-python-argparse-when-script-is-called-without-any-argu
        parser.print_help()
        sys.exit(1)
    args = parser.parse_args()
    url= args.URL
    output_file_name = args.output
    species_code = args.species_code


    main()

#*******************************************************************************
###-***********************END MAIN PORTION OF SCRIPT***********************-###
#*******************************************************************************

As with pasting the code into a cell, once the script is loaded into a cell there are options for calling the main function. Demonstrating those:

In [8]:

UCSC_chrom_sizes_2_circos_karyotype()

Based on the provided URL, the following species code will be used in the
ID column in the karyotype file: 'Sc'.
If that is not suitable, you can re-run the script and provide one when calling
the script using the `--species_code` flag. Alternatively, edit the produced file with find/replace.

The karyotype file for 17 chromosomes has been saved as a file named 'karyotype.tab'.

To provide your own settings, assign them to a variable and pass that variable in the call, like so:

In [9]:

url="http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes"
UCSC_chrom_sizes_2_circos_karyotype(url)

Based on the provided URL, the following species code will be used in the
ID column in the karyotype file: 'canFam'.
If that is not suitable, you can re-run the script and provide one when calling
the script using the `--species_code` flag. Alternatively, edit the produced file with find/replace.

The karyotype file for 41 chromosomes has been saved as a file named 'karyotype.tab'.
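
The species code 'canFam' reported above is derived from the URL itself. As a minimal sketch of that step, mirroring the helper in the loaded script, the cell below takes the path segment after `goldenPath` and drops the digits. The name `sketch_species_code` is made up for this illustration so it does not replace anything from the script, and the sketch leaves out the script's extra step of swapping 'sacCer' for 'Sc'.

```python
def sketch_species_code(url):
    """Return the UCSC assembly name found in the URL, with digits removed."""
    # The path segment right after `goldenPath` is the assembly, e.g. `canFam2`.
    assembly = url.split("goldenPath")[1].split("/")[1]
    # Drop the version digits to leave just the species-like prefix.
    return "".join(ch for ch in assembly if not ch.isdigit())

dog_url = ("http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/"
           "bigZips/canFam2.chrom.sizes")
print(sketch_species_code(dog_url))  # prints: canFam
```

If the extracted code is not what you want in the ID column, supplying species_code in the call overrides it.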

Running core function of the script after importing¶

This is similar to the last section, 'Running core function of the script after loading into a cell', but here we take advantage of Python's import statement to do what we did by pasting or loading code into a cell and running it. This is the preferred way to use the script's main function if you are working inside a Jupyter notebook or IPython. (The above section was included only because it is easier to follow than an explanation of import.)

First ensure the script is available where you are running. Running the next command will do that here. (You may have already done it earlier, but it is okay to run it again.)

In [10]:

!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/circos-utilities/UCSC_chrom_sizes_2_circos_karyotype.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16358  100 16358    0     0   202k      0 --:--:-- --:--:-- --:--:--  202k

Then import the main function of the script to the notebook's active computational environment via an import statement.

In [11]:

from UCSC_chrom_sizes_2_circos_karyotype import UCSC_chrom_sizes_2_circos_karyotype

(As written, the command above looks a bit redundant. However, the name following from refers to the UCSC_chrom_sizes_2_circos_karyotype script file; it doesn't need the .py extension because import only deals with such files. The name following import is the function of the same name defined inside that file.)
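
To see the two roles of the repeated name in isolation, here is a small self-contained sketch. It writes a throwaway module to a temporary directory and imports a function from it. The name `demo_tool` is hypothetical and has nothing to do with the real script; it is chosen only so that the file and the function share a name, as they do above.

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module whose file name matches the function it defines.
# `demo_tool` is a made-up name used only for this illustration.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "demo_tool.py"), "w") as handle:
    handle.write("def demo_tool():\n    return 'called demo_tool'\n")

sys.path.insert(0, module_dir)   # let Python find the new file
importlib.invalidate_caches()    # make sure the freshly written file is noticed

# First name: the file (without `.py`). Second name: the function inside it.
from demo_tool import demo_tool

result = demo_tool()
print(result)  # prints: called demo_tool
```

The same pattern applies to the real script, except that no path change is needed there because the file sits in the notebook's working directory.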

With the main function imported, it is now available to be run.

In [12]:

UCSC_chrom_sizes_2_circos_karyotype()

Based on the provided URL, the following species code will be used in the
ID column in the karyotype file: 'Sc'.
If that is not suitable, you can re-run the script and provide one when calling
the script using the `--species_code` flag. Alternatively, edit the produced file with find/replace.

The karyotype file for 17 chromosomes has been saved as a file named 'karyotype.tab'.

To provide your own settings, assign them to a variable and then call the function with that variable, like so:

In [13]:

url="http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes"
UCSC_chrom_sizes_2_circos_karyotype(url)

Based on the provided URL, the following species code will be used in the
ID column in the karyotype file: 'canFam'.
If that is not suitable, you can re-run the script and provide one when calling
the script using the `--species_code` flag. Alternatively, edit the produced file with find/replace.

The karyotype file for 41 chromosomes has been saved as a file named 'karyotype.tab'.

Or call it directly with that long URL. You can also specify the other setting allowed, species_code, at the same time, like so:

In [14]:

UCSC_chrom_sizes_2_circos_karyotype("http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/bigZips/canFam2.chrom.sizes",species_code="Dog")

The following species code will be used in the ID column in the
produced karyotype file: 'Dog'.

The karyotype file for 41 chromosomes has been saved as a file named 'karyotype.tab'.

Setting species_code lets you apply what you want for the species instead of relying on the script to extract it.
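
For a sense of where the species code ends up, the following sketch builds karyotype lines the same way the script's saving step formats them: tab-separated, with the code prefixed to each chromosome name in the ID column. The function name `sketch_karyotype_lines` and the code 'yeast' are made up for this illustration; the two chromosome lengths are the ones shown in the script's own docstring example.

```python
def sketch_karyotype_lines(chromosomes_and_length, species_code):
    """Build Circos karyotype lines (tab-separated) for each chromosome."""
    lines = []
    for chrom, length in chromosomes_and_length.items():
        # Columns: chr, parent (-), ID, label, start, end, color
        lines.append("chr\t-\t{code}-{chrom}\t{chrom}\t0\t{length}\tblack".format(
            code=species_code, chrom=chrom, length=length))
    return lines

# Two yeast chromosomes, with lengths kept as strings as the script does.
sample = {"chrIV": "1531933", "chrXV": "1091291"}
lines = sketch_karyotype_lines(sample, "yeast")
for line in lines:
    print(line)
```

Each printed line starts with the custom code in its ID column, for example yeast-chrIV, which is what later Circos configuration and data files would refer to.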

If you are not running this on your own machine, save the produced karyotype file to your local machine.
Enjoy!