Automating a Workflow: Beyond Blast - to GO Slim

In [1]:

!date

Fri Feb 14 08:10:37 PST 2014

Updates - blast is now called by its full path
subsequently removed use of the 'blast' variable, since the full path is now used

--

have to manually change the SQLShare ID in the code (for now)

The concept is that you can take a FASTA file in a working directory and end up with GO Slim information, all within a single automated notebook. Currently this works by writing (and overwriting) a scratch file to SQLShare. The assumptions are that you are working in a directory containing a FASTA file named query.fa and that you have the SQLShare Python client installed.
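A quick pre-flight check can confirm those assumptions before anything runs. This is only a sketch: `check_prerequisites` is a hypothetical helper (not part of the SQLShare client), and it assumes the two client scripts called later in this notebook, `singleupload.py` and `fetchdata.py`, sit directly in the tools directory. If the tools directory is given as an empty string, the script check is skipped.

```python
import os

def check_prerequisites(workdir, tools_dir, fasta_name="query.fa"):
    """Return a list of problems found; an empty list means ready to run.

    Hypothetical helper for illustration only.
    """
    problems = []
    fasta_path = os.path.join(workdir, fasta_name)
    if not os.path.isfile(fasta_path):
        problems.append("missing FASTA file: " + fasta_path)
    if tools_dir:
        # These are the two SQLShare client scripts this notebook calls.
        for script in ("singleupload.py", "fetchdata.py"):
            script_path = os.path.join(tools_dir, script)
            if not os.path.isfile(script_path):
                problems.append("missing SQLShare tool: " + script_path)
    return problems
```

Calling `check_prerequisites(wd, spd)` once the settings cell has run and printing the returned list would flag a missing query.fa before the BLAST step is attempted.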

In [1]:

#allows plots to be shown inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib

In [5]:

#Setting working directory
wd="/Users/Mackenzie/Desktop/FISH546/wd"
#Setting directory of Blast databases !!! make sure it ends with a trailing '/'
dbd="/Users/Mackenzie/Desktop/FISH546/db/"
#Database name
dbn="spdb"
#Complete path to the Blast algorithm executable
ba="/Users/Shared/Apps/ncbi-blast-2.2.29\+/bin/blastx"
#Location of SQLShare python tools: can be left empty ("") if tools are in PATH !!! make sure it ends with a trailing '/'
spd="/Users/Mackenzie/sqlshare-pythonclient/tools/"
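The trailing '/' matters because the later cells glue the directory and database name together by plain string concatenation. One way to make that less fragile, sketched here with a hypothetical helper name `blast_db_path`, is to join the pieces with `os.path.join`, which gives the same result with or without the trailing slash.

```python
import os

def blast_db_path(db_dir, db_name):
    """Join a database directory and name; a trailing '/' on db_dir is optional.

    Hypothetical helper for illustration only.
    """
    return os.path.join(db_dir, db_name)
```

For example, `blast_db_path(dbd, dbn)` could be stored in a variable and used in place of `{dbd}{dbn}` in the BLAST command.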

In [6]:

cd {wd}

[Errno 13] Permission denied: '/Users/Mackenzie/Desktop/FISH546/wd'
/Users/Steven/Dropbox/Steven/ipython_nb/tools
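Note that when `cd` fails like this, the notebook stays in whatever directory it was already in, so later cells would read and write files there instead. A small guard, sketched below with the hypothetical name `usable_workdir`, can confirm the working directory is usable before continuing.

```python
import os

def usable_workdir(path):
    """True if path is an existing directory we can enter, read, and write.

    Hypothetical helper for illustration only.
    """
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK)
```

Checking `usable_workdir(wd)` before the `cd {wd}` cell would catch the permission problem shown in the output above.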

In [ ]:

#The max hsp option produced an error for some reason, so it was removed

In [17]:

!{ba} -query query.fa -db {dbd}{dbn} -out {dbn}_blast_out.tab -evalue 1E-50 -num_threads 4 -max_target_seqs 1 -outfmt 6

Selenocysteine (U) at position 52 replaced by X
Selenocysteine (U) at position 49 replaced by X
Selenocysteine (U) at position 47 replaced by X
Selenocysteine (U) at position 47 replaced by X
Selenocysteine (U) at position 47 replaced by X
Selenocysteine (U) at position 47 replaced by X
Selenocysteine (U) at position 52 replaced by X
Selenocysteine (U) at position 47 replaced by X
Selenocysteine (U) at position 47 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 40 replaced by X
Selenocysteine (U) at position 690 replaced by X
Selenocysteine (U) at position 690 replaced by X
Selenocysteine (U) at position 667 replaced by X
Selenocysteine (U) at position 667 replaced by X
Selenocysteine (U) at position 665 replaced by X
Selenocysteine (U) at position 665 replaced by X
^C
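The same command can also be assembled as an argument list in Python, which keeps every flag visible and avoids shell quoting. This is a sketch under assumptions: `build_blast_command` is a hypothetical helper, the flags are copied from the cell above, and because no shell is involved when a list is handed to `subprocess`, the backslash before `+` in the `ba` string would presumably need to be dropped from the path.

```python
def build_blast_command(blast_path, db_dir, db_name, query="query.fa", threads=4):
    """Return (argument list, output file name) for the blastx run above.

    Hypothetical helper; flag values mirror the notebook's shell command.
    """
    out_file = db_name + "_blast_out.tab"
    cmd = [
        blast_path,
        "-query", query,
        "-db", db_dir + db_name,     # db_dir must end with '/'
        "-out", out_file,
        "-evalue", "1E-50",
        "-num_threads", str(threads),
        "-max_target_seqs", "1",
        "-outfmt", "6",              # tabular output
    ]
    return cmd, out_file

# Possible usage (not run here):
# import subprocess
# cmd, out_file = build_blast_command(blast_path, dbd, dbn)
# subprocess.call(cmd)
```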

In [28]:

!head -1 {dbn}_blast_out.tab

ConsensusfromContig5	sp|Q9JHQ5|LZTL1_MOUSE	74.40	125	31	1	7	378	24	148	1e-59	  192

In [29]:

#Translate pipes to tab so SPID is in separate column for Joining
!tr '|' "\t" < {dbn}_blast_out.tab > {dbn}_blast_out2.tab

In [30]:

!head -1 {dbn}_blast_out2.tab

ConsensusfromContig5	sp	Q9JHQ5	LZTL1_MOUSE	74.40	125	31	1	7	378	24	148	1e-59	  192
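For anyone without `tr` available, the same pipe-to-tab translation can be done in plain Python. A minimal sketch follows; `pipes_to_tabs` is a hypothetical helper name. After the translation the accession (Q9JHQ5 in the line above) lands in the third column, which is the column the later join uses.

```python
def pipes_to_tabs(src_path, dst_path):
    """Copy src_path to dst_path, replacing every '|' with a tab.

    Hypothetical helper; equivalent in effect to tr '|' "\\t".
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.replace("|", "\t"))
```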

In [35]:

#Uploads the formatted blast table to SQLShare. The table has a generic name and is meant to be temporary. Warning: this will overwrite any existing table of the same name.
!python {spd}singleupload.py -d scratchblast_out {dbn}_blast_out2.tab

processing chunk line 0 to 153 (0.000229120254517 s elapsed)
pushing spdb_blast_out2.tab...
parsing 40DB86D8...
finished scratchblast_out

In [36]:

!python {spd}fetchdata.py -s "SELECT * FROM [mgavery@washington.edu].[scratchblast_out]blast Left Join [sr320@washington.edu].[uniprot-reviewed_wGO_010714]unp ON blast.Column3 = unp.Entry Left Join [sr320@washington.edu].[SPID and GO Numbers]go ON unp.Entry = go.SPID Left Join [sr320@washington.edu].[GO_to_GOslim]slim ON slim.GO_id = go.GOID" -f tsv -o {dbn}_join2goslim.txt
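Since the SQLShare account in this query has to be edited by hand, one option is to build the query string from a variable. The sketch below reproduces the query text from the cell above; `build_goslim_join` is a hypothetical helper, and only the account (and optionally the scratch table name) is parameterized. The three reference tables stay under the sr320 account as written.

```python
def build_goslim_join(account, table="scratchblast_out"):
    """Build the left-join query, qualifying the scratch table with `account`.

    Hypothetical helper; the SQL text is copied from the notebook cell.
    """
    return (
        "SELECT * FROM [{account}].[{table}]blast "
        "Left Join [sr320@washington.edu].[uniprot-reviewed_wGO_010714]unp "
        "ON blast.Column3 = unp.Entry "
        "Left Join [sr320@washington.edu].[SPID and GO Numbers]go "
        "ON unp.Entry = go.SPID "
        "Left Join [sr320@washington.edu].[GO_to_GOslim]slim "
        "ON slim.GO_id = go.GOID"
    ).format(account=account, table=table)
```

The result could be stored in a notebook variable and passed to `fetchdata.py -s` in place of the literal string, so the account is set once in the settings cell.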

In [37]:

!head -2 {dbn}_join2goslim.txt

In [38]:

!python {spd}singleupload.py -d scratchjoin_slim {dbn}_join2goslim.txt

processing chunk line 0 to 1978 (0.00637292861938 s elapsed)
pushing spdb_join2goslim.txt...
parsing 94DDEBBA...
finished scratchjoin_slim

In [39]:

#Keeps only rows with GO aspect 'P' (biological process)
!python {spd}fetchdata.py -s "SELECT Distinct Column1 as query, Column3 as SPID, GOSlim_bin FROM [mgavery@washington.edu].[scratchjoin_slim] Where aspect = 'P'" -f tsv -o justslim.txt

In [40]:

!head justslim.txt

In [3]:

from pandas import *

jslim = read_table("justslim.txt", # name of the data file
            #sep=",", # what character separates each column?
            na_values=["", " "]) # what values should be considered "blank" values?

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-3-f6b9dbe27bfa> in <module>()
      3 jslim = read_table("justslim.txt", # name of the data file
      4             #sep=",", # what character separates each column?
----> 5             na_values=["", " "]) # what values should be considered "blank" values?

//anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    399                     buffer_lines=buffer_lines)
    400 
--> 401         return _read(filepath_or_buffer, kwds)
    402 
    403     parser_f.__name__ = name

//anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    207 
    208     # Create the parser.
--> 209     parser = TextFileReader(filepath_or_buffer, **kwds)
    210 
    211     if nrows is not None:

//anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    507             self.options['has_index_names'] = kwds['has_index_names']
    508 
--> 509         self._make_engine(self.engine)
    510 
    511     def _get_options_with_defaults(self, engine):

//anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    609     def _make_engine(self, engine='c'):
    610         if engine == 'c':
--> 611             self._engine = CParserWrapper(self.f, **self.options)
    612         else:
    613             if engine == 'python':

//anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
    891         # #2442
    892         kwds['allow_leading_cols'] = self.index_col is not False
--> 893         self._reader = _parser.TextReader(src, **kwds)
    894 
    895         # XXX

//anaconda/lib/python2.7/site-packages/pandas/_parser.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:2771)()

//anaconda/lib/python2.7/site-packages/pandas/_parser.so in pandas._parser.TextReader._setup_parser_source (pandas/src/parser.c:4803)()

IOError: File justslim.txt does not exist
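The traceback above boils down to justslim.txt not being in the current directory, most likely because the earlier steps were not run from here. A guard like the following sketch gives a shorter message. It uses only the standard library; `load_slim_rows` is a hypothetical helper, and it assumes the fetched file is tab-separated with a header row naming the columns (query, SPID, GOSlim_bin).

```python
import csv
import os

def load_slim_rows(path="justslim.txt"):
    """Read a tab-separated file with a header row into a list of dicts.

    Hypothetical helper; raises IOError with a short message if the
    file is missing.
    """
    if not os.path.isfile(path):
        raise IOError("{0} not found in {1}; was the fetch step run here?".format(
            path, os.getcwd()))
    with open(path) as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```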

In [ ]:

#bracket access avoids any clash between the 'query' column and pandas' own query method
jslim.groupby('GOSlim_bin')['query'].count().plot(kind='bar')

In [43]:

!say "hash tag winning"

Below is optional

In [13]:

#Could also upload again to get a simple count table
#(the same summary could be done in pandas)

#!python {spd}singleupload.py -d scratchpie justslim.txt

processing chunk line 0 to 2538 (0.00250601768494 s elapsed)
pushing justslim.txt...
parsing 87B0B7A8...
finished scratchpie

In [14]:

#fetching data grouped by GObin

#!python {spd}fetchdata.py -s "SELECT GOSlim_bin, COUNT(GOSlim_bin) as termcount from [sr320@washington.edu].[scratchpie] Group by GOSlim_bin" -f tsv -o justpie.txt
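The same GROUP BY count can be done locally without another upload. Here is a minimal standard-library sketch; `count_terms_per_bin` is a hypothetical helper, it expects rows as dictionaries with a `GOSlim_bin` key, and it skips blank bins in the same spirit as SQL `COUNT` ignoring NULLs.

```python
from collections import Counter

def count_terms_per_bin(rows):
    """Count rows per GOSlim_bin, skipping rows whose bin is blank or missing.

    Hypothetical helper mirroring the GROUP BY / COUNT query above.
    """
    return Counter(row["GOSlim_bin"] for row in rows if row.get("GOSlim_bin"))

# Toy rows with made-up bin names, for illustration only.
toy_rows = [
    {"query": "contigA", "GOSlim_bin": "bin one"},
    {"query": "contigB", "GOSlim_bin": "bin one"},
    {"query": "contigC", "GOSlim_bin": "bin two"},
    {"query": "contigD", "GOSlim_bin": ""},
]
toy_counts = count_terms_per_bin(toy_rows)
```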

In [ ]: