Excerpt from "Coding for Scientists"
(C) Fabrizio Smeraldi 2014
http://www.eecs.qmul.ac.uk/~fabri
Queen Mary, University of London
Regular expressions (or REGEX) are compact ways of summarising a text pattern. Simple instances of such expressions are very common indeed: for instance typing
ls *.py
in a Linux shell will list all files that end in .py. The character * is known as a wildcard.
Regular expressions help with extracting information from text files (e.g. BLAST output or FASTA files) by locating particular patterns. For instance, in a FASTA header such as the one below, the accession number comes between a ">" and a "|": such a pattern can be easily described by a regular expression.
>P04637|P53_HUMAN Cellular tumor antigen p53 - Homo sapiens (Human).
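As a preview of what follows, here is a minimal sketch of how the accession number could be pulled out of the header above. It relies on a capturing group (the parentheses), a feature this excerpt does not otherwise cover, and the variable names are only illustrative. It is written so that it runs under both Python 2 and Python 3.

```python
import re

header = ">P04637|P53_HUMAN Cellular tumor antigen p53 - Homo sapiens (Human)."

# ">" must be the first character of the header; ([^|]+) captures
# everything up to the first "|"; [|] matches the bar itself
mo = re.match(">([^|]+)[|]", header)
if mo is not None:
    accession = mo.group(1)   # the text captured by the parentheses
    print(accession)
```

Headers in the UniProt style that appears at the end of this excerpt (for example ">sp|P08629|THIO_CHICK ...") place the accession between two bars, so the pattern would need adjusting for them.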
Also, databases such as PROSITE list regular expressions that can be applied directly to protein sequences to identify particular families of proteins or domains. http://prosite.expasy.org/
Regular expression syntax in Python is very similar to Perl syntax, so migrating between the two languages should not be difficult.
The re module
In Python, REGEXP support is provided by the re module. Simple usage is straightforward:
import re
# mo is a "match object"
mo=re.search("hello", "Hello world, hello Python!")
print mo.group()
print mo.span()
hello
(13, 18)
This is not too different from the .index() method of a string:
print "Hello world, hello Python!".index("hello")
13
But it is a lot more flexible:
re.findall("[Hh][ea]llo", "Hallo world, hello Python!")
['Hallo', 'hello']
Here each pair of square brackets matches any one character from the set it encloses.
If a match is not found, the search returns None:
mo=re.search("hello", "Hi world!")
print mo
None
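Because a failed search returns None rather than a match object, calling .group() on the result directly would raise an AttributeError. A minimal sketch of the usual guard follows; the helper name first_match is made up for this example.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    mo = re.search(pattern, text)
    if mo is None:          # no match: there is no match object to query
        return None
    return mo.group()

found = first_match("hello", "Hello world, hello Python!")
missing = first_match("hello", "Hi world!")
print(found)     # hello
print(missing)   # None
```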
We have already seen .search(), which finds the first match only, and .findall(), which returns every match.
The re module offers four matching functions:
Method/Attribute | Purpose |
---|---|
match() | Determine if the RE matches at the beginning of the string. |
search() | Scan through a string, looking for any location where this RE matches. |
findall() | Find all substrings where the RE matches, and return them as a list. |
finditer() | Find all substrings where the RE matches, and return them as an iterator(*). |
(*) An iterator works very much like a list in that, for instance, you can loop over it, but its items are computed on the fly as they are needed, so it is more memory-efficient.
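The differences between the four functions are easiest to see side by side. The short sketch below reuses the pattern and string from the earlier example; the variable names are only illustrative.

```python
import re

text = "Hallo world, hello Python!"
pattern = "[Hh][ea]llo"

at_start = re.match(pattern, text)        # succeeds: "Hallo" is at position 0
not_at_start = re.match("hello", text)    # None: "hello" is not at the start
first = re.search(pattern, text)          # first match anywhere in the string
all_matches = re.findall(pattern, text)   # list of the matching substrings

# finditer yields one match object per match, so positions are available too
spans = [mo.span() for mo in re.finditer(pattern, text)]

print(at_start.group())   # Hallo
print(not_at_start)       # None
print(first.span())       # (0, 5)
print(all_matches)        # ['Hallo', 'hello']
print(spans)              # [(0, 5), (13, 18)]
```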
For reasons of efficiency, if a pattern is going to be used repeatedly, it is best to compile it. This is done as follows:
rgx=re.compile("[Hh][ea]llo")
rgx.findall("Hallo world, hello Python!")
['Hallo', 'hello']
The same four functions listed above are available as methods of the compiled pattern object.
Regular expressions are a powerful tool, though a bit tedious to learn. Besides matching very complex patterns, they can be used to split a string wherever a pattern matches and to substitute the text that matched. I invite you to have a look at the official tutorial to get a feeling for what can be done: https://docs.python.org/2/howto/regex.html#regex-howto
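To give a flavour of splitting and substitution, here is a small sketch using re.split() and re.sub(). The input strings are invented for the example, and the patterns deliberately avoid backslashes.

```python
import re

record = "P04637;  P08629,P09102 ;Q8JG64"

# split at every comma or semicolon, swallowing any spaces around it
# (" *" means "zero or more spaces")
fields = re.split(" *[;,] *", record)
print(fields)   # ['P04637', 'P08629', 'P09102', 'Q8JG64']

# replace every run of one or more spaces with a single underscore
label = re.sub(" +", "_", "Cellular tumor   antigen p53")
print(label)    # Cellular_tumor_antigen_p53
```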
As you will see, REGEXP syntax makes heavy use of backslashes. This is a problem in Python, because inside an ordinary string literal a backslash is interpreted as the start of an escape sequence:
print "escape\nsequence"
escape
sequence
The solution is to use the Python "raw string" syntax by prepending an "r" to the string in question:
print r"escape\nsequence"
escape\nsequence
To be on the safe side, you may want to put an "r" before all of the regular expressions you write. Example:
solomon="""
Solomon Grundy,
Born on a Monday,
Christened on Tuesday,
Married on Wednesday,
Took ill on Thursday,
Grew worse on Friday,
Died on Saturday,
Buried on Sunday.
That was the end of,
Solomon Grundy."""
# \w+ matches one or more "word" characters (letters, digits or underscore)
rgx=re.compile(r"\w+day")
rgx.findall(solomon)
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
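The example above would have worked even without the "r", because "\w" is not a Python escape sequence. A case where the prefix does change the result is "\b", which is a backspace character in an ordinary Python string but a word boundary in REGEXP syntax. The following sketch, with an invented test string, shows the difference:

```python
import re

text = "Born on a Monday"

# Without the r prefix Python converts "\b" to a backspace character
# before the re module sees the pattern, so nothing matches
plain = re.findall("\bon\b", text)

# With the r prefix, \b reaches the re module intact and means
# "word boundary", so only the stand-alone word "on" matches
raw = re.findall(r"\bon\b", text)

print(plain)   # []
print(raw)     # ['on']
```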
The Thioredoxin pattern listed on PROSITE with accession number PS00194 (http://prosite.expasy.org/PS00194) is the following:
[LIVMF]-[LIVMSTA]-x-[LIVMFYC]-[FYWSTHE]-x(2)-[FYWGTN]-C-[GATPLVE]-
[PHYWSTA]-C-{I}-x-{A}-x(3)-[LIVMFYWT].
We can easily translate this to a Python REGEXP:
r'[LIVMF][LIVMSTA]\w[LIVMFYC][FYWSTHE]\w\w[FYWGTN]C[GATPLVE][PHYWSTA]C[^I]\w[^A]\w\w\w[LIVMFYWT]'
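where "\w" stands in for the PROSITE "x" (it matches any letter, digit or underscore, and therefore any amino acid code) and, for example, [^I] will match anything except an I. The following code scans the chicken proteome for matches and prints out the header line, which includes the accession number, of each protein that matches.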
(The chicken proteome can be retrieved from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomes/)
""" Browse chicken genome and find all proteins that match
PROSITE pattern PS00194 (THIOREDOXIN_1) """
import re
# Compile the regexp
PS00194=(r'[LIVMF][LIVMSTA]\w[LIVMFYC][FYWSTHE]\w\w[FYWGTN]'+
         r'C[GATPLVE][PHYWSTA]C[^I]\w[^A]\w\w\w[LIVMFYWT]')
rgx=re.compile(PS00194)
INFILE=open("CHICK.fasta", "r")
seq="" # build sequence here
header="" # name of protein
for line in INFILE:
    if line[0]==">": # this line is a header
        # search protein we just read and print header
        # if pattern is found
        if rgx.search(seq) is not None:
            print header
        # update header and reset sequence
        header=line.rstrip()
        seq=""
    else: # this line contains part of the sequence
        seq+=line.rstrip() # remove trailing newline
# process the last protein
if rgx.search(seq) is not None:
    print header
INFILE.close()
>tr|E1BRA6|E1BRA6_CHICK Uncharacterized protein OS=Gallus gallus GN=DNAJC10 PE=4 SV=2
>tr|E1BUP6|E1BUP6_CHICK Uncharacterized protein (Fragment) OS=Gallus gallus GN=PDIA5 PE=4 SV=2
>tr|E1BXX8|E1BXX8_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNDC3 PE=4 SV=1
>tr|E1BZS8|E1BZS8_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNL1 PE=4 SV=1
>tr|E1C549|E1C549_CHICK Uncharacterized protein OS=Gallus gallus GN=VPS13A PE=4 SV=2
>tr|E1C928|E1C928_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNRD3 PE=3 SV=1
>tr|F1N9H3|F1N9H3_CHICK Protein disulfide-isomerase OS=Gallus gallus GN=P4HB PE=3 SV=2
>tr|F1NCD5|F1NCD5_CHICK Uncharacterized protein OS=Gallus gallus GN=TXN2 PE=4 SV=2
>tr|F1NDY9|F1NDY9_CHICK Protein disulfide-isomerase A4 OS=Gallus gallus GN=PDIA4 PE=3 SV=1
>tr|F1NK96|F1NK96_CHICK Uncharacterized protein OS=Gallus gallus GN=PDIA6 PE=3 SV=1
>tr|F1NLC7|F1NLC7_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNDC12 PE=4 SV=2
>tr|F1P212|F1P212_CHICK Uncharacterized protein (Fragment) OS=Gallus gallus GN=TMX3 PE=4 SV=2
>tr|F1P4H4|F1P4H4_CHICK Uncharacterized protein OS=Gallus gallus GN=TXNDC5 PE=3 SV=1
>sp|P08629|THIO_CHICK Thioredoxin OS=Gallus gallus GN=TXN PE=3 SV=2
>sp|P09102|PDIA1_CHICK Protein disulfide-isomerase OS=Gallus gallus GN=P4HB PE=1 SV=3
>sp|P12244|GSBP_CHICK Dolichyl-diphosphooligosaccharide--protein glycotransferase OS=Gallus gallus PE=2 SV=2
>sp|Q8JG64|PDIA3_CHICK Protein disulfide-isomerase A3 OS=Gallus gallus GN=PDIA3 PE=2 SV=1
>tr|R4GFY2|R4GFY2_CHICK Uncharacterized protein OS=Gallus gallus GN=TMX4 PE=4 SV=1
>tr|R4GGT2|R4GGT2_CHICK Uncharacterized protein (Fragment) OS=Gallus gallus GN=LOC100857897 PE=4 SV=1
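The hand translation of the PROSITE pattern into a Python REGEXP can also be automated. What follows is a rough sketch rather than a complete PROSITE parser: it handles only the elements that occur in PS00194 (sets in square brackets, exclusions in curly brackets, "x", and fixed repeat counts such as x(3)), and the function name prosite_to_regex is invented for this example. It uses \w for "x", as in the translation given earlier, and writes repeats as \w{3} rather than \w\w\w, which is equivalent. The test sequence is made up to satisfy the pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python REGEXP syntax."""
    pattern = pattern.strip().rstrip(".")    # drop the final full stop
    parts = []
    for element in pattern.split("-"):       # elements are separated by "-"
        element = element.strip()
        if not element:
            continue
        # {I} means "anything except I": becomes [^I]
        element = element.replace("{", "[^").replace("}", "]")
        # x means "any amino acid": becomes \w
        element = element.replace("x", r"\w")
        # a repeat count such as (3) becomes {3}
        element = re.sub(r"\((\d+)\)", r"{\1}", element)
        parts.append(element)
    return "".join(parts)

PS00194 = ("[LIVMF]-[LIVMSTA]-x-[LIVMFYC]-[FYWSTHE]-x(2)-[FYWGTN]-C-[GATPLVE]-"
           "[PHYWSTA]-C-{I}-x-{A}-x(3)-[LIVMFYWT].")

regex = prosite_to_regex(PS00194)
print(regex)

# invented sequence fragment that satisfies every position of the pattern
mo = re.search(regex, "MLVDFWAEWCGPCKMIAPILQ")
print(mo.group())
```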