This notebook is about searching for and managing substrings (matches) of a string. This is useful for extracting pieces of information from a text, for example when parsing dates, URLs, e-mails, data lists, configuration files or programming scripts. Python offers some string methods that cover the simplest requirements, but the most powerful solution is offered by a language-independent pattern matching standard: regular expressions.
Regular expressions are a sort of very specialized programming language made of special text strings (meta-characters) designed for describing a search pattern. Python has several packages for working with regular expressions, such as re, the regular expression module included in the Python standard library, or pyregex, a new external package under development (not treated in this notebook).
One of the most common requirements is to find a given word, or a set of characters or numbers, in a text. The find method returns the index at which the first occurrence of the searched substring starts, if a match is found; it returns -1 otherwise.
string = "this is string example....wow!!!"
part = "wow!!!"
part2 = "strong"
print(string.find(part))
print(string.find(part2))
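Since find reports only the first occurrence, a small loop is needed to collect all of them. The following is a minimal sketch that uses the optional start argument of find; the helper name find_all is hypothetical and not part of the standard library.

```python
def find_all(text, part):
    """Return the start index of every non-overlapping occurrence of part in text.

    find_all is a hypothetical helper written for this example.
    """
    if not part:
        return []  # an empty substring would never advance the search
    positions = []
    start = text.find(part)
    while start != -1:
        positions.append(start)
        # Resume the search just past the occurrence that was found
        start = text.find(part, start + len(part))
    return positions


string = "this is string example....wow!!!"
print(find_all(string, "is"))      # [2, 5]
print(find_all(string, "strong"))  # []
```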
Other methods help to clean a string and extract only the useful information:
string = "0000000this is string example....wow!!!0000000"
print(string.strip('0'))
print(string.lstrip('0'))
print(string.rstrip('0'))
this is string example....wow!!!
this is string example....wow!!!0000000
0000000this is string example....wow!!!
string = "this is string example....wow!!!"
spl = string.replace('string', 'good')
spl
'this is good example....wow!!!'
Python strings also provide a series of methods, and even simple idiomatic expressions using basic operators such as in, that return True or False: isalnum (alphanumeric), isalpha (alphabetic only), isdigit (digits), isspace (whitespace), islower (lowercase), isupper (uppercase), istitle (titlecase, i.e. every word in the string starts with an uppercase letter), startswith and endswith.
"a" in 'xyxxyabcxyzzy'
string = "this is string example....wow!!!"
print(string.startswith('this'))
print(string.startswith('is'))
print(string.startswith('string', 8))  # start index at the matching boundary
True
False
True
string = 'this'
string.isalpha()
string = 'this '  # whitespace is not alphabetic!
print(string.isalpha())
Try the other methods yourself:
string = 'this'
print(string.isupper())
print(string.islower())
print(string.istitle())
print(string.isalnum())
print(string.isspace())
print(string.isdigit())
Also try modifying the string:
mod = string.upper()
mod = string.title()  # overwrites the previous value; comment out one of the two lines to experiment
print(mod.isupper())
Let's see how we could clean a string containing some unwanted elements, using Python built-in methods:
string = "this 44444is a99999 dirty 678435 string xxxxxxexample....wow000000!!!"
spl = string.split()
print(spl)
['this', '44444is', 'a99999', 'dirty', '678435', 'string', 'xxxxxxexample....wow000000!!!']
ls = []
for item in spl:
    if 'xxx' in item:
        item = item.lstrip('x')
    # Drop every digit from the item
    result = ''.join([e for e in item if not e.isdigit()])
    if result:  # needed to exclude empty strings
        ls.append(result)
print('The temporary cleaned list looks like this: ', ls)
The temporary cleaned list looks like this: ['this', 'is', 'a', 'dirty', 'string', 'example....wow!!!']
Join the list back into a string, then complete the cleaning and make a slight modification:
string = ' '.join(ls)
# Chain the replacements so that the second one does not discard the first
final = string.replace('....', ', ')
final = final.replace('dirty', 'clean')
print(final)
this is a clean string example, wow!!!
For simple string management, Python built-in methods are enough, but for more complex tasks regular expressions are the best solution for pattern matching.
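For comparison, here is a minimal sketch of how the dirty string cleaned earlier with built-in methods might be handled with the re module. The pattern is an assumption chosen for this particular string (runs of digits, or runs of three or more x characters); it is not a general-purpose cleaner.

```python
import re

dirty = "this 44444is a99999 dirty 678435 string xxxxxxexample....wow000000!!!"

# Remove runs of digits and runs of three or more 'x' characters
# (the single 'x' inside "example" is left alone)
no_noise = re.sub(r'\d+|x{3,}', '', dirty)

# Collapse the extra whitespace left where a whole token was removed
cleaned = ' '.join(no_noise.split())
print(cleaned)  # this is a dirty string example....wow!!!
```

One substitution replaces the loop, the lstrip call and the digit filter used in the built-in version.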
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. Regular expressions may be used for retrieving the parts of longer strings that match some desired criteria. Dealing with regular expressions may seem complex at the beginning, since they are made of both regular and special characters concatenated in a sequence that is hard to understand at first sight. But once they are fully assimilated, they become a powerful helper for parsing any kind of text. The most basic regular expressions are single literal characters; for example, "a" will look for all occurrences of "a" in a text. But there are some special characters, also called meta-characters, which are combined and concatenated with regular characters to build the regular expression search patterns. The meta-characters used by regular expressions are: . ^ $ * + ? { } [ ] \ | ( )
The following page provides a list of regular expressions and a description of their use:
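To make the role of a few meta-characters concrete, the sketch below runs several small patterns against the same sample text using re.findall from the standard library. The sample text and the variable names are made up for illustration.

```python
import re

text = "cat, cot, coat and cart"  # made-up sample text

examples = {
    "any single character (.)": re.findall(r'c.t', text),      # ['cat', 'cot']
    "character set ([ao])":     re.findall(r'c[ao]t', text),   # ['cat', 'cot']
    "optional (?)":             re.findall(r'co?a?t', text),   # ['cat', 'cot', 'coat']
    "one or more (+)":          re.findall(r'c\w+t', text),    # ['cat', 'cot', 'coat', 'cart']
    "alternation (|)":          re.findall(r'coat|cart', text),  # ['coat', 'cart']
    "start of string (^)":      re.findall(r'^c\w+', text),    # ['cat']
    "end of string ($)":        re.findall(r'c\w+$', text),    # ['cart']
    "capturing group (())":     re.findall(r'c(\w)t', text),   # ['a', 'o']
}

for description, matches in examples.items():
    print(description, '->', matches)
```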
from IPython.display import HTML
HTML('<iframe src=https://help.libreoffice.org/Common/List_of_Regular_Expressions width=700 height=250>')
Online there are many tools suitable for testing the effectiveness of a regular expression, such as:
HTML('<iframe src=http://pythex.org// width=700 height=250>')
The re module is the Python standard library module for regular expressions. It offers functions for compiling regular expressions into pattern objects, which are then used to search for, manage and return the expected matches.
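As a minimal sketch of the compile step, the example below builds a pattern object once with re.compile and reuses it on several strings. The sample lines and the names email_pattern and found are made up for illustration.

```python
import re

# Compile the pattern once; the resulting pattern object can be reused
email_pattern = re.compile(r'[\w.-]+@[\w.-]+')

lines = [  # made-up sample data
    'contact: email@example.com',
    'no address here',
    'reply to firstname.lastname@example.org today',
]

found = []
for line in lines:
    match = email_pattern.search(line)
    # search() returns None when the pattern does not occur in the line
    found.append(match.group() if match else None)

print(found)
# ['email@example.com', None, 'firstname.lastname@example.org']
```

Compiling is mostly a convenience when the same pattern is applied many times; the module-level functions such as re.search accept the pattern string directly.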
re.match() is suitable to find a match at the beginning of a string.
import re

line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
re.search() is similar, but it finds a match anywhere inside a text.
string = 'purple firstname.lastname@example.org monkey dishwasher'
match = re.search(r'@[\w.]+', string)
if match:
    print(match.group())  ## '@example.org'
match = re.search(r'[\w.-]+@[\w.-]+', string)
if match:
    print(match.group())  ## 'firstname.lastname@example.org'
string = 'purple firstname.lastname@example.org monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', string)
if match:
    print(match.group())   ## 'firstname.lastname@example.org' (the whole match)
    print(match.group(1))  ## 'firstname.lastname' (the username, group 1)
    print(match.group(2))  ## 'example.org' (the host, group 2)
firstname.lastname@example.org
firstname.lastname
example.org
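The difference between re.match() and re.search() is easiest to see by applying both to the same string. The sketch below reuses the sentence from the re.match() example; the variable names are made up for illustration.

```python
import re

line = "Cats are smarter than dogs"

# match() only succeeds if the pattern matches at the very beginning
at_start = re.match(r'dogs', line)
print(at_start)  # None

# search() scans the whole string for the first location that matches
anywhere = re.search(r'dogs', line)
print(anywhere.group(), anywhere.start())  # dogs 22

# match() does succeed when the pattern occurs at the start
# (re.I makes the comparison case-insensitive)
ignore_case = re.match(r'cats', line, re.I)
print(ignore_case.group())  # Cats
```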
re.findall() finds all the matches for a given regular expression.
## Suppose we have a text with many email addresses
string = 'purple email@example.com, blah monkey firstname.lastname@example.org blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', string)  ## ['email@example.com', 'firstname.lastname@example.org']
for email in emails:
    # do something with each found email string
    print(email)
string = 'purple email@example.com, blah monkey firstname.lastname@example.org blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', string)
print(tuples)  ## [('email', 'example.com'), ('firstname.lastname', 'example.org')]
for tup in tuples:
    print(tup[0])  ## username
    print(tup[1])  ## host
[('email', 'example.com'), ('firstname.lastname', 'example.org')]
email
example.com
firstname.lastname
example.org
re.sub() replaces occurrences of the regex pattern with a given substitute.
phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)
Phone Num : 2004-959-559
Phone Num : 2004959559
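Two further features of re.sub() may be useful here: the replacement string can refer to captured groups with \1, \2 and so on, and the count argument limits the number of replacements. The following is a minimal sketch built on the same phone string; the variable names are made up for illustration.

```python
import re

phone = "2004-959-559 # This is Phone Number"

# Strip the comment together with the whitespace in front of it
num = re.sub(r'\s*#.*$', '', phone)
print(num)  # 2004-959-559

# \1, \2 and \3 in the replacement refer to the three captured groups
spaced = re.sub(r'(\d+)-(\d+)-(\d+)', r'\1 \2 \3', num)
print(spaced)  # 2004 959 559

# count=1 replaces only the first occurrence of the pattern
first_only = re.sub(r'-', '', num, count=1)
print(first_only)  # 2004959-559
```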
Opening a .txt file and looking for matches; in this case, finding all words preceded or followed by a - symbol.
# Open file
import os.path
path = os.path.join(os.path.curdir, "example_data", "small_is_beautiful.txt")
# The with statement closes the file automatically
with open(path, 'r') as f:
    text = f.read()
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'(\w*-\w*)', text)
print("Matches are: ", strings)
# The text parsed by the findall method follows:
print('text is: \n', text)
Matches are: ['so-called', 'is-', '-or', 'been-', '-the', 'laissez-faire', 'self-evident']
text is:
No system or machinery or economic doctrine or theory stands on its own feet: it is invariably built on a metaphysical foundation, that is to say, upon man's basic outlook on life, its meaning and its purpose. I have talked about the religion of economics, the idol worship of material possessions, of consumption and the so-called standard of living, and the fateful propensity that rejoices in the fact that 'what were luxuries to our fathers have become necessities for us.' Systems are never more no less than incarnations of man's most basic attitudes. . . . General evidence of material progress would suggest that the modern private enterprise system is--or has been--the most perfect instrument for the pursuit of personal enrichment. The modern private enterprise system ingeniously employs the human urges of greed and envy as its motive power, but manages to overcome the most blatant deficiencies of laissez-faire by means of Keynesian economic management, a bit of redistributive taxation, and the 'countervailing power' of the trade unions. Can such a system conceivably deal with the problems we are now having to face? The answer is self-evident: greed and envy demand continuous and limitless economic growth of a material kind, without proper regard for conservation, and this type of growth cannot possibly fit into a finite environment. We must therefore study the essential nature of the private enterprise system and the possibilities of evolving an alternative system which might fit the new situation.
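The pattern \w*-\w* also returns fragments such as 'is-' and '-or', because the text uses a double hyphen as a dash. If only genuinely hyphenated words are wanted, one possible variant requires at least one word character on both sides of every hyphen. The sketch below compares the two patterns on a shortened excerpt of the text above, so that it runs without the data file; the variable names are made up for illustration.

```python
import re

# Shortened excerpt of the text above (ellipses mark omitted passages)
sample = ("the so-called standard of living ... the modern private enterprise "
          "system is--or has been--the most perfect instrument ... "
          "deficiencies of laissez-faire ... The answer is self-evident")

# Original pattern: zero or more word characters on either side of a hyphen
loose = re.findall(r'\w*-\w*', sample)
print(loose)
# ['so-called', 'is-', '-or', 'been-', '-the', 'laissez-faire', 'self-evident']

# Stricter variant: word characters are required on both sides of each hyphen
strict = re.findall(r'\w+(?:-\w+)+', sample)
print(strict)
# ['so-called', 'laissez-faire', 'self-evident']
```

The (?:...) group is non-capturing, so findall still returns the whole matched word rather than only the grouped part.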