#!/usr/bin/env python
# coding: utf-8
#
#
# *This notebook contains an excerpt from the [Whirlwind Tour of Python](http://www.oreilly.com/programming/free/a-whirlwind-tour-of-python.csp) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/WhirlwindTourOfPython).*
#
# *The text and code are released under the [CC0](https://github.com/jakevdp/WhirlwindTourOfPython/blob/master/LICENSE) license; see also the companion project, the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook).*
#
#
# # String Manipulation and Regular Expressions
# One place where the Python language really shines is in the manipulation of strings.
# This section will cover some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of *regular expressions*.
# Such string manipulation patterns come up often in data science work, and handling them well is one big perk of Python in this context.
#
# Strings in Python can be defined using either single or double quotations (they are functionally equivalent):
# In[1]:
x = 'a string'
y = "a string"
x == y
# In addition, it is possible to define multi-line strings using a triple-quote syntax:
# In[2]:
multiline = """
one
two
three
"""
# With this, let's take a quick tour of some of Python's string manipulation tools.
# ## Simple String Manipulation in Python
#
# For basic manipulation of strings, Python's built-in string methods can be extremely convenient.
# If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing.
# We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.
# ### Formatting strings: Adjusting case
#
# Python makes it quite easy to adjust the case of a string.
# Here we'll look at the ``upper()``, ``lower()``, ``capitalize()``, ``title()``, and ``swapcase()`` methods, using the following messy string as an example:
# In[3]:
fox = "tHe qUICk bROWn fOx."
# To convert the entire string into upper-case or lower-case, you can use the ``upper()`` or ``lower()`` methods respectively:
# In[4]:
fox.upper()
# In[5]:
fox.lower()
# A common formatting need is to capitalize just the first letter of each word, or perhaps just the first letter of the whole string.
# This can be done with the ``title()`` and ``capitalize()`` methods, respectively:
# In[6]:
fox.title()
# In[7]:
fox.capitalize()
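# As a small sketch of the difference between these two methods (the two-sentence string below is an illustrative assumption, not part of the original example): ``capitalize()`` only upper-cases the first character of the whole string, while ``title()`` capitalizes every word.

```python
# Hypothetical two-sentence input, used only for illustration
sentences = "hello world. goodbye world."

titled = sentences.title()            # every word starts with a capital
capitalized = sentences.capitalize()  # only the first character of the string is upper-cased

print(titled)       # Hello World. Goodbye World.
print(capitalized)  # Hello world. goodbye world.
```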
# The cases can be swapped using the ``swapcase()`` method:
# In[8]:
fox.swapcase()
# ### Formatting strings: Adding and removing spaces
#
# Another common need is to remove spaces (or other characters) from the beginning or end of the string.
# The basic method of removing characters is the ``strip()`` method, which strips whitespace from the beginning and end of the line:
# In[9]:
line = ' this is the content '
line.strip()
# To remove just space to the right or left, use ``rstrip()`` or ``lstrip()`` respectively:
# In[10]:
line.rstrip()
# In[11]:
line.lstrip()
# To remove characters other than spaces, you can pass the desired character to the ``strip()`` method:
# In[12]:
num = "000000000000435"
num.strip('0')
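# One detail that a short sketch may make clearer: the argument to ``strip()`` is treated as a *set* of characters, and stripping happens at both ends unless ``lstrip()`` or ``rstrip()`` is used. The strings below are made-up examples.

```python
padded = "000435000"  # hypothetical value with zeros on both sides

both_ends = padded.strip('0')   # zeros removed from both ends
left_only = padded.lstrip('0')  # only the leading zeros removed

# The argument is a set of characters, not a prefix or suffix to match
mixed = "xyxhelloyx".strip('xy')

print(both_ends)  # 435
print(left_only)  # 435000
print(mixed)      # hello
```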
# The opposite of this operation, adding spaces or other characters, can be accomplished using the ``center()``, ``ljust()``, and ``rjust()`` methods.
#
# For example, we can use the ``center()`` method to center a given string within a given number of spaces:
# In[13]:
line = "this is the content"
line.center(30)
# Similarly, ``ljust()`` and ``rjust()`` will left-justify or right-justify the string within spaces of a given length:
# In[14]:
line.ljust(30)
# In[15]:
line.rjust(30)
# All these methods additionally accept any character which will be used to fill the space.
# For example:
# In[16]:
'435'.rjust(10, '0')
# Because zero-filling is such a common need, Python also provides ``zfill()``, which is a special method to left-pad a string with zeros:
# In[17]:
'435'.zfill(10)
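# The two approaches differ when the string carries a sign. A minimal sketch, using a made-up signed value: ``zfill()`` keeps a leading sign in front of the zeros, whereas ``rjust()`` pads blindly.

```python
value = '-42'  # illustrative signed number stored as a string

with_zfill = value.zfill(6)       # sign stays in front of the padding
with_rjust = value.rjust(6, '0')  # padding is simply added on the left

print(with_zfill)  # -00042
print(with_rjust)  # 000-42
```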
# ### Finding and replacing substrings
#
# If you want to find occurrences of a certain character in a string, the ``find()``/``rfind()``, ``index()``/``rindex()``, and ``replace()`` methods are the best built-in methods.
#
# ``find()`` and ``index()`` are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:
# In[18]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')
# In[19]:
line.index('fox')
# The only difference between ``find()`` and ``index()`` is their behavior when the search string is not found; ``find()`` returns ``-1``, while ``index()`` raises a ``ValueError``:
# In[20]:
line.find('bear')
# In[21]:
line.index('bear')
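# Which of the two to use is largely a matter of how you want to handle a miss. The sketch below wraps ``index()`` in a hypothetical helper, ``locate()`` (not a built-in), that turns the ``ValueError`` into ``None``:

```python
line = 'the quick brown fox jumped over a lazy dog'

def locate(text, sub):
    """Hypothetical helper: return the index of sub in text, or None if absent."""
    try:
        return text.index(sub)
    except ValueError:
        return None

print(line.find('bear'))     # -1
print(locate(line, 'bear'))  # None
print(locate(line, 'fox'))   # 16
```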
# The related ``rfind()`` and ``rindex()`` work similarly, except they search for the first occurrence from the end rather than the beginning of the string:
# In[22]:
line.rfind('a')
# For the special case of checking for a substring at the beginning or end of a string, Python provides the ``startswith()`` and ``endswith()`` methods:
# In[23]:
line.endswith('dog')
# In[24]:
line.startswith('fox')
# To go one step further and replace a given substring with a new string, you can use the ``replace()`` method.
# Here, let's replace ``'brown'`` with ``'red'``:
# In[25]:
line.replace('brown', 'red')
# The ``replace()`` function returns a new string, and will replace all occurrences of the input:
# In[26]:
line.replace('o', '--')
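# If only some occurrences should be replaced, ``replace()`` also accepts an optional count as a third argument. A brief sketch, which also shows that the original string is left untouched:

```python
line = 'the quick brown fox jumped over a lazy dog'

first_only = line.replace('o', '--', 1)  # replace at most one occurrence

print(first_only)  # the quick br--wn fox jumped over a lazy dog
print(line)        # the quick brown fox jumped over a lazy dog
```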
# For a more flexible approach to this ``replace()`` functionality, see the discussion of regular expressions in [Flexible Pattern Matching with Regular Expressions](#Flexible-Pattern-Matching-with-Regular-Expressions).
# ### Splitting and partitioning strings
#
# If you would like to find a substring *and then* split the string based on its location, the ``partition()`` and/or ``split()`` methods are what you're looking for.
# Both will return a sequence of substrings.
#
# The ``partition()`` method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:
# In[27]:
line.partition('fox')
# The ``rpartition()`` method is similar, but searches from the right of the string.
#
# The ``split()`` method is perhaps more useful; it finds *all* instances of the split-point and returns the substrings in between.
# The default is to split on any whitespace, returning a list of the individual words in a string:
# In[28]:
line.split()
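# To compare these methods side by side, here is a small sketch on a made-up ``key=value`` style string; ``split()`` is shown both with an explicit separator and with its optional ``maxsplit`` argument:

```python
record = 'key=value=more'  # hypothetical input, for illustration only

first = record.partition('=')    # split around the first '='
last = record.rpartition('=')    # split around the last '='
pieces = record.split('=')       # split at every '='
limited = record.split('=', 1)   # split at most once

print(first)    # ('key', '=', 'value=more')
print(last)     # ('key=value', '=', 'more')
print(pieces)   # ['key', 'value', 'more']
print(limited)  # ['key', 'value=more']
```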
# A related method is ``splitlines()``, which splits on newline characters.
# Let's do this with a Haiku, popularly attributed to the 17th-century poet Matsuo Bashō:
# In[29]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""
haiku.splitlines()
# Note that if you would like to undo a ``split()``, you can use the ``join()`` method, which returns a string built from a splitpoint and an iterable:
# In[30]:
'--'.join(['1', '2', '3'])
# A common pattern is to use the special character ``"\n"`` (newline) to join together lines that have been previously split, and recover the input:
# In[31]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))
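# As a quick check of this round trip, the sketch below splits the haiku and joins it again. Note the assumption that the text has no trailing newline; ``splitlines()`` would drop one, so the rebuilt string would then differ from the input.

```python
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

lines = haiku.splitlines()   # list of the three lines
rebuilt = "\n".join(lines)   # glue them back together with newlines

print(rebuilt == haiku)  # True
```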
# ## Format Strings
#
# In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats.
# Another use of string methods is to manipulate string *representations* of values of other types.
# Of course, string representations can always be found using the ``str()`` function; for example:
# In[32]:
pi = 3.14159
str(pi)
# For more complicated formats, you might be tempted to use string arithmetic as outlined in [Basic Python Semantics: Operators](04-Semantics-Operators.ipynb):
# In[33]:
"The value of pi is " + str(pi)
# A more flexible way to do this is to use *format strings*, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted.
# Here is a basic example:
# In[34]:
"The value of pi is {}".format(pi)
# Inside the ``{}`` marker you can also include information on exactly *what* you would like to appear there.
# If you include a number, it will refer to the index of the argument to insert:
# In[35]:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')
# If you include a string, it will refer to the key of any keyword argument:
# In[36]:
"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')
# Finally, for numerical inputs, you can include format codes which control how the value is converted to a string.
# For example, to print a number as a floating point with three digits after the decimal point, you can use the following:
# In[37]:
"pi = {0:.3f}".format(pi)
# As before, here the "``0``" refers to the index of the value to be inserted.
# The "``:``" marks that format codes will follow.
# The "``.3f``" encodes the desired precision: three digits beyond the decimal point, floating-point format.
#
# This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available.
# For more information on the syntax of these format strings, see the [Format Specification](https://docs.python.org/3/library/string.html#formatspec) section of Python's online documentation.
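# To give a slightly wider taste of the format codes, here is a short sketch with a few more specifications. The sample values are arbitrary; the last line uses an f-string (available in Python 3.6 and later), which accepts the same format codes inline.

```python
pi = 3.14159

right = "{:>10}".format("abc")       # right-align in a field of width 10
grouped = "{:,}".format(1234567)     # thousands separator
padded = "{:08.3f}".format(pi)       # zero-pad to width 8, three decimals
percent = "{:.1%}".format(0.256)     # percentage with one decimal
inline = f"pi = {pi:.3f}"            # same format code inside an f-string

print(repr(right))
print(grouped)  # 1,234,567
print(padded)   # 0003.142
print(percent)  # 25.6%
print(inline)   # pi = 3.142
```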
# ## Flexible Pattern Matching with Regular Expressions
#
# The methods of Python's ``str`` type give you a powerful set of tools for formatting, splitting, and manipulating string data.
# But even more powerful tools are available in Python's built-in *regular expression* module.
# Regular expressions are a huge topic; there are entire books written on the subject (including Jeffrey E.F. Friedl’s [*Mastering Regular Expressions, 3rd Edition*](http://shop.oreilly.com/product/9780596528126.do)), so it will be hard to do it justice within just a single subsection.
#
# My goal here is to give you an idea of the types of problems that might be addressed using regular expressions, as well as a basic idea of how to use them in Python.
# I'll suggest some references for learning more in [Further Resources on Regular Expressions](#Further-Resources-on-Regular-Expressions).
#
# Fundamentally, regular expressions are a means of *flexible pattern matching* in strings.
# If you frequently use the command-line, you are probably familiar with this type of flexible matching with the "``*``" character, which acts as a wildcard.
# For example, we can list all the IPython notebooks (i.e., files with extension *.ipynb*) with "Python" in their filename by using the "``*``" wildcard to match any characters in between:
# In[38]:
get_ipython().system('ls *Python*.ipynb')
# Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes.
# The Python interface to regular expressions is contained in the built-in ``re`` module; as a simple example, let's use it to duplicate the functionality of the string ``split()`` method:
# In[39]:
import re
regex = re.compile(r'\s+')  # raw string, so the backslash reaches the regex engine unchanged
regex.split(line)
# Here we've first *compiled* a regular expression, then used it to *split* a string.
# Just as Python's ``split()`` method returns a list of all substrings between whitespace, the regular expression ``split()`` method returns a list of all substrings between matches to the input pattern.
#
# In this case, the input is ``"\s+"``: "``\s``" is a special character that matches any whitespace (space, tab, newline, etc.), and the "``+``" is a character that indicates *one or more* of the entity preceding it.
# Thus, the regular expression matches any substring consisting of one or more whitespace characters.
#
# The ``split()`` method here is basically a convenience routine built upon this *pattern matching* behavior; more fundamental is the ``match()`` method, which will tell you whether the beginning of a string matches the pattern:
# In[40]:
for s in [" ", "abc ", " abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")
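# Because ``match()`` only looks at the beginning of the string, it can help to see it next to ``search()`` (a match anywhere) and ``fullmatch()`` (the whole string must match). A minimal sketch using the same whitespace pattern:

```python
import re

regex = re.compile(r'\s+')

results = []
for s in [" ", "abc ", " abc"]:
    results.append((s,
                    bool(regex.match(s)),       # whitespace at the start?
                    bool(regex.search(s)),      # whitespace anywhere?
                    bool(regex.fullmatch(s))))  # nothing but whitespace?

for row in results:
    print(row)
```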
# Like ``split()``, there are similar convenience routines to find the first match (like ``str.index()`` or ``str.find()``) or to find and replace (like ``str.replace()``).
# We'll again use the line from before:
# In[41]:
line = 'the quick brown fox jumped over a lazy dog'
# With this, we can see that the ``regex.search()`` method operates a lot like ``str.index()`` or ``str.find()``:
# In[42]:
line.index('fox')
# In[43]:
regex = re.compile('fox')
match = regex.search(line)
match.start()
# Similarly, the ``regex.sub()`` method operates much like ``str.replace()``:
# In[44]:
line.replace('fox', 'BEAR')
# In[45]:
regex.sub('BEAR', line)
# With a bit of thought, other native string operations can also be cast as regular expressions.
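# For instance, here is a rough sketch of how ``startswith()``, ``endswith()``, and ``split()`` might be expressed with the ``re`` module. These are approximate equivalents chosen for illustration, not the only way to write them.

```python
import re

line = 'the quick brown fox jumped over a lazy dog'

starts = re.match(r'the', line) is not None   # much like line.startswith('the')
ends = re.search(r'dog$', line) is not None   # much like line.endswith('dog')
words = re.findall(r'\S+', line)              # much like line.split()

print(starts, ends)
print(words == line.split())
```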
# ### A more sophisticated example
#
# But, you might ask, why would you want to use the more complicated and verbose syntax of regular expressions rather than the more intuitive and simple string methods?
# The advantage is that regular expressions offer *far* more flexibility.
#
# Here we'll consider a more complicated example: the common task of matching email addresses.
# I'll start by simply writing a (somewhat indecipherable) regular expression, and then walk through what is going on.
# Here it goes:
# In[46]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')
# Using this, if we're given a line from a document, we can quickly extract things that look like email addresses:
# In[47]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)
# (Note that these addresses are entirely made up; there are probably better ways to get in touch with Guido).
#
# We can do further operations, like replacing these email addresses with another string, perhaps to hide addresses in the output:
# In[48]:
email.sub('--@--.--', text)
# Finally, note that if you really want to match *any* email address, the preceding regular expression is far too simple.
# For example, it only allows addresses made of alphanumeric characters that end in a three-letter, lower-case suffix such as ``.org`` or ``.com``.
# So, for example, the period used here means that we only find part of the address:
# In[49]:
email.findall('barack.obama@whitehouse.gov')
# This goes to show how unforgiving regular expressions can be if you're not careful!
# If you search around online, you can find some suggestions for regular expressions that will match *all* valid emails, but beware: they are much more involved than the simple expression used here!
# ### Basics of regular expression syntax
#
# The syntax of regular expressions is much too large a topic for this short section.
# Still, a bit of familiarity can go a long way: I will walk through some of the basic constructs here, and then list some more complete resources from which you can learn more.
# My hope is that the following quick primer will enable you to use these resources effectively.
# #### Simple strings are matched directly
#
# If you build a regular expression on a simple string of characters or digits, it will match that exact string:
# In[50]:
regex = re.compile('ion')
regex.findall('Great Expectations')
# #### Some characters have special meanings
#
# While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:
# ```
# . ^ $ * + ? { } [ ] \ | ( )
# ```
# We will discuss the meaning of some of these momentarily.
# In the meantime, you should know that if you'd like to match any of these characters directly, you can *escape* them with a back-slash:
# In[51]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")
# The ``r`` preface in ``r'\$'`` indicates a *raw string*; in standard Python strings, the backslash is used to indicate special characters.
# For example, a tab is indicated by ``"\t"``:
# In[52]:
print('a\tb\tc')
# Such substitutions are not made in a raw string:
# In[53]:
print(r'a\tb\tc')
# For this reason, whenever you use backslashes in a regular expression, it is good practice to use a raw string.
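# A quick way to convince yourself of the difference is to compare lengths. In this small sketch, the ordinary string holds a single tab character, while the raw string holds a backslash followed by the letter ``t``:

```python
plain = '\t'  # one character: a tab
raw = r'\t'   # two characters: a backslash and the letter t

print(len(plain), len(raw))  # 1 2

# Without the r prefix, the backslash itself has to be escaped
same_pattern = (r'\$' == '\\$')
print(same_pattern)  # True
```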
# #### Special characters can match character groups
#
# Just as the ``"\"`` character within regular expressions can escape special characters, turning them into normal characters, it can also be used to give normal characters special meaning.
# These special characters match specified groups of characters, and we've seen them before.
# In the email address regexp from before, we used the character ``"\w"``, which is a special marker matching *any alphanumeric character*. Similarly, in the simple ``split()`` example, we also saw ``"\s"``, a special marker indicating *any whitespace character*.
#
# Putting these together, we can create a regular expression that will match *any two letters/digits with whitespace between them*:
# In[54]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')
# This example begins to hint at the power and flexibility of regular expressions.
# The following table lists a few of these characters that are commonly useful:
#
# | Character | Description                 | Character | Description                     |
# |-----------|-----------------------------|-----------|---------------------------------|
# | ``"\d"``  | Match any digit             | ``"\D"``  | Match any non-digit             |
# | ``"\s"``  | Match any whitespace        | ``"\S"``  | Match any non-whitespace        |
# | ``"\w"``  | Match any alphanumeric char | ``"\W"``  | Match any non-alphanumeric char |
#
# This is *not* a comprehensive list or description; for more details, see Python's [regular expression syntax documentation](https://docs.python.org/3/library/re.html#re-syntax).
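# As a brief sketch of these markers in use (the sample sentence is made up): note that ``"\w"`` also matches the underscore, and that splitting on ``"\W+"`` can leave an empty string at the end when the text finishes with punctuation.

```python
import re

sample = 'Order 66 shipped, 3 left!'  # hypothetical sample text

digits = re.findall(r'\d', sample)    # each individual digit
tokens = re.split(r'\W+', sample)     # split on runs of non-alphanumeric characters
chunks = re.findall(r'\S+', sample)   # runs of non-whitespace characters
ident = re.findall(r'\w+', 'snake_case')  # the underscore counts as a "\w" character

print(digits)  # ['6', '6', '3']
print(tokens)  # ['Order', '66', 'shipped', '3', 'left', '']
print(chunks)  # ['Order', '66', 'shipped,', '3', 'left!']
print(ident)   # ['snake_case']
```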
# #### Square brackets match custom character groups
#
# If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in.
# For example, the following will match any lower-case vowel:
# In[55]:
regex = re.compile('[aeiou]')
regex.split('consequential')
# Similarly, you can use a dash to specify a range: for example, ``"[a-z]"`` will match any lower-case letter, and ``"[1-3]"`` will match any of ``"1"``, ``"2"``, or ``"3"``.
# For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:
# In[56]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')
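# Ranges can be combined inside one pair of brackets, and a leading ``^`` inside the brackets negates the set. The sketch below extends the sample string with one lower-case code (an addition for illustration only):

```python
import re

codes = '1043879, G2, H6, g7'  # the sample from above plus one made-up lower-case code

upper_only = re.findall(r'[A-Z][0-9]', codes)      # capital letter, then digit
either_case = re.findall(r'[A-Za-z][0-9]', codes)  # two ranges in one set
non_vowels = re.findall(r'[^aeiou]', 'queue')      # anything except a lower-case vowel

print(upper_only)   # ['G2', 'H6']
print(either_case)  # ['G2', 'H6', 'g7']
print(non_vowels)   # ['q']
```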
# #### Wildcards match repeated characters
#
# If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, ``"\w\w\w"``.
# Because this is such a common need, there is a specific syntax to match repetitions – curly braces with a number:
# In[57]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')
# There are also markers available to match any number of repetitions – for example, the ``"+"`` character will match *one or more* repetitions of what precedes it:
# In[58]:
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')
# The following is a table of the repetition markers available for use in regular expressions:
#
# | Character | Description | Example |
# |-----------|-------------|---------|
# | ``?`` | Match zero or one repetitions of preceding | ``"ab?"`` matches ``"a"`` or ``"ab"`` |
# | ``*`` | Match zero or more repetitions of preceding | ``"ab*"`` matches ``"a"``, ``"ab"``, ``"abb"``, ``"abbb"``... |
# | ``+`` | Match one or more repetitions of preceding | ``"ab+"`` matches ``"ab"``, ``"abb"``, ``"abbb"``... but not ``"a"`` |
# | ``{n}`` | Match ``n`` repetitions of preceding | ``"ab{2}"`` matches ``"abb"`` |
# | ``{m,n}`` | Match between ``m`` and ``n`` repetitions of preceding | ``"ab{2,3}"`` matches ``"abb"`` or ``"abbb"`` |
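# The rows of this table can be checked directly. The sketch below uses a hypothetical helper, ``fully_matches()``, which keeps only the candidate strings that a pattern matches in full via ``re.fullmatch()``:

```python
import re

def fully_matches(pattern, candidates):
    """Hypothetical helper: return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ['a', 'ab', 'abb', 'abbb', 'abbbb']

print(fully_matches(r'ab?', candidates))      # ['a', 'ab']
print(fully_matches(r'ab*', candidates))      # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(fully_matches(r'ab+', candidates))      # ['ab', 'abb', 'abbb', 'abbbb']
print(fully_matches(r'ab{2}', candidates))    # ['abb']
print(fully_matches(r'ab{2,3}', candidates))  # ['abb', 'abbb']
```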
# With these basics in mind, let's return to our email address matcher:
# In[59]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')
# We can now understand what this means: we want one or more alphanumeric characters (``"\w+"``) followed by the *at sign* (``"@"``), followed by one or more alphanumeric characters (``"\w+"``), followed by a period (``"\."`` – note the need for a backslash escape), followed by exactly three lower-case letters.
#
# If we want to now modify this so that the Obama email address matches, we can do so using the square-bracket notation:
# In[60]:
email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')
# We have changed ``"\w+"`` to ``"[\w.]+"``, so we will match any alphanumeric character *or* a period.
# With this more flexible expression, we can match a wider range of email addresses (though still not all – can you identify other shortcomings of this expression?).
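# To make a few of those shortcomings concrete, here is a small sketch that runs the same pattern against three made-up addresses: one with a two-letter suffix, one with a hyphen in the user name, and one with a subdomain.

```python
import re

email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

short_suffix = email2.findall('someone@example.io')      # suffix has only two letters
hyphenated = email2.findall('first-last@example.com')    # hyphen is not in the character set
subdomain = email2.findall('user@mail.example.com')      # only one dot is allowed after the @

print(short_suffix)  # []
print(hyphenated)    # ['last@example.com']
print(subdomain)     # ['user@mail.exa']
```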
# #### Parentheses indicate *groups* to extract
#
# For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to *group* the results:
# In[61]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')
# In[62]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)
# As we see, this grouping actually extracts a list of the sub-components of the email address.
#
# We can go a bit further and *name* the extracted components using the ``"(?P<name> )"`` syntax, in which case the groups can be extracted as a Python dictionary:
# In[63]:
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@python.org')
match.groupdict()
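# Named groups can also be read one at a time with ``match.group()``, and ``finditer()`` yields a match object for every address in a longer text. A minimal sketch, in which the group names ``user``, ``domain``, and ``suffix`` are illustrative choices:

```python
import re

# Group names (user, domain, suffix) are illustrative choices
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')

text = "To email Guido, try guido@python.org or the older address guido@google.com."

parts = []
for m in email4.finditer(text):
    # Look up each named component of the current match
    parts.append((m.group('user'), m.group('domain'), m.group('suffix')))

print(parts)  # [('guido', 'python', 'org'), ('guido', 'google', 'com')]
```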
# Combining these ideas (as well as some of the powerful regexp syntax that we have not covered here) allows you to flexibly and quickly extract information from strings in Python.
# ### Further Resources on Regular Expressions
#
# The above discussion is just a quick (and far from complete) treatment of this large topic.
# If you'd like to learn more, I recommend the following resources:
#
# - [Python's ``re`` package Documentation](https://docs.python.org/3/library/re.html): I find that I promptly forget how to use regular expressions just about every time I use them. Now that I have the basics down, I have found this page to be an incredibly valuable resource to recall what each specific character or sequence means within a regular expression.
# - [Python's official regular expression HOWTO](https://docs.python.org/3/howto/regex.html): a more narrative approach to regular expressions in Python.
# - [Mastering Regular Expressions (O'Reilly, 2006)](http://shop.oreilly.com/product/9780596528126.do) is a 500+ page book on the subject. If you want a really complete treatment of this topic, this is the resource for you.
#
# For some examples of string manipulation and regular expressions in action at a larger scale, see [Pandas: Labeled Column-oriented Data](15-Preview-of-Data-Science-Tools.ipynb#Pandas:-Labeled-Column-oriented-Data), where we look at applying these sorts of expressions across *tables* of string data within the Pandas package.