Regex basics

{tip}
The best ways to learn and use regex are:
1. [https://regexone.com/](https://regexone.com) is so good, I'm loath to add anything else to this page.
2. The [official Python documentation](https://docs.python.org/3/library/re.html) and its [HOWTO page](https://docs.python.org/3/howto/regex.html)
3. Google + Stack Overflow. "Has someone done something similar? Yes? Great!"

Imagine you have a webpage or document which includes (buried in the text) a bunch of phone numbers. How can you collect them all?

A: Look for all the instances of this pattern: (###) ###-####.

Your eyeballs can easily do that, but once the job involves enough numbers, it makes sense to let your computer do it for you.

Regex is how you tell a computer to search for any pattern within a string. Common targets:

  • Phone numbers
  • Emails (regex is why people don't spell out their emails "correctly" online)
  • Addresses
  • Certain words/topics (like assignment 5!)

Regex in Python

Regex is a skill that works in all programming languages, so this lesson is portable - you can use regex in R or whatever your language of choice is.

But obviously, we're going to use Python. Run import re to load the built-in regex module.

Common functions:

The full list of functions is here.

  • re.search(pattern, string, flags=0) looks for the first instance of the regex pattern within the string and returns a "match object" if one is found. It returns None if there is no match.
In [1]:
import re
re.search("c", "abcdef")
Out[1]:
<re.Match object; span=(2, 3), match='c'>
  • re.findall(pattern,string) returns a list of matching strings, and is how you can count the number of matches
In [2]:
text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)
Out[2]:
['carefully', 'quickly']
In [3]:
len(re.findall(r"\w+ly", text))
Out[3]:
2
  • re.finditer(pattern, string) is similar to findall but gives you an iterator of match objects, which is nice if you want more info about the matches than just the string
In [4]:
# I want to find all of the adverbs AND THEIR POSITIONS in some text
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
  • pattern_to_use = re.compile(pattern) creates a compiled pattern that you can pass as the pattern argument to search, finditer, and findall.
  • .group(#): if your search or match has parenthesized subgroups, .group lets you access each one by number.
In [5]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")  
m.group(0)       # The entire match
Out[5]:
'Isaac Newton'
In [6]:
m.group(1)       # The first parenthesized subgroup.
Out[6]:
'Isaac'
In [7]:
m.group(2)       # The second parenthesized subgroup.
Out[7]:
'Newton'
In [8]:
m.group(1, 2)    # Multiple arguments give us a tuple.
Out[8]:
('Isaac', 'Newton')
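
To round out the list above, here is a minimal sketch of re.compile together with the None check that search calls for. The names adverb_pattern, sentences, and first_adverbs are made up for illustration.

```python
import re

# Compile the adverb pattern once, then reuse it on several strings.
adverb_pattern = re.compile(r"\w+ly")

sentences = [
    "He was carefully disguised but captured quickly by police.",
    "No adverbs here.",
]

first_adverbs = []
for sentence in sentences:
    # Equivalent to re.search(adverb_pattern, sentence)
    m = adverb_pattern.search(sentence)
    if m is None:
        # search returns None when nothing matches, so check before calling .group()
        first_adverbs.append(None)
    else:
        first_adverbs.append(m.group(0))

print(first_adverbs)   # ['carefully', None]
```

Calling .group() on None raises an AttributeError, which is why the check comes first.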

Special characters to build your patterns

Most of this is taken directly from the official documentation.

| Char | Matches |
| --- | --- |
| `.` | any character except a newline |
| `^` | start of the string. `^[a-z]+` matches the "hi" in "hi there" but not "there" |
| `$` | end of the string or just before the newline at the end of the string. `foo` matches both 'foo' and 'foobar', but `foo$` matches only 'foo' |
| `*` | match 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*` will match "a", "ab", or "abbbbbbb" |
| `+` | match 1 or more repetitions of the preceding RE, as many repetitions as are possible. `ab+` will match "ab", or "abbbbbbb" but not "a" |
| `?` | match 0 or 1 repetitions of the preceding RE. `ab?` will match "a" and "ab" |
| `{m}` | match m repetitions of the preceding RE. `ab{3}` will match "abbb" but not "abb" |
| `{m,n}` | match m to n repetitions of the preceding RE. `ab{3,5}` will match "abbb" and "abbbbb" but not "abb" |
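
The examples in the table can be checked directly. This sketch assumes re.fullmatch (not covered on this page; it only succeeds when the entire string matches the pattern) and a hypothetical helper called matches_whole.

```python
import re

def matches_whole(pattern, string):
    """Hypothetical helper: True if the ENTIRE string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

# * : zero or more b's
print(matches_whole(r"ab*", "a"), matches_whole(r"ab*", "abbbbbbb"))          # True True
# + : one or more b's
print(matches_whole(r"ab+", "a"), matches_whole(r"ab+", "ab"))                # False True
# ? : zero or one b
print(matches_whole(r"ab?", "ab"), matches_whole(r"ab?", "abb"))              # True False
# {m} : exactly three b's
print(matches_whole(r"ab{3}", "abbb"), matches_whole(r"ab{3}", "abb"))        # True False
# {m,n} : three to five b's
print(matches_whole(r"ab{3,5}", "abbbbb"), matches_whole(r"ab{3,5}", "abb"))  # True False

# ^ and $ pin the pattern to the start or end of the string
start_word = re.search(r"^[a-z]+", "hi there").group(0)
end_foo = re.search(r"foo$", "foobar")
print(start_word)   # hi
print(end_foo)      # None
```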

Note: Do you want the largest match or the smallest?

  • *, +, ?, and {m,n} are GREEDY: they match as much text as possible. So if you search ab+ against "abbb" it will match the full string "abbb". But sometimes you want the smallest match instead.
  • If you add ? after any of those, it will perform the match in a minimal way: using ab+? on string "abbbbb" will just return "ab". Use ab*? instead and you'll get "a".
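
A quick sketch of the difference, where a trailing ? is what makes a quantifier non-greedy. The quote string is a made-up example.

```python
import re

text = "abbbbb"
greedy = re.search(r"ab+", text).group(0)       # + grabs as many b's as it can
lazy = re.search(r"ab+?", text).group(0)        # +? stops after the first b
lazy_star = re.search(r"ab*?", text).group(0)   # *? is satisfied with zero b's
print(greedy, lazy, lazy_star)                  # abbbbb ab a

# Where this matters in practice: pulling out each quoted phrase separately
quote = 'She said "hello" and then "goodbye".'
greedy_quotes = re.findall(r'".*"', quote)
lazy_quotes = re.findall(r'".*?"', quote)
print(greedy_quotes)   # ['"hello" and then "goodbye"']
print(lazy_quotes)     # ['"hello"', '"goodbye"']
```

The greedy version runs from the first quotation mark to the last one, which is rarely what you want when a string holds several quoted pieces.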
| Char | Matches |
| --- | --- |
| `\` | 1. Escapes special characters: `\*` will actually search for an asterisk. 2. Or signals a "special sequence". |
| `[]` | Indicates a set of characters. In a set: `[amk]` will match 'a', 'm', or 'k'. Common ranges: `[a-z]`, `[A-Z]`, `[0-9]`. You can combine ranges: `[A-Za-z0-9]`. Special characters lose their special meaning inside sets. For example, `[(+*)]` will match any of the literal characters (, +, *, or ). |
| `(...)` | Makes a group. POWERFUL and necessary in most uses of regex. If you actually want to match parentheses, use a backslash: `\(` |
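
Here is a small sketch of sets, escapes, and groups in action. The input strings are invented for illustration.

```python
import re

# Combined ranges: + repeats the whole set, so we get runs of letters/digits
words = re.findall(r"[A-Za-z0-9]+", "Room 101, Hall B")
print(words)      # ['Room', '101', 'Hall', 'B']

# Inside a set, special characters are just literal characters
symbols = re.findall(r"[(+*)]", "(1+2)*3")
print(symbols)    # ['(', '+', ')', '*']

# Outside a set, escape a special character with a backslash
star = re.search(r"\*", "2*3").group(0)
print(star)       # *

# Escaped parentheses match literally; unescaped ones create a group
m = re.search(r"\((\w+)\)", "see (appendix) for details")
print(m.group(0))   # (appendix)
print(m.group(1))   # appendix
```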

There are SO MANY more special characters. If you can imagine a "feature" in the pattern of a string, there is probably a special character for it: \b matches word boundaries, \d matches digits, \s matches whitespace, and more.
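
With \d in hand, the phone number question from the top of the page can be sketched as follows. The sample text and its 555 numbers are made up, and the pattern only covers the exact (###) ###-#### layout, so treat it as a starting point rather than a complete phone number matcher.

```python
import re

# Made-up sample text; the 555 numbers are placeholders
page = "Call (555) 123-4567 before 5pm, or (555) 987-6543 after. Order #12345 shipped."

# \( and \) are literal parentheses, \d{3} is exactly three digits
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}"
phones = re.findall(phone_pattern, page)
print(phones)   # ['(555) 123-4567', '(555) 987-6543']

# Adding groups makes findall return the captured pieces of each match as a tuple
parts = re.findall(r"\((\d{3})\) (\d{3})-(\d{4})", page)
print(parts)    # [('555', '123', '4567'), ('555', '987', '6543')]
```

Note that the order number is ignored because it does not fit the pattern, even though it is a run of digits.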

{tip}
Most "regex" in practice is just Googling for someone who has done a similar thing.

A few pointers:

  • You mostly benefit from re.compile when you reuse the same pattern many times, because you can compile it once and call its methods directly. Python also caches recently used patterns, so if you only have a few patterns, don't bother.
  • re.match is similar to re.search, but only matches at the beginning of the string. I almost never use match.
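
A short sketch of the match versus search difference, reusing the "abcdef" string from earlier on this page:

```python
import re

text = "abcdef"

found_anywhere = re.search("c", text)   # scans the whole string
found_at_start = re.match("c", text)    # only tries position 0
starts_with_a = re.match("a", text)

print(found_anywhere)   # <re.Match object; span=(2, 3), match='c'>
print(found_at_start)   # None, because "c" is not at the start
print(starts_with_a)    # <re.Match object; span=(0, 1), match='a'>

# search with a leading ^ behaves like match here
anchored = re.search("^c", text)
print(anchored)         # None
```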

```{admonition} Raw string notation

You'll often see people put an "r" in front of the regex pattern. For example: re.search(r"c", "abcdef").


**Raw string notation (`r"text"`) keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.**

```PYTHON
# These lines are functionally identical
>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

# so are these:
>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>