
Regular Expressions

In our introduction to regular expressions we covered the basics. We saw what the simplest regular expression looks like and learnt how to use regular expressions in Python with the search() and match() functions of the re module. You should be familiar with formulating and using character classes, including the predefined character classes such as \d, \D, \s and \S. You should also know how to match the beginning and the end of a string, and how the question mark makes an item optional. Finally, we introduced the quantifiers, which repeat characters and groups either arbitrarily often or within a given range.

You should also be familiar with grouping and backreferences. Furthermore, we explained the match objects of the re module, the information they contain, and how to retrieve this information with the methods span(), start(), end() and group().

The introduction ended with a comprehensive example in Python.

In this chapter we continue our explanation of the syntax of regular expressions. We will also explain further functions of the Python module re.

Finding all Matched Substrings

The Python module re provides another very useful function, findall, which returns all the matches of a pattern at once:
re.findall(pattern, string[, flags])
findall returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order in which they are found.
>>> import re
>>> t = "A fat cat doesn't eat oat but a rat eats bats."
>>> mo = re.findall("[force]at", t)
>>> print(mo)
['fat', 'cat', 'eat', 'oat', 'rat', 'eat']
If one or more groups are present in the pattern, findall returns a list of groups; this will be a list of tuples if the pattern has more than one group.
>>> import re
>>> items = re.findall("[0-9]+.*: .*", "Customer number: 232454, Date: February 12, 2011")
>>> print(items)
['232454, Date: February 12, 2011']
>>> items = re.findall("([0-9]+).*: (.*)", "Customer number: 232454, Date: February 12, 2011")
>>> print(items)
[('232454', 'February 12, 2011')]
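The meaning of "non-overlapping" and the effect of groups on the result can be checked with a small sketch. The sample strings and variable names are made up for illustration and do not come from the examples above.

```python
import re

# Non-overlapping: after a match, scanning resumes behind it,
# so "aaaa" contains only two matches of "aa", not three.
pairs = re.findall(r"aa", "aaaa")
print(pairs)    # ['aa', 'aa']

sample = "x=1, y=22, z=333"

# No group in the pattern: the whole match is returned.
whole = re.findall(r"[a-z]+=[0-9]+", sample)
print(whole)    # ['x=1', 'y=22', 'z=333']

# One group: only the text of that group is returned.
values = re.findall(r"[a-z]+=([0-9]+)", sample)
print(values)   # ['1', '22', '333']

# Two groups: a list of tuples, one tuple per match.
both = re.findall(r"([a-z]+)=([0-9]+)", sample)
print(both)     # [('x', '1'), ('y', '22'), ('z', '333')]
```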


In our introduction to regular expressions we introduced character classes. Character classes offer a choice out of a set of characters. Sometimes we need a choice between several regular expressions. This is a logical "or", and that's why the symbol for this construct is the "|" symbol.
In the following example, we check whether one of the cities London, Paris, Zurich or Strasbourg appears in a string, preceded by the word "destination":
>>> import re
>>> str = "The destination is London!"
>>> mo = re.search(r"destination.*(London|Paris|Zurich|Strasbourg)", str)
>>> if mo: print(mo.group())
...
destination is London
If you think the previous example is too artificial, here is another one. Let's assume that you want to filter your email. You want to find all the correspondence (conversations) between you and Guido van Rossum, the creator and designer of Python. The following regular expression is helpful for this purpose:
r"(^To:|^From:) (Guido|van Rossum)"
This expression matches all lines starting with either 'To:' or 'From:', followed by a space and then either by the first name 'Guido' or the surname 'van Rossum'.
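A minimal sketch of how this expression could be applied line by line follows. The header lines and addresses are invented for the example, and applying re.search() to each line separately is just one possible way to do the filtering.

```python
import re

pattern = r"(^To:|^From:) (Guido|van Rossum)"

# Hypothetical mail header lines, invented for this example.
headers = [
    "From: Guido van Rossum <guido@example.org>",
    "To: someone.else@example.org",
    "Subject: From: Guido (quoted in the subject)",
    "To: Guido <guido@example.org>",
    "From: van Rossum, Guido",
]

# ^ anchors the alternatives to the start of each line, so the
# "Subject:" line is skipped although it contains "From: Guido".
matching = [line for line in headers if re.search(pattern, line)]

for line in matching:
    print(line)
```

Only the first, fourth and fifth line are printed.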

Compiling Regular Expressions

If you want to use the same regex more than once in a script, it can be a good idea to compile it into a regular expression object.
The general syntax:
re.compile(pattern[, flags])
compile returns a regex object, which can be used later for searching and replacing. The expression's behaviour can be modified by specifying a flag value.

Abbreviation  Full name      Description
re.I          re.IGNORECASE  Makes the regular expression case-insensitive.
re.L          re.LOCALE      Makes the behaviour of some special sequences like \w, \W, \b, \s, \S dependent on the current locale, i.e. the user's language, country and so on.
re.M          re.MULTILINE   ^ and $ match at the beginning and at the end of each line, and not just at the beginning and the end of the string.
re.S          re.DOTALL      The dot "." matches every character, including the newline.
re.U          re.UNICODE     Makes \w, \W, \b, \B, \d, \D, \s, \S dependent on Unicode character properties.
re.X          re.VERBOSE     Allows "verbose regular expressions": whitespace in the pattern is ignored, so spaces, tabs and line breaks are not matched as such. To match a space in a verbose regular expression, escape it with a backslash or put it in a character class. A "#" starts a comment that runs to the end of the line, unless the "#" is in a character class or preceded by an unescaped backslash.
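The effect of some of these flags can be tried out with a short sketch. The sample text and all variable names are assumptions made for this illustration; several flags are combined with the "|" operator.

```python
import re

text = "Python\npython rocks\nPYTHON"

# re.I: upper and lower case are treated alike.
any_case = re.findall(r"python", text, re.I)
print(any_case)       # ['Python', 'python', 'PYTHON']

# Without re.M, ^ matches only at the very start of the string.
string_start = re.findall(r"^python", text, re.I)
print(string_start)   # ['Python']

# With re.M, ^ matches at the start of every line.
line_starts = re.findall(r"^python", text, re.I | re.M)
print(line_starts)    # ['Python', 'python', 'PYTHON']

# re.S: the dot also matches the newline between the first two lines.
with_dotall = re.search(r"Python.python", text, re.S) is not None
without_dotall = re.search(r"Python.python", text) is not None
print(with_dotall, without_dotall)   # True False

# re.X: whitespace and comments inside the pattern are ignored.
number = re.compile(r"""
    \d+    # integer part
    \.     # decimal point
    \d+    # fractional part
""", re.X)
decimals = number.findall("pi is 3.14, e is 2.718")
print(decimals)       # ['3.14', '2.718']
```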

Compiled regular expression objects usually do not save much time, because Python internally compiles and caches regexes whenever you use them with re.search() or re.match(). The only extra time a non-compiled regex takes is the time needed to check the cache, which is a dictionary key lookup.

A good reason to use them is to separate the definition of a regex from its use.
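One way to read this advice is sketched below: the patterns are defined and named once at the top, and the functions further down only use them. The names DATE_RE, WORD_RE, find_dates and count_words are hypothetical and chosen just for this example.

```python
import re

# Definitions: every pattern gets a descriptive name in one place.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
WORD_RE = re.compile(r"[A-Za-z]+")


def find_dates(text):
    """Return all substrings of text that look like YYYY-MM-DD."""
    return DATE_RE.findall(text)


def count_words(text):
    """Count the runs of ASCII letters in text."""
    return len(WORD_RE.findall(text))


# Use: the calling code does not need to know the pattern details.
sample = "Released 2011-02-12, updated 2011-03-01"
dates = find_dates(sample)
words = count_words(sample)
print(dates)   # ['2011-02-12', '2011-03-01']
print(words)   # 2
```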


We have already introduced a regular expression for matching a superset of UK postcodes in our introductory chapter:
r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}"

The following interactive session shows how to use the compile functionality of the re module with this regular expression. The regular expression "regex" is compiled with re.compile(regex) and the compiled object is saved in compiled_re. We then call the method search() of the object compiled_re:
>>> import re
>>> regex = r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}"
>>> address = "BBC News Centre, London, W12 7RJ"
>>> compiled_re = re.compile(regex)
>>> res = compiled_re.search(address)
>>> print(res)
<_sre.SRE_Match object at 0x7fc9f688f6b0>
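The printed match object does not show what was matched. As a small sketch, the methods group() and span() from the introduction can be used to look inside it. The leading letter class is written here as [A-Z], so that only capital letters are accepted.

```python
import re

regex = r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}"
address = "BBC News Centre, London, W12 7RJ"

compiled_re = re.compile(regex)
res = compiled_re.search(address)

# search() returns None if nothing matches, so test the result first.
if res:
    postcode = res.group()   # the matched substring
    position = res.span()    # (start, end) indices within address
    print(postcode)          # W12 7RJ
    print(position)          # (25, 32)
```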

Splitting a String With or Without Regular Expressions

There is a string method split, which can be used to split a string into a list of substrings.

str.split([sep[, maxsplit]])

As you can see, the method split has two optional parameters. If sep is not given (or is None), the string is separated into substrings using whitespace as the delimiter, i.e. every substring consisting purely of whitespace characters is used as a delimiter.


We demonstrate this behaviour with a famous quotation by Abraham Lincoln:
>>> law_courses = "Let reverence for the laws be breathed by every American mother to the lisping babe that prattles on her lap. Let it be taught in schools, in seminaries, and in colleges. Let it be written in primers, spelling books, and in almanacs. Let it be preached from the pulpit, proclaimed in legislative halls, and enforced in the courts of justice. And, in short, let it become the political religion of the nation."
>>> law_courses.split()
['Let', 'reverence', 'for', 'the', 'laws', 'be', 'breathed', 'by', 'every', 'American', 'mother', 'to', 'the', 'lisping', 'babe', 'that', 'prattles', 'on', 'her', 'lap.', 'Let', 'it', 'be', 'taught', 'in', 'schools,', 'in', 'seminaries,', 'and', 'in', 'colleges.', 'Let', 'it', 'be', 'written', 'in', 'primers,', 'spelling', 'books,', 'and', 'in', 'almanacs.', 'Let', 'it', 'be', 'preached', 'from', 'the', 'pulpit,', 'proclaimed', 'in', 'legislative', 'halls,', 'and', 'enforced', 'in', 'the', 'courts', 'of', 'justice.', 'And,', 'in', 'short,', 'let', 'it', 'become', 'the', 'political', 'religion', 'of', 'the', 'nation.']

Now we look at a string that could stem from an Excel or an OpenOffice Calc file. We have seen in our previous example that split uses whitespace as the default separator. We want to split the string in the following little example using semicolons as separators. The only thing we have to do is to pass ";" as an argument to split():
>>> line = "James;Miller;teacher;Perl"
>>> line.split(";")
['James', 'Miller', 'teacher', 'Perl']
The method split() has another optional parameter, maxsplit.
If maxsplit is given, at most maxsplit splits are done. This means that the resulting list will have at most "maxsplit + 1" elements.
We will illustrate the mode of operation of maxsplit in the next example:
>>> mammon = "The god of the world's leading religion. The chief temple is in the holy city of New York."
>>> mammon.split(" ",3)
['The', 'god', 'of', "the world's leading religion. The chief temple is in the holy city of New York."]
We used a blank as the delimiter string in the previous example, which can be a problem: if several blanks or other whitespace characters follow each other, split() will split the string at every single blank, so that we get empty strings and strings containing only a tab ('\t') in our result list:
>>> mammon = "The god  \t of the world's leading religion. The chief temple is in the holy city of New York."
>>> mammon.split(" ",5)
['The', 'god', '', '\t', 'of', "the world's leading religion. The chief temple is in the holy city of New York."]
We can avoid these empty strings by using None as the first argument. Now split will use the default behaviour, i.e. every run of consecutive whitespace characters is taken as one separator:
>>> mammon.split(None,5)
['The', 'god', 'of', 'the', "world's", 'leading religion. The chief temple is in the holy city of New York.']

Regular Expression Split

The string method split() is the right tool in many cases, but what if you want, for example, the bare words of a text, i.e. without any special characters and whitespace? In this case we have to use the split function from the re module. We illustrate this function with a short text from the beginning of Metamorphoses by Ovid:
>>> import re
>>> metamorphoses = "OF bodies chang'd to various forms, I sing: Ye Gods, from whom these miracles did spring, Inspire my numbers with coelestial heat;"
>>> re.split(r"\W+", metamorphoses)
['OF', 'bodies', 'chang', 'd', 'to', 'various', 'forms', 'I', 'sing', 'Ye', 'Gods', 'from', 'whom', 'these', 'miracles', 'did', 'spring', 'Inspire', 'my', 'numbers', 'with', 'coelestial', 'heat', '']
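The empty string at the end of the list appears because the text ends with a separator (the semicolon). The sketch below, using a shortened version of the quotation, shows two possible ways to get rid of it, and also the optional maxsplit argument of re.split. The variable names are illustrative.

```python
import re

metamorphoses = "OF bodies chang'd to various forms, I sing:"

# The trailing ":" is a separator, so the last element is empty.
parts = re.split(r"\W+", metamorphoses)
print(parts)

# Option 1: filter out the empty strings after splitting.
words_filtered = [part for part in parts if part]

# Option 2: describe the words instead of the separators.
words_found = re.findall(r"\w+", metamorphoses)
print(words_found)

# maxsplit works as in str.split(): at most two splits are done here.
first_two = re.split(r"\W+", metamorphoses, maxsplit=2)
print(first_two)   # ['OF', 'bodies', "chang'd to various forms, I sing:"]
```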
The following example is a good case where the regular expression is really superior to the string split. Let's assume that we have data lines with the surnames, first names and professions of people. We want to clear the data lines of the superfluous and redundant text descriptions, i.e. "surname: ", "prename: " and so on, so that we have solely the surname in the first column, the first name in the second column and the profession in the third column:
>>> import re
>>> lines = ["surname: Obama, prename: Barack, profession: president", "surname: Merkel, prename: Angela, profession: chancellor"]
>>> for line in lines:
...     re.split(r",* *\w*: ", line)
...
['', 'Obama', 'Barack', 'president']
['', 'Merkel', 'Angela', 'chancellor']
We can easily improve the script by using a slice operator, so that we don't have the empty string as the first element of our result lists:
>>> import re
>>> lines = ["surname: Obama, prename: Barack, profession: president", "surname: Merkel, prename: Angela, profession: chancellor"]
>>> for line in lines:
...     re.split(r",* *\w*: ", line)[1:]
...
['Obama', 'Barack', 'president']
['Merkel', 'Angela', 'chancellor']
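If the field names should be kept rather than thrown away, the same data lines can also be read with findall instead of split. This is only a sketch of an alternative, assuming that every value is a single word; the names field_re and records are made up for the example.

```python
import re

lines = [
    "surname: Obama, prename: Barack, profession: president",
    "surname: Merkel, prename: Angela, profession: chancellor",
]

# Two groups: the field name before ": " and the one-word value after it.
field_re = re.compile(r"(\w+): (\w+)")

records = []
for line in lines:
    # findall returns a list of (name, value) tuples,
    # which dict() turns into a dictionary.
    records.append(dict(field_re.findall(line)))

for record in records:
    print(record["surname"], record["prename"], record["profession"])
```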

Search and Replace with sub

re.sub(regex, replacement, subject)
Every match of the regular expression regex in the string subject will be replaced by the string replacement.
>>> import re
>>> str = "yes I said yes I will Yes."
>>> res = re.sub("[yY]es","no", str)
>>> print(res)
no I said no I will no.
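re.sub accepts a few more things than a fixed replacement string. The following sketch shows the optional count and flags arguments, a backreference in the replacement, and a function as the replacement. The helper keep_case and the sample name are hypothetical and only serve as illustration.

```python
import re

text = "yes I said yes I will Yes."

# flags=re.I makes the character class [yY] unnecessary.
all_no = re.sub("yes", "no", text, flags=re.I)
print(all_no)       # no I said no I will no.

# count limits the number of replacements.
first_only = re.sub("[yY]es", "no", text, count=1)
print(first_only)   # no I said yes I will Yes.

# Backreferences like \1 and \2 refer to groups of the match.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Miller, James")
print(swapped)      # James Miller


def keep_case(match):
    """Return 'No' or 'no', following the capitalisation of the match."""
    return "No" if match.group()[0].isupper() else "no"


# The replacement may be a function that receives the match object.
cased = re.sub("[yY]es", keep_case, text)
print(cased)        # no I said no I will No.
```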