20 06 2012
Working with regexp begins with importing the “re” module.
Now you are ready to perform searches, but forget about the lovely inline RegExp literals like /zork/gi. Python leaves us with normal, boring strings, e.g. "zork", and flags are not part of the expression either; they are defined in a different place, as you will shortly see for yourself.
Because the expression is a string, you will quickly get tired of escaping backslashes in your regexp. The good news is that Python has raw strings, which help with this problem. Just put an ‘r’ in front of the string literal and backslashes will be treated as ordinary characters rather than escapes:
# python: pattern = r"regexp\."
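To make the raw-string point concrete, here is a minimal sketch comparing the two spellings. The variable names (plain, raw, hit, miss) are my own and only serve the illustration:

```python
import re

# Without the r prefix every regexp backslash has to be doubled.
plain = "regexp\\."
raw = r"regexp\."

# Both literals produce exactly the same pattern string.
same = (plain == raw)

# The escaped dot matches a literal "." only, not any character.
hit = re.search(raw, "my regexp.") is not None
miss = re.search(raw, "my regexpX") is not None
```

The raw string changes nothing about the pattern itself; it only spares you the doubled backslashes in the source code.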
Now that the basics are covered, it’s time to start. I propose we start this journey with a simple regexp test, which most of us perform every day.
// js: var r = /bork/.test("dork bork fork zork"); (r === true);
Python doesn’t provide a direct equivalent; instead, the search method can be used to accomplish the same task. This method returns None if nothing is found, and this is exactly what we can use for our test. I will explain the search method a bit later, but for now this will do:
# python: r = re.search(r"regexp", "someString") is not None
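If you miss test() a lot, the check above can be wrapped in a small helper. This is only a sketch: regexp_test is a name I made up, not something the re module provides.

```python
import re

def regexp_test(pattern, text):
    """Return True if pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

found = regexp_test(r"bork", "dork bork fork zork")
missing = regexp_test(r"bork", "dork fork zork")
```

Comparing with "is not None" keeps the result a real boolean, just like the JavaScript version.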
Well, a test is a fine start, but a regexp is often used for real searching. And while you would expect that I will now tell you more about the search method, you will be disappointed 🙂
Next, we are going to search for all occurrences (using the GLOBAL flag) of a regular expression in a string, but with a different method. As you may have noticed, I’ve put extra emphasis on the word GLOBAL, because JavaScript’s match method behaves differently depending on that flag: it omits groups and returns only entire matches when g is on, and it returns group information for the first match when g is off.
Python, on the other hand, is “more” consistent with return values. findall doesn’t omit groups; if the pattern contains groups, it returns ONLY them. This means that if you want a simple list of all matches, groups SHOULD be non-capturing (by using ?: after the opening parenthesis).
# python:
import re
re.findall(r"(?:b|z)ork", "dork bork fork zork") == ["bork", "zork"]
re.findall(r"(?:b|z)ork", "dork") == []
UPDATE: In Python an empty list evaluates to false, which means you can use the result directly in an if construct:
# python:
if re.findall(r"(?:b|z)ork", "dork"):
    print("Will never be called")
Forgetting to make the groups non-capturing gives a different result:
# python:
import re
re.findall(r"(b|z)ork", "dork bork fork zork") == ["b", "z"]
See? This result has only groups in it and not an entire match.
Sometimes I use a workaround that gives me a more powerful version: wrap the entire regexp in a group, and the whole match becomes the first item of each tuple. Since it’s a quirk, there is also a normal way of doing this in Python, but I will get to it later. First, the quirk example:
# python:
import re
re.findall(r"((b|z)ork)", "dork bork fork zork") == [("bork", "b"), ("zork", "z")]
replace, but that’s a different story. Still there are a couple of methods to go.
In Python you would use search as an equivalent. There is also the match method, which is similar to search but more limited: it only matches at the beginning of the string. The Python documentation for the re module covers the difference in more detail.
# python:
import re
result = re.search(r"(b|z)ork", "dork bork fork zork")
# result is a MatchObject instance
match = result.group(0)   # full match of the regexp; 0 can be omitted, result.group() does the same
group1 = result.group(1)  # groupN = result.group(n)
re.search(r"(b|z)ork", "dork") is None
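The limitation of match is easiest to see side by side with search. A small sketch on the same test string (the variable names are mine):

```python
import re

text = "dork bork fork zork"

anywhere = re.search(r"(b|z)ork", text)   # scans the whole string
at_start = re.match(r"(b|z)ork", text)    # only tries the very beginning
at_start_dork = re.match(r"dork", text)   # succeeds: the text starts with "dork"
```

Here anywhere finds "bork", at_start is None because the string does not begin with "bork" or "zork", and at_start_dork matches "dork".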
replace method is used for such situations.
To get every match together with its groups in Python, you can use the finditer method, which returns an iterator of MatchObject instances.
# python:
import re
matches = re.finditer(r"(b|z)ork", "dork bork fork zork")
for result in matches:
    match = result.group()    # the whole match
    group1 = result.group(1)  # the first group
Okidoki. The basics are covered. Now for the last part: flags, where do you put them and how do you use them.
# python:
import re
p = re.compile(r"(b|z)ork", re.IGNORECASE)
p.search("dork bork fork zork")
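Compiling is not the only place where flags can go. As a rough sketch (the test strings and variable names are my own), the module-level functions accept a flags argument, several flags can be combined with |, and a flag can also be written inline in the pattern:

```python
import re

text = "dork BORK fork zork"

# Flags can be passed straight to the module-level functions...
direct = re.findall(r"(?:b|z)ork", text, re.IGNORECASE)

# ...combined with the | operator...
combined = re.findall(r"^(?:d|b)ork", "dork\nBORK", re.IGNORECASE | re.MULTILINE)

# ...or written inline at the start of the pattern.
inline = re.findall(r"(?i)(?:b|z)ork", text)
```

Note that there is no counterpart to the g flag here: whether you get one match or all of them depends on the method you call (search versus findall or finditer), not on a flag.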
That finishes my introduction. Before I go, I would like to give some advice: use finditer instead of findall. First of all, you will save yourself time by not having to fix captured groups all the time, and second, finditer is much more powerful and can be used for a broader range of problems.
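As one illustration of what finditer gives you beyond findall, each match object also knows where in the string it was found. A minimal sketch (the results list is just my way of collecting the values):

```python
import re

text = "dork bork fork zork"

results = []
for m in re.finditer(r"(b|z)ork", text):
    # group() is the whole match, group(1) the first group,
    # start() the offset of the match within the text.
    results.append((m.group(), m.group(1), m.start()))
```

With findall you would have to restructure the pattern to get the whole match and the group together, and the positions would not be available at all.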
Thank you for your attention.