Notes of Maks Nemisj

Experiments with JavaScript

Introduction into regular expressions in Python ( for Javascripters )

This article is dedicated to regular expressions in Python, explained on the basis of JavaScript knowledge. It is a follow-up to the article “Javascript to Python API reference guide”.

Not much of Python’s regexp API maps directly onto JavaScript. For this reason I’ve decided to write an introductory article about regexps in Python that builds on your current knowledge of JavaScript RegExps. It is still possible to reverse-engineer this article and learn about JavaScript from the Python examples ;). But enough talking, let’s dive in.

Working with regexp begins with importing the “re” module.

    import re

Now you are ready to perform searches, but forget about the lovely inline RegExp literals like /zork/gi. Python leaves us with normal, boring strings, e.g. "zork". Flags are also not part of the expression; they are defined in a different place, as you will shortly see for yourself.

Because the expression is a string, you will quickly get tired of escaping backslashes in your regexps. The good news is that Python has raw strings, which help with this problem. Just put the ‘r’ prefix before a string literal and backslashes will be kept as literal characters instead of starting escape sequences:

    # python:
    pattern = r"regexp\."  # the backslash stays in the string as-is
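
To see what the r prefix changes, here is a small sketch that compares a raw string with its escaped twin. The variable names are only illustrative.

```python
import re

# Both literals describe the same pattern: "regexp" followed by a literal dot.
raw_pattern = r"regexp\."
escaped_pattern = "regexp\\."
print(raw_pattern == escaped_pattern)   # True

# In a raw string the backslash and the "n" stay two separate characters.
print(len(r"\n"), len("\n"))            # 2 1

# The escaped dot matches only a real ".", not any character.
print(re.search(raw_pattern, "a regexp.") is not None)   # True
print(re.search(raw_pattern, "a regexpX") is not None)   # False
```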

Now that the basics are covered, it’s time to start. I propose we start this journey with a simple regexp test, which most of us perform every day.

    // js:
    var r = /bork/.test("dork bork fork zork");
    (r === true);

Python doesn’t provide a direct equivalent. Instead, the search method can be used to accomplish the desired task. This method returns None if nothing is found, and this is exactly what we can use for our test. I will explain the search method a bit later, but for now this will work as an equivalent of test.

    # python:
    import re
    r = re.search(r"bork", "dork bork fork zork") is not None
    # r is True
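
If you miss test a lot, you could wrap this check in a tiny helper. This is only a sketch; regex_test is a name I made up here, not something the re module provides.

```python
import re

def regex_test(pattern, text):
    """Return True if pattern matches anywhere in text, like RegExp.test in JS."""
    return re.search(pattern, text) is not None

print(regex_test(r"bork", "dork bork fork zork"))  # True
print(regex_test(r"bork", "dork fork zork"))       # False
```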

Well, a test is a fine start, but a regexp is often used for real searching. And while you would expect that I will now tell you more about the search method, you will be disappointed 🙂

Next, we are going to search for all occurrences of a regular expression in a string (using the GLOBAL flag), but with a different method. As you may have noticed, I’ve put extra emphasis on the word GLOBAL, because the match method in JavaScript returns different information depending on whether the global flag is set when groups are used. Example:

    // javascript:
    "dork bork fork zork".match(/(b|z)ork/g); // returns [ "bork", "zork" ]
    "dork bork fork zork".match(/(b|z)ork/);  // returns [ "bork", "b" ]

From this snippet you can see that match omits groups and returns only entire matches when the g flag is on, and includes group information when the g flag is off.

Python, on the other hand, is “more” consistent with its return values. When the pattern contains groups, findall doesn’t omit them; instead it returns ONLY them. This means that if you want a simple list of all whole matches, the groups SHOULD be non-capturing (written with ?: right after the opening parenthesis).

    # python
    import re
    re.findall(r"(?:b|z)ork", "dork bork fork zork") == [ "bork", "zork" ]
    re.findall(r"(?:b|z)ork", "dork") == [ ]

Please also take a look at the return value of the last call. When no match is found, an empty list is returned and NOT null like in JavaScript.

UPDATE: In Python an empty list evaluates to false, which means you can use it directly in an if statement:

    # python
    if []:
        print("Will never be called")

If you forget to make the groups non-capturing, you get a different result:

    # python:
    import re
    re.findall(r"(b|z)ork", "dork bork fork zork") == [ "b", "z" ]

See? This result has only groups in it and not an entire match.

Sometimes I use a workaround that gives me a more powerful version: wrap the entire regexp in a group, and findall will give you the whole match as the first item of each tuple. Since this is a quirk, there is also a normal way of doing it in Python, but I will come to that later. First, the quirk example:

    # python:
    import re
    re.findall(r"((b|z)ork)", "dork bork fork zork") == [('bork', 'b'), ('zork', 'z')]
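
Since every item in that result is a tuple, you can unpack it right in the loop. A small sketch, with variable names picked just for this example:

```python
import re

text = "dork bork fork zork"

# Outer group = whole match, inner group = first letter.
for whole, letter in re.findall(r"((b|z)ork)", text):
    print(whole, letter)
# bork b
# zork z

# Keep only the whole matches and throw the inner group away.
whole_matches = [whole for whole, _ in re.findall(r"((b|z)ork)", text)]
print(whole_matches)   # ['bork', 'zork']
```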

That’s fun, isn’t it? JavaScript has no direct mapping to such an extended result. You could achieve the same with replace, but that’s a different story. Still, there are a couple of methods to go.

The next question is how you would, in a pythonic way, get both the groups and the whole match. Let’s start with a simple, non-global version. In JavaScript it’s a matter of taking away the ‘g’ flag, right?

    // javascript:
    var result = "dork bork fork zork".match(/(b|z)ork/);
    // result is [ "bork", "b" ]
    // full match of the regexp
    var match  = result[0]; // equals "bork"
    var group1 = result[1]; // equals "b"
    // var groupN = result[N];

In Python you would use search as the equivalent. There is also the match method, which is similar to search but slightly limited: it only matches at the beginning of the string. You can read more about it in the documentation of the re module.

    # python:
    import re
    result = re.search(r"(b|z)ork", "dork bork fork zork")
    # result is a MatchObject instance
    # full match of the regexp
    match  = result.group(0) # 0 can be omitted - result.group() will do the same
    group1 = result.group(1)
    # groupN = result.group(N)
    re.search(r"(b|z)ork", "dork") is None
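
The limitation of match mentioned above is that it only tries the pattern at the very beginning of the string. A quick sketch to show the difference between the two:

```python
import re

text = "dork bork fork zork"

# re.match only tries the pattern at the start of the string.
print(re.match(r"(b|z)ork", text))            # None, the string starts with "dork"
print(re.match(r"dork", text).group())        # dork

# re.search scans the whole string and stops at the first match.
print(re.search(r"(b|z)ork", text).group())   # bork
```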

While you are used to working with arrays of strings in JavaScript, Python gives you access to the MatchObject itself. This object carries a lot of extra information that you can use when doing regexp matches, such as the position of the match; the manual lists all of it.
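
To give an idea of that extra information, here is a short sketch with a few of the attributes and methods I find most useful. It is not a complete list.

```python
import re

result = re.search(r"(b|z)ork", "dork bork fork zork")

print(result.group())    # bork                 (the whole match)
print(result.groups())   # ('b',)               (all groups as a tuple)
print(result.start())    # 5                    (index where the match starts)
print(result.end())      # 9                    (index right after the match)
print(result.span())     # (5, 9)               (start and end together)
print(result.string)     # dork bork fork zork  (the string that was searched)
```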

I think your next question is: what is the pythonic way of doing this for all matches? As you know, JavaScript doesn’t have a dedicated method for it, and the replace method is often used in such situations.

    // javascript:
    "dork bork fork zork".replace(/(b|z)ork/g, function(match, group1, pos, full_str) {
        // with more groups, group2..groupN come between group1 and pos
        // use position, group information, etc
        return match; // leave the string unchanged
    });

To get the groups of all matches in Python you can use the finditer method, which returns an iterator of MatchObject instances.

    # python:
    import re
    matches = re.finditer(r"(b|z)ork", "dork bork fork zork")
    for result in matches:
        match  = result.group()
        group1 = result.group(1)
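
If you want the same information the replace callback hands you in JavaScript (match, group and position), you can collect it in one go. A minimal sketch, the name found is just for illustration:

```python
import re

text = "dork bork fork zork"

# Collect (whole match, first group, position) for every match.
found = [(m.group(), m.group(1), m.start())
         for m in re.finditer(r"(b|z)ork", text)]
print(found)   # [('bork', 'b', 5), ('zork', 'z', 15)]
```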

Okidoki. The basics are covered. Now for the last part: flags, where to put them and how to use them.

Normally, regexps in Python are compiled before being used. I haven’t used this feature so far ’cause I wanted the examples to be as close as possible to the JavaScript ones. Compiled regexps have the same methods that I’ve already covered, and the flags are passed when compiling.

    # python
    import re
    p = re.compile(r"(b|z)ork", re.IGNORECASE)
    p.search("dork bork fork zork")
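
Compiling is not the only place for flags. As far as I know, the module-level functions accept them as an extra argument too, and several flags can be combined with the | operator. A small sketch with made-up sample strings:

```python
import re

text = "Dork BORK fork Zork"

# Flags can be passed as the last argument of the module-level functions.
print(re.findall(r"(?:b|z)ork", text, re.IGNORECASE))   # ['BORK', 'Zork']

# Several flags are combined with the bitwise OR operator.
pattern = re.compile(r"^(?:d|f)ork", re.IGNORECASE | re.MULTILINE)
print(pattern.findall("Dork\nfork\nzork"))               # ['Dork', 'fork']

# Short aliases exist as well: re.I is the same flag as re.IGNORECASE.
print(re.I == re.IGNORECASE)                             # True
```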

That finishes my introduction. Before I go, I would like to give some advice about the findall method. Despite the fact that it’s easier to map it to your JavaScript knowledge, I would recommend that you use finditer instead of findall. First of all, you will save yourself time by not having to fix capturing groups all the time, and secondly, finditer is much more powerful and can be used for a broader scope of problems.
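
To illustrate the advice, here is a sketch that runs three variations of the same regexp through both methods. The findall result changes shape every time the groups change, while the finditer code keeps returning whole matches.

```python
import re

text = "dork bork fork zork"

for pattern in (r"(?:b|z)ork", r"(b|z)ork", r"((b|z)or)k"):
    via_findall = re.findall(pattern, text)
    via_finditer = [m.group() for m in re.finditer(pattern, text)]
    print(via_findall, via_finditer)
# ['bork', 'zork'] ['bork', 'zork']
# ['b', 'z'] ['bork', 'zork']
# [('bor', 'b'), ('zor', 'z')] ['bork', 'zork']
```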

Thank you for your attention.

