Regex for extraction in Python

Question

I have a string like this:

"a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more".

I would like to get this as an output:

(("bla", 123, 456), ("bli", 789, 123), ("blu", 789))

I haven't been able to find the proper Python regex to achieve that.

Do you actually want the result to be a string with parentheses, or a list of tuples of comma-separated values?
– Bryan Oakley, Commented Oct 6, 2009 at 21:10

SilentGhost · Accepted Answer · 2009-10-06 20:56:55Z

1

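>>> import re
>>> s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
>>> re.findall(r' {{(\w+)\|(\w+)(?:\|(\w+))?}} ', s)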
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

If you still want numbers there, you'd need to iterate over the output and convert the digit strings to integers with int.
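
The conversion step described above could look roughly like the following. This is a minimal sketch, not part of the original answer: it assumes a missing third field comes back as an empty string (as in the output shown) and should simply be dropped, it leaves out the surrounding spaces in the pattern, and the helper name `to_tuple` is hypothetical.

```python
import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

def to_tuple(groups):
    """Hypothetical helper: drop empty groups, turn digit-only strings into ints."""
    return tuple(int(g) if g.isdigit() else g for g in groups if g)

# findall returns one tuple of strings per {{...}} block.
matches = re.findall(r'\{\{(\w+)\|(\w+)(?:\|(\w+))?\}\}', s)
result = tuple(to_tuple(m) for m in matches)
print(result)  # (('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
```

Dropping empty groups is a choice made here to reproduce the tuple lengths in the question; if positions matter, keeping a placeholder such as None may be preferable.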

6 Comments

Bryan Oakley Over a year ago

that doesn't match what the question specified. he specifically wanted parentheses around all the data rather than a list of groups. Plus, he wanted double quotes only around the first element in each group and not quotes around the others.

Cascabel Over a year ago

And yet it was accepted... poorly written question, lucky answer?

SilentGhost Over a year ago

@Bryan: regexes work on strings; they have no idea what numbers are, they only know digits. The quotes around the data are presentational: they indicate that the value is a string. As I've clearly said, if the OP needs numbers, he can convert the respective values to integers.

SilentGhost Over a year ago

Regarding the within-quotes requirement: I can only generalise so far. All I see is a single example; nowhere does the OP indicate that any other patterns are possible.

Bryan Oakley Over a year ago

@SilentGhost: I know they work on strings. I was just trying to clarify because your solution doesn't give what was explicitly asked for. Your solution is probably what he really wanted though, since your answer was accepted.

sth · Answer · 2009-10-06 21:43:44Z

1

You need a lot of escapes in your regular expression since {, } and | are special characters in them. A first step to extract the relevant parts of the string would be this:

import re
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)

For the example this gives:

[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

Then you can continue with converting strings with digits into integers and removing empty strings like for the last match.

edited Oct 6, 2009 at 21:43

answered Oct 6, 2009 at 20:54

1 Comment

steveha Over a year ago

Well, if you put {a,b} after a pattern, that is special, and you can omit one or both of a and b there. But I think if you just put "{{" into a pattern, it will just match "{{". I tried it, and it worked for me.
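
The point about braces can be checked directly. The sketch below uses a made-up input string (`text` is not from the question) and relies only on standard `re` behaviour: a `{` that does not start a valid repetition count is treated as a literal character.

```python
import re

text = "x {{bla|123}} y"  # hypothetical sample input

# "{{" is not a valid quantifier, so Python's re treats it as two literal braces.
unescaped = re.findall("{{(.*?)}}", text)

# Escaping the braces explicitly matches the same text.
escaped = re.findall(r"\{\{(.*?)\}\}", text)

# By contrast, "{2}" directly after a pattern is a repetition count.
repeated = re.findall(r"a{2}", "aaa")

print(unescaped, escaped, repeated)  # ['bla|123'] ['bla|123'] ['aa']
```

Escaping is therefore optional for the braces here, though it arguably makes the intent clearer to readers; the pipe does need escaping (or a character class) because an unescaped `|` means alternation.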

Jeff B · Answer · 2009-10-06 21:02:03Z

0

import re
[re.split(r'\|', i) for i in re.findall("{{(.*?)}}", s)]

Returns:

[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]

This method works regardless of the number of elements in the {{ }} blocks.
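
To illustrate the claim about varying element counts, here is a small sketch using a hypothetical input (`sample` is invented for this example) with one, two and four fields per block. It uses `str.split` instead of `re.split`, which is equivalent for a fixed single-character separator.

```python
import re

# Hypothetical input with one, two and four fields per block.
sample = "{{solo}} text {{key|1}} text {{a|b|c|d}}"

# Grab the contents of each {{...}} block, then split on the pipe.
blocks = [block.split("|") for block in re.findall(r"\{\{(.*?)\}\}", sample)]
print(blocks)  # [['solo'], ['key', '1'], ['a', 'b', 'c', 'd']]
```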

Joakim Lundborg · Answer · 2009-10-06 21:10:02Z

0

To get the exact output you wrote, you need a regex and a split:

import re
[p.split("|") for p in re.findall(r"\{\{([^}]*)\}\}", s)]

To get it with the numbers converted, do this:

def toint(x):
    return int(x) if x.isdigit() else x
[tuple(toint(x) for x in p.split("|")) for p in re.findall(r"\{\{([^}]*)\}\}", s)]

edited Oct 6, 2009 at 21:10

answered Oct 6, 2009 at 21:02

Kenan Banks · Answer · 2009-10-06 21:14:40Z

0

Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.

import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []

for match in re.finditer('{{.*?}}', s):

    # Split on pipe (|) and filter out non-alphanumerics (the surrounding braces).
    # In Python 3, filter() returns an iterator, so join it back into a string.
    parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]

    # Convert to int when possible
    for index, part in enumerate(parts):
        try:
            parts[index] = int(part)
        except ValueError:
            pass

    result.append(tuple(parts))

steveha · Answer · 2009-10-06 21:30:01Z

We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.

import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]

In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.

re.findall() returns a list of groups matched from the pattern.

Finally, a list comprehension splits each string and returns the result as a tuple.
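
As a side note on what re.findall() returns, the shape of the result depends on how many groups the pattern contains. The sketch below uses a shortened, made-up input string and three variations of the pattern (none of them from the answer) to show the difference.

```python
import re

s = "{{bla|123}} and {{blu|789}}"  # shortened sample input

# No group: findall returns the whole matches.
no_group = re.findall(r"\{\{[^}]*\}\}", s)

# One group: findall returns just that group's text for each match.
one_group = re.findall(r"\{\{([^}]*)\}\}", s)

# Several groups: findall returns one tuple of group texts per match.
two_groups = re.findall(r"\{\{(\w+)\|(\w+)\}\}", s)

print(no_group)    # ['{{bla|123}}', '{{blu|789}}']
print(one_group)   # ['bla|123', 'blu|789']
print(two_groups)  # [('bla', '123'), ('blu', '789')]
```

With the single-group pattern used in this answer, each list element is a plain string, which is why it can be passed straight to split("|").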

PaulMcG · Answer · 2009-10-06 23:09:24Z

Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.

from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList

LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))

patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))


s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

print(tuple(p[0] for p in patt.searchString(s)))

Prints:

(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
