0

I have a string like this:

"a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more".

I would like to get this as an output:

(("bla", 123, 456), ("bli", 789, 123), ("blu", 789))

I haven't been able to find the proper python regex to achieve that.

  • Do you actually want the result to be a string with parentheses, or a list of tuples of comma-separated values? Commented Oct 6, 2009 at 21:10

7 Answers

1
>>> import re
>>> re.findall(r'\{\{(\w+)\|(\w+)(?:\|(\w+))?\}\}', s)
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

If you want the numbers as integers, iterate over the output and convert the digit strings with int().
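
A minimal sketch of that conversion step, assuming digit-only fields should become int and the empty string left by the unmatched optional group should be dropped. The helper name convert is made up for this illustration and is not part of the original answer.

```python
import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

def convert(groups):
    # Hypothetical helper: skip the empty string produced by an unmatched
    # optional group, then turn digit-only fields into ints.
    return tuple(int(g) if g.isdigit() else g for g in groups if g)

matches = re.findall(r'\{\{(\w+)\|(\w+)(?:\|(\w+))?\}\}', s)
result = tuple(convert(m) for m in matches)
print(result)  # (('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
```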


6 Comments

That doesn't match what the question specified. He specifically wanted parentheses around all the data rather than a list of groups. Plus, he wanted double quotes only around the first element in each group and no quotes around the others.
And yet it was accepted... poorly written question, lucky answer?
@Bryan: regexes work on strings; they have no idea what numbers are and only know digits. The quotes around the data are presentational: they indicate that the value is a string. As I've clearly said, if the OP needs numbers, he can convert the respective values to integers.
Regarding the within-quotes requirement: I can only generalise so far. All I see is a single example; nowhere does the OP indicate that any other patterns are possible.
@SilentGhost: I know they work on strings. I was just trying to clarify because your solution doesn't give what was explicitly asked for. Your solution is probably what he really wanted though, since your answer was accepted.
1

You need a lot of escapes in your regular expression since {, } and | are special characters in them. A first step to extract the relevant parts of the string would be this:

regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)

For the example this gives:

[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

Then you can convert the digit strings to integers and remove the empty strings, such as the one in the last match.

1 Comment

Well, if you put {a,b} after a pattern, that is special, and you can omit one or both of a and b there. But I think if you just put "{{" into a pattern, it will just match "{{". I tried it, and it worked for me.
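
A quick check of the point made in this comment, as a sketch against Python's re module: braces that do not form a valid {m,n} quantifier are matched literally, so the escaped and unescaped patterns give the same result, while the pipe does change meaning when left unescaped. The shortened sample string is only for illustration.

```python
import re

s = "a word {{bla|123|456}} another {{blu|789}}"

# Escaped and unescaped braces behave the same here.
escaped = re.findall(r"\{\{(.*?)\}\}", s)
unescaped = re.findall(r"{{(.*?)}}", s)
print(escaped == unescaped)  # True
print(unescaped)             # ['bla|123|456', 'blu|789']

# The pipe is different: unescaped, it means alternation.
literal_pipe = re.findall(r"\|", "a|b")
alternation = re.findall(r"a|b", "a|b")
print(literal_pipe)  # ['|']
print(alternation)   # ['a', 'b']
```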
0
[re.split(r'\|', i) for i in re.findall(r"{{(.*?)}}", s)]

Returns:

[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]

This method works regardless of the number of elements in the {{ }} blocks.
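
To illustrate that claim, here is a small sketch on a made-up input with one, three, and five fields per block. For contrast it also runs a pattern with a fixed number of groups (two required, one optional), which only picks up the three-field block.

```python
import re

# Made-up sample with varying field counts per {{ }} block.
s = "{{solo}} text {{bla|123|456}} more {{x|1|2|3|4}}"

# Capture everything between the braces, then split on the pipe.
blocks = [block.split("|") for block in re.findall(r"{{(.*?)}}", s)]
print(blocks)  # [['solo'], ['bla', '123', '456'], ['x', '1', '2', '3', '4']]

# A pattern with a fixed group count misses the other two blocks.
fixed = re.findall(r"\{\{(\w+)\|(\w+)(?:\|(\w+))?\}\}", s)
print(fixed)   # [('bla', '123', '456')]
```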

0

To get output in the shape you wrote, you need a regex and a split:

import re
list(map(lambda p: p.split("|"), re.findall(r"\{\{([^}]*)\}\}", s)))

To get it with the numbers converted, do this:

toint = lambda x: int(x) if x.isdigit() else x
[tuple(map(toint, p.split("|"))) for p in re.findall(r"\{\{([^}]*)\}\}", s)]

0

Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.

import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []

for match in re.finditer(r'{{.*?}}', s):

    # Split on pipe (|) and strip non-alphanumerics (the surrounding braces)
    parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]

    # Convert to int when possible
    for index, part in enumerate(parts):
        try:
            parts[index] = int(part)
        except ValueError:
            pass

    result.append(tuple(parts))

print(result)

0

We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.

import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]

In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.

re.findall() returns a list containing the captured group from each match.

Finally, a list comprehension splits each string on "|" and passes the pieces to mixed_tuple(), which converts what it can to int and returns the result as a tuple.

0

Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.

from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList

LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))

patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))


s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

print(tuple(p[0] for p in patt.searchString(s)))

Prints:

(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
