0

What is the correct regex to use with re.search() to find and return a file extension in a string?

Such as: (.+).(avi|rar|zip|txt)

I need it to search a string and, if it contains any of those (avi, rar, etc.), return just that extension.

Thanks!

EDIT: I should add that it needs to be case insensitive.

1
  • Do you really want to search a string for the first occurrence of something like .avi or do you want to check that a string ends with that? Asking another way, is the string general text "Fred sent me foo.rar today" or is it supposed to contain a file name or path whose extension you want to extract? Commented Oct 11, 2010 at 19:49

6 Answers 6

8

The standard library is better ;)

>>> import os
>>> os.path.splitext('hello.py')
('hello', '.py')
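
A minimal sketch of how splitext could be combined with a case-insensitive check against the extensions from the question. The names `ALLOWED` and `wanted_extension` are hypothetical, chosen only for this illustration.

```python
import os.path

# Hypothetical whitelist, taken from the extensions listed in the question.
ALLOWED = {"avi", "rar", "zip", "txt"}

def wanted_extension(filename):
    """Return the extension (without the dot) if it is in ALLOWED, else None."""
    _, ext = os.path.splitext(filename)
    ext = ext[1:]  # drop the leading dot; '' when there is no extension
    # Compare in lower case so that 'AVI' and 'avi' are treated alike.
    return ext if ext.lower() in ALLOWED else None

print(wanted_extension("movie.AVI"))
print(wanted_extension("hello.py"))
```

Note that splitext treats a name such as `.zip` as having no extension, so this sketch returns None for it.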

6

You need:

(.)\.(avi|rar|zip|txt)$

Note the backslash to escape the dot. This will make it look for a literal dot rather than any character.

To make it case insensitive, use the re.I flag in your search call.

re.search(r'(.)\.(avi|rar|zip|txt)$', string, re.I)
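
To return just the extension, one option is to capture only the alternation and read it from the match object. This is a sketch under that assumption; `EXT_RE` and `find_extension` are hypothetical names, and the pattern is a variant of the one above with the first group dropped.

```python
import re

# Case-insensitive, anchored at the end of the string; only the
# extension itself is captured (group 1).
EXT_RE = re.compile(r'.\.(avi|rar|zip|txt)$', re.I)

def find_extension(text):
    """Return the matched extension as it appears in text, or None."""
    match = EXT_RE.search(text)
    return match.group(1) if match else None

print(find_extension("clip.AVI"))
print(find_extension("notes.doc"))
```

re.search returns None when nothing matches, so the result has to be checked before calling group().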

10 Comments

So is there also a flag that makes the Python interpreter case-insensitive? Otherwise we have to import re as RE to be able to find RE.I...
You can make it slightly more efficient, and more precisely express what's being looked for, by changing it to .\.(avi|rar|zip|txt)$: this ensures that there's some character before the dot and that the file extension is at the end of the string. The extension then ends up as the first group rather than the second, and you don't keep a group that you don't need.
@Nick T: the re.I flag is just for the regular expressions module. I'm not aware of a way to make the rest of python case-insensitive.
@JoshD: I was making some (fail) joke at you messing up case with a flag that sets case-insensitivity. (RE.I instead of re.I)
@Nick T: Well, nuts. Now I look quite the fool. I fixed the answer, though.
1

Short interactive run:

>>> import re
>>> pat="(.+)\.(avi|rar|zip|txt)"
>>> re.search(pat, "abcdefg.zip", re.IGNORECASE).groups()
('abcdefg', 'zip')
>>> re.search(pat, "abcdefg.ZIP", re.IGNORECASE).groups()
('abcdefg', 'ZIP')
>>> 

1 Comment

In this particular case it's a non-issue, but it is recommended that regex literals be raw strings, to avoid double escaping. Use r"(.+)\.(avi|rar|zip|txt)"
0
(.+)[.](avi|rar|zip|txt)

Group 2 will then be the extension.

I have just written a blog post about regular expressions, http://blogs.appframe.com/erikv/2010-09-23-Regular-Expression, if you want to read more about this.


0

Since I think regex is evil...

def return_extension(filename):
    '''(This function assumes that filenames such as `.foo` have extension
    `foo`.)
    '''
    tokens = filename.split('.')

    return '' if len(tokens) == 1 else tokens[-1]

...I advocate simply parsing the filename.

1 Comment

Reinventing the wheel but not reinventing the axle is even more evil.
0

If you know that the extension is at the very end of the string, this should work well:

.\.(avi|rar|zip|txt)$
  • The first bit will ensure that there's some character before the dot.

  • The $ specifies that the file extension is at the end of the string, i.e. the $ means "the string ends here". For the gory details, including some edge cases with newlines that you should be aware of, see the comment discussion for JoshD's answer, as well as the entry for $ in the docs.

So then the only entry in the match.groups() tuple, i.e. match.groups()[0], will be the extension itself.
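
The newline edge case can be shown with a short sketch. The sample string is made up for illustration; it stands in for a line read from a file without stripping the newline.

```python
import re

# Hypothetical input: a file name with its trailing newline still attached.
name = "archive.zip\n"

# $ also matches just before a newline at the very end of the string.
dollar = re.search(r'.\.(avi|rar|zip|txt)$', name, re.I)
# \Z matches only at the very end of the string.
strict = re.search(r'.\.(avi|rar|zip|txt)\Z', name, re.I)

print(dollar.groups()[0] if dollar else None)
print(strict.groups()[0] if strict else None)
```

With $ the extension is still found despite the trailing newline, while the \Z version finds no match. Which behaviour is wanted depends on whether the input has already been stripped.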

6 Comments

@intuited: -1. s/some edge cases/FAIL/
@John Machin: Crap, really? I can't think of any. What's an example?
@intuited: """The justification for blah\Z in the default non-multiline mode is that re.match("blah$", "blah\n") will not return None"""
@John Machin: I think you need to re-read my answer, specifically the caveat that "you know that the extension is at the very end of the string". This is a pretty common use case (e.g. you've read in and done split('\n') on a file listing from a file or pipeline), so it seems worth giving a specific solution for it. In this case I think it's actually better to use the $ because it's compatible with fileinput.input() without having to rstrip the lines first.
@intuited: I did read that first line, twice, and decided twice not to take issue with it. Third time unlucky: How can one KNOW that the extension is at the very end of the string? In any case, whether you think that you know or not, \Z does the job reliably. Another way of looking at it is that $ is a perlish substitute for \n?\Z ... fileinput.input()? oh, yeah, I remember, Python crutch for awk tragics -- I stopped using it some time in 1998.
