Python regex statement

Question

I'd like to create a regex statement in Python 2.7.8 that will substitute characters. It will work like this...

ó -> o
ú -> u
é -> e
á -> a
í -> i
ù,ú  -> u

These are the only Unicode characters that I would like to change. I don't want to change other Unicode characters such as ë and ä. So the word thójlà will become thojla. I'm sure there is a way to avoid writing a separate regex for each character, like below.

word = re.sub(ur'ó', ur'o', word)
word = re.sub(ur'ú', ur'u', word)
word = re.sub(ur'é', ur'e', word)
....

I've been trying to figure this out but haven't had any luck. Any help is appreciated!

Are you sure you want regex? This sounds like a job for replace().
– Gillespie, Commented Dec 8, 2014 at 23:16

chapelo · Accepted Answer · 2014-12-08 23:55:00Z

4

Try with str.translate and maketrans...

print('thójlà'.translate(str.maketrans('óúéáíùú', 'oueaiuu')))
# thojlà

This way you ensure that only the substitutions you want are made.

If you have many strings to change, assign the result of maketrans to a variable, like

table = str.maketrans('óúéáíùú', 'oueaiuu')

and then, each string can be translated as

s.translate(table)
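
As a minimal sketch of reusing one table across several strings (Python 3 only): str.maketrans also accepts a single dict, which keeps each source character next to its replacement. The sample words below are hypothetical, and à is deliberately left out of the table to match the list in the question.

```python
# Python 3: build the table once from a dict, then reuse it.
table = str.maketrans({
    'ó': 'o', 'ú': 'u', 'é': 'e', 'á': 'a', 'í': 'i', 'ù': 'u',
})

words = ['thójlà', 'día', 'äë']  # hypothetical sample strings
cleaned = [w.translate(table) for w in words]
print(cleaned)  # ['thojlà', 'dia', 'äë']
```

Characters that are not keys in the table (here à, ä, ë) pass through unchanged.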

edited Dec 8, 2014 at 23:55

answered Dec 8, 2014 at 23:39

chapelo


2 Comments

twasbrillig Over a year ago

Nice. "There should be one-- and preferably only one --obvious way to do it." and this is it.

twasbrillig Over a year ago

Note, the code above is for Python 3. In Python 2 it's string, not str: print 'thójlà'.translate(string.maketrans('óúéáíùú', 'oueaiuu'))
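
A caveat on the Python 2 form in the comment above: string.maketrans works on byte strings and requires both arguments to have the same length, so with UTF-8 encoded accented characters it is likely to raise a ValueError. A hedged alternative that should run on both Python 2.7 and Python 3 is to translate a unicode string with a dict keyed by code points. The names table, word and result are just illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Map each Unicode code point to its replacement text.
# unicode.translate (Python 2) and str.translate (Python 3) both accept this.
table = {ord(src): dst for src, dst in zip(u'óúéáíù', u'oueaiu')}

word = u'thójlà'
result = word.translate(table)
print(result)  # thojlà (à is not in the table, so it is kept)
```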

Gillespie · Accepted Answer · 2014-12-09 01:18:01Z

3

With String's replace() function you can do something like:

>>> x = "thójlà"
>>> x
'thójlà'
>>> x = x.replace('ó', 'o')
>>> x
'thojlà'
>>> x = x.replace('à', 'a')
>>> x
'thojla'

A generalized way:

# -*- coding: utf-8 -*-

replace_dict = {
    'á':'a',
    'à':'a',
    'é':'e',
    'í':'i',
    'ó':'o',
    'ù':'u',
    'ú':'u'
}

str1 = "thójlà"

for key in replace_dict:
    str1 = str1.replace(key, replace_dict[key])

print(str1) #prints 'thojla'

A third way, if your list of character mappings is getting too large:

# -*- coding: utf-8 -*-

replace_dict = {
    'a':['á','à'],
    'e':['é'],
    'i':['í'],
    'o':['ó'],
    'u':['ù','ú']
}

str1 = "thójlà"

for key, values in replace_dict.items():
    for character in values:
        str1 = str1.replace(character, key)

print(str1)

edited Dec 9, 2014 at 1:18

answered Dec 8, 2014 at 23:18

Gillespie


6 Comments

user2743 Over a year ago

is there a way that I can do this without having to create a statement for each character? I could have done that with re.sub but that's what I want to avoid in case the list of characters to be changed becomes large. Thanks for the help!

Gillespie Over a year ago

I also added a third method.

twasbrillig Over a year ago

@RPGillespie, to make it a bit more efficient, you can do: for key, values in replace_dict.items():, for character in values:

Matt Coubrough Over a year ago

the dictionary replace technique could be pretty slow if there are a lot of replacement characters and/or the replacement text is long.
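
One possible way to address that concern, and to stay with regex as the question asked, is a single re.sub pass that looks each match up in the dictionary. This is only a sketch: the names replacements, pattern and strip_accents are illustrative, and the keys are assumed to be single characters.

```python
import re

replacements = {
    'á': 'a', 'à': 'a', 'é': 'e', 'í': 'i',
    'ó': 'o', 'ù': 'u', 'ú': 'u',
}

# One character class containing every key, compiled once.
pattern = re.compile('[' + re.escape(''.join(replacements)) + ']')

def strip_accents(text):
    # Each match is replaced by its dictionary value in a single scan.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(strip_accents('thójlà'))  # thojla
```

The string is scanned once regardless of how many characters are in the mapping, instead of once per key as in the replace() loops.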

Gillespie Over a year ago

@twasbrillig I never realized you could do that!


twasbrillig · Accepted Answer · 2014-12-08 23:53:48Z

1

If you can use external packages, the easiest way, I think, would be to use unidecode. For example:

from unidecode import unidecode

print(unidecode('thójlà'))
# prints: thojla

edited Dec 8, 2014 at 23:53

twasbrillig


answered Dec 8, 2014 at 23:17

Marcin


3 Comments

user2743 Over a year ago

What if I have other Unicode characters that I don't want to substitute? Will those characters be affected? Thanks for the help!

Gillespie Over a year ago

Yes, all non-ascii characters will be transliterated.

Marcin Over a year ago

Maybe there are some options to specify which characters are changed and which are not. I don't know about this. I assumed you wanted to change everything.
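
If only some accented characters should be stripped, one option is to skip unidecode and use the standard library's unicodedata module on a whitelist of characters. This is a sketch using a different technique, not a unidecode feature; TARGETS and strip_selected are hypothetical names, and the whitelist assumes à should be included as in the question's example word.

```python
import unicodedata

# Characters allowed to lose their accent (assumed whitelist).
TARGETS = set('óúéáíùà')

def strip_selected(text):
    out = []
    for ch in text:
        if ch in TARGETS:
            # NFD splits e.g. 'ó' into 'o' plus a combining accent;
            # dropping the combining marks leaves the base letter.
            decomposed = unicodedata.normalize('NFD', ch)
            ch = ''.join(c for c in decomposed
                         if not unicodedata.combining(c))
        out.append(ch)
    return ''.join(out)

print(strip_selected('thójlà ëä'))  # thojla ëä
```

Characters outside the whitelist, such as ë and ä, are never decomposed and so are left as they are.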
