3

I'd like to create a regex statement in Python 2.7.8 that will substitute characters. It will work like this...

ó -> o
ú -> u
é -> e
á -> a
í -> i
ù,ú  -> u

These are the only Unicode characters that I would like to change. Other Unicode characters, such as ë and ä, should stay as they are. So the word thójlà should become thojla. I'm sure there is a way to do this without writing a separate regex for each character, like below.

word = re.sub(ur'ó', ur'o', word)
word = re.sub(ur'ú', ur'u', word)
word = re.sub(ur'é', ur'e', word)
....

I've been trying to figure this out but haven't had any luck. Any help is appreciated!

2 Comments

Are you sure you want regex? This sounds like a job for replace(). Commented Dec 8, 2014 at 23:16
You are not changing the à, so the result should be thojlà. Commented Dec 8, 2014 at 23:45

3 Answers

4

Try with str.translate and maketrans...

print('thójlà'.translate(str.maketrans('óúéáíùú', 'oueaiuu')))
# thojlà

This way, only the substitutions you list are made; every other character is left untouched.

If you have many strings to change, build the table once and assign it to a variable, like

table = str.maketrans('óúéáíùú', 'oueaiuu')

and then, each string can be translated as

s.translate(table)
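
Since the question targets Python 2.7, here is a minimal sketch of the same idea using a dict of code points, which `unicode.translate` in Python 2 and `str.translate` in Python 3 both accept. The names `ACCENTED`, `PLAIN`, `TABLE` and `strip_selected_accents` are made up for illustration, and the mapping only covers the characters listed in the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical names for illustration; only these characters are mapped.
ACCENTED = u'óúéáíù'
PLAIN = u'oueaiu'

# Map each source code point to its replacement code point.
TABLE = {ord(src): ord(dst) for src, dst in zip(ACCENTED, PLAIN)}


def strip_selected_accents(text):
    """Replace only the characters in ACCENTED; leave everything else alone."""
    return text.translate(TABLE)


print(strip_selected_accents(u'thójlà'))  # thojlà (à is not in the table)
print(strip_selected_accents(u'ëä'))      # ëä
```

To also turn à into a, add it to `ACCENTED` with a matching `a` in `PLAIN`.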

2 Comments

Nice. "There should be one-- and preferably only one --obvious way to do it." and this is it.
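Note, the code above is for Python 3. In Python 2, string.maketrans only accepts byte strings of equal length, so with UTF-8 source it raises ValueError for these multi-byte characters. Use a unicode string with a dict of code points instead: print u'thójlà'.translate({ord(a): ord(b) for a, b in zip(u'óúéáíùú', u'oueaiuu')})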
3

With the string replace() method you can do something like:

>>> x = "thójlà"
>>> x
'thójlà'
>>> x = x.replace('ó', 'o')
>>> x
'thojlà'
>>> x = x.replace('à', 'a')
>>> x
'thojla'

A generalized way:

# -*- coding: utf-8 -*-

replace_dict = {
    'á':'a',
    'à':'a',
    'é':'e',
    'í':'i',
    'ó':'o',
    'ù':'u',
    'ú':'u'
}

str1 = "thójlà"

for key in replace_dict:
    str1 = str1.replace(key, replace_dict[key])

print(str1)  # prints: thojla

A third way, if your list of character mappings is getting too large:

# -*- coding: utf-8 -*-

replace_dict = {
    'a':['á','à'],
    'e':['é'],
    'i':['í'],
    'o':['ó'],
    'u':['ù','ú']
}

str1 = "thójlà"

for key, values in replace_dict.items():
    for character in values:
        str1 = str1.replace(character, key)

print(str1)
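
The loops above scan the string once per mapped character. As an alternative that also answers the original wish for a single regex, here is a rough sketch that builds one character class from the dictionary keys and looks up each match in a callback. `REPLACEMENTS`, `PATTERN` and `replace_accents` are hypothetical names, and the mapping simply mirrors the dictionary used above.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import re

# Hypothetical mapping for illustration; extend it as needed.
REPLACEMENTS = {
    u'á': u'a', u'à': u'a',
    u'é': u'e',
    u'í': u'i',
    u'ó': u'o',
    u'ù': u'u', u'ú': u'u',
}

# One character class built from the keys, e.g. [áàéíóùú].
PATTERN = re.compile(u'[' + re.escape(u''.join(REPLACEMENTS)) + u']')


def replace_accents(text):
    """Substitute every mapped character in a single pass over text."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)


print(replace_accents(u'thójlà'))  # thojla
```

Characters that are not keys of `REPLACEMENTS`, such as ë and ä, never match the pattern and are left as they are.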

6 Comments

Is there a way that I can do this without having to create a statement for each character? I could have done that with re.sub, but that's what I want to avoid in case the list of characters to be changed becomes large. Thanks for the help!
I also added a third method.
@RPGillespie, to make it a bit more efficient, you can do: for key, values in replace_dict.items():, for character in values:
The dictionary replace technique could be pretty slow if there are a lot of replacement characters and/or the text being processed is long.
@twasbrillig I never realized you could do that!
1

If you can use external packages, the easiest way, I think, would be to use unidecode. For example:

from unidecode import unidecode

print(unidecode('thójlà'))
# prints: thojla

3 Comments

What if I have other Unicode characters that I don't want to substitute? Will those characters be affected? Thanks for the help!
Yes, all non-ASCII characters will be transliterated.
Maybe there are some options to specify which characters are changed and which are not; I don't know. I assumed you wanted to change everything.
