Python - Replace strings in a data frame

Question

I'm trying to replace French addresses in a dataframe. I'm using a list of street types together with a regular expression.

def adresses(df):  

    liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV', 'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau', 'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie', 'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

    for i in liste_adresses:

        df['C'] = df['C'].str.replace(r'[0-9]+(,|\s+)i\s+\w+\s+(\w+)?(\s+)?(\w+)?(\s+)?([0-9]{5})?(\s+)?\w+?([0-9]{5})?','<address>')

    return df

My dataframe:

       A          B                                                                C
  French      house                      I live in 15 rue Louis Philippe 75001 Neuilly
 English      house               my address: 101-102 bd Charles de Gaulle 75001 Paris
  French  apartment                                                    my name is Liam
  French      house                                                       Hello George!
 English  apartment  This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it

In my output, nothing is replaced.

Expected output:

       A          B                         C
  French      house                                I live in <address>
 English      house                              my address: <address>
  French  apartment                                    my name is Liam
  French      house                                       Hello George!
 English  apartment  This is wrong: <address> and I'm not happy with it

Nothing happens because the variable i, which holds each element of liste_adresses, is written as a literal character inside the regex '[0-9]+(,|\s+)i\s+\...', so the pattern looks for the letter i and not for its value (for example 'allée'). It should be something like '[0-9]+(,|\s+)' + i + '\s+\...'; then something does happen, although it is still not the expected output.
– Ben.T, Commented Nov 23, 2018 at 14:35
In your full data, do the strings in column C always end with the address? In other words, can there be more characters after it, as in This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it?
– Ben.T, Commented Nov 23, 2018 at 14:38
OK, then in my opinion it becomes a more difficult task. Do you have a list of cities, similar to liste_adresses? Or are there too many cities in your data?
– Ben.T, Commented Nov 23, 2018 at 14:52
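
The point of the first comment can be sketched with the standard re module. This is only an illustration: the name street_word is hypothetical and stands in for one value of i, and the pattern is shortened to its first few pieces.

```python
import re

text = "I live in 15 rue Louis Philippe 75001 Neuilly"
street_word = "rue"  # hypothetical name: one value that i would take in the loop

# Here the pattern contains the literal letter "i", not the variable's value.
literal_pattern = r"[0-9]+(,|\s+)i\s+\w+"

# Concatenation inserts the value of the variable into the pattern instead.
interpolated_pattern = r"[0-9]+(,|\s+)" + re.escape(street_word) + r"\s+\w+"

literal_match = re.search(literal_pattern, text)
interpolated_match = re.search(interpolated_pattern, text)

print(literal_match)                # None
print(interpolated_match.group(0))  # 15 rue Louis
```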

Ben.T · Accepted Answer · 2018-11-23 16:05:23Z

The following solution may not work in every case. Because the address ends with either the postal code or a city name that you don't know in advance, I think one way is to look for:

one or more digits at the start, [0-9]+: all addresses start with a number
any characters, (.*): for example to catch -102
any word from liste_adresses, built with '|'.join(liste_adresses)
the 5-digit postal code, [0-9]{5}
the city name, if there is one, with ([^\.|\n]{0,2}[A-Z][a-z]*)*: here I assume that a dot or a new line after the postal code means the address is over. So match between 0 and 2 characters that are not a dot or a new line, [^\.|\n]{0,2} (inside the brackets the | is literal, so a pipe character is excluded as well), then one upper-case letter [A-Z], then any lower-case letters [a-z]* until the end of the word. The extra * at the end catches cities composed of two words, like Saint-Denis.
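
As a quick check of the last item, the postal-code and city part can be tried on its own. This is a sketch, not part of the original answer: the names city_tail and samples are hypothetical, and the second sample string is made up for illustration.

```python
import re

# Postal code followed by the optional city part, as described in the list above.
city_tail = r"[0-9]{5}([^\.|\n]{0,2}[A-Z][a-z]*)*"

samples = [
    "75001 Neuilly",
    "93200 Saint-Denis.",               # made-up example with a two-word city
    "75014 and I'm not happy with it",  # no capitalised word right after the code
]

matched = [re.search(city_tail, s).group(0) for s in samples]
print(matched)  # ['75001 Neuilly', '93200 Saint-Denis', '75014']
```

The third sample shows why the {0,2} limit matters: the capital I is more than two characters away from the postal code, so the match stops at the digits.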

Putting it all together:

liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

# every piece is a raw string so that \. and \n reach the regex engine unchanged
reg = r'[0-9]+(.*)(' + '|'.join(liste_adresses) + r')(.*)[0-9]{5}([^\.|\n]{0,2}[A-Z][a-z]*)*'

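# regex=True is needed with pandas 2.0+, where str.replace matches literally by default
print(df['C'].str.replace(reg, '<address>', regex=True))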
0                                  I live in <address>
1                                my address: <address>
2                                      my name is Liam
3                                        Hello George!
4    This is wrong: <address> and I'm not happy wit...
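
To check the pattern without a dataframe, the same regex can be applied to the five strings from the question using only the standard re module. This is a sketch under that assumption; the names rows and replaced are hypothetical.

```python
import re

liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

# Same pattern as above, compiled once so it can be reused on every row.
reg = re.compile(r'[0-9]+(.*)(' + '|'.join(liste_adresses)
                 + r')(.*)[0-9]{5}([^\.|\n]{0,2}[A-Z][a-z]*)*')

# The contents of column C from the question.
rows = [
    "I live in 15 rue Louis Philippe 75001 Neuilly",
    "my address: 101-102 bd Charles de Gaulle 75001 Paris",
    "my name is Liam",
    "Hello George!",
    "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it",
]

replaced = [reg.sub('<address>', row) for row in rows]
for line in replaced:
    print(line)
```

Unlike the pandas printout above, this shows the last row in full, so the text after the address can be seen to survive the replacement.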
