
I'm trying to replace French addresses in a dataframe column, using a list of street types and a regular expression.

def adresses(df):  

    liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV', 'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau', 'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie', 'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

    for i in liste_adresses:

        df['C'] = df['C'].str.replace(r'[0-9]+(,|\s+)i\s+\w+\s+(\w+)?(\s+)?(\w+)?(\s+)?([0-9]{5})?(\s+)?\w+?([0-9]{5})?','<address>')

    return df

My dataframe:

       A          B                                                                C
  French      house                      I live in 15 rue Louis Philippe 75001 Neuilly
 English      house               my address: 101-102 bd Charles de Gaulle 75001 Paris
  French  apartment                                                    my name is Liam
  French      house                                                       Hello George!
 English  apartment  This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it

When I run this, nothing happens: column C is unchanged.

Expected output:

       A          B                         C
  French      house                                I live in <address>
 English      house                              my address: <address>
  French  apartment                                    my name is Liam
  French      house                                       Hello George!
 English  apartment  This is wrong: <address> and I'm not happy with it
  • Nothing happens because the loop variable i, which holds each element of liste_adresses, is written literally inside the regex you define, '[0-9]+(,|\s+)i\s+\...', so the pattern looks for the letter i and not for its value (for example 'allée'). It should be something like '[0-9]+(,|\s+)' + i + '\s+\...'; then something happens, although it is still not the expected output. Commented Nov 23, 2018 at 14:35
  • In your full data, do the strings in column C always end with the address? In other words, could there be more characters afterwards, as in "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it"? Commented Nov 23, 2018 at 14:38
  • @Ben.T Not necessarily, I'll edit my dataframe. Thank you. Commented Nov 23, 2018 at 14:39
  • OK, then it becomes a more difficult task in my opinion. Do you have a list of cities, similar to liste_adresses? Or are there too many cities in your data? Commented Nov 23, 2018 at 14:52
  • Many cities in my data :( Commented Nov 23, 2018 at 14:54

1 Answer


The following solution may not work in every case. Because the address ends with either the postal code or a city name that you don't know in advance, one approach is to look for:

  1. one or more digits at the start, [0-9]+: every address starts with a number
  2. any characters, (.*): for example to catch the -102 in 101-102
  3. any word from liste_adresses, built with '|'.join(liste_adresses)
  4. any characters again, (.*), for the street name, then the 5-digit postal code [0-9]{5}
  5. the city name, if there is one, with ([^\.|\n]{0,2}[A-Z][a-z]*)*: here I assume that a dot or a newline after the postal code means the address is over. The group matches between 0 and 2 characters that are not a dot or a newline, [^\.|\n]{0,2} (inside the brackets the | is a literal character, so a pipe is excluded as well), then one uppercase letter [A-Z], then lowercase letters [a-z]* up to the end of the word. The trailing * repeats the group, which catches cities made of two capitalized words such as Saint-Denis.
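To see what step 5 does on its own, here is a small sketch using Python's re module. The sample strings and the names tail, samples and matched are made up for illustration; only the pattern comes from the steps above.

```python
import re

# Postal code followed by the optional city-name group from step 5
tail = re.compile(r'[0-9]{5}([^\.|\n]{0,2}[A-Z][a-z]*)*')

# Hypothetical sample strings, not taken from the real data
samples = [
    "75001 Neuilly",
    "75001 Saint-Denis. Thanks",
    "75014 and I'm not happy",
]

# group(0) is the whole matched span for each sample
matched = [tail.search(s).group(0) for s in samples]
print(matched)
# ['75001 Neuilly', '75001 Saint-Denis', '75014']
```

The lowercase "and" after 75014 is not consumed because each repetition of the group needs an uppercase letter within its first three characters.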

Putting it together:

liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

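# both halves are raw strings so that \. and \n reach the regex engine unchanged
reg = (r'[0-9]+(.*)(' + '|'.join(liste_adresses)
       + r')(.*)[0-9]{5}([^\.|\n]{0,2}[A-Z][a-z]*)*')

# regex=True is needed on pandas 2.0 and later, where str.replace matches literally by default
print(df['C'].str.replace(reg, '<address>', regex=True))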
0                                  I live in <address>
1                                my address: <address>
2                                      my name is Liam
3                                        Hello George!
4    This is wrong: <address> and I'm not happy wit...
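
For anyone who wants to check the pattern without a dataframe, the sketch below applies the same expression to the five strings from the question using only the re module. The names street_types, pattern, mask_addresses, rows and masked are hypothetical, and the capturing groups are swapped for non-capturing ones (?:...), which does not change what is matched.

```python
import re

liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

# re.escape is a precaution in case an entry ever contains a regex metacharacter
street_types = '|'.join(re.escape(word) for word in liste_adresses)

# number, anything, street type, anything, postal code, optional city words
pattern = re.compile(
    r'[0-9]+.*(?:' + street_types + r').*[0-9]{5}'
    r'(?:[^.|\n]{0,2}[A-Z][a-z]*)*'
)

def mask_addresses(text):
    """Replace every match of pattern in text with <address>."""
    return pattern.sub('<address>', text)

# Column C from the question
rows = [
    'I live in 15 rue Louis Philippe 75001 Neuilly',
    'my address: 101-102 bd Charles de Gaulle 75001 Paris',
    'my name is Liam',
    'Hello George!',
    "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it",
]

masked = [mask_addresses(row) for row in rows]
for line in masked:
    print(line)
```

Note that both .* parts are greedy and the short entries such as 'av' are not anchored to word boundaries, so text with several numbers or with words containing those letters may be matched more widely than intended.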