3

I have a large text and the aim is to select all 10-character strings for which the first character is a letter and the last character is a digit.

I am a Python rookie, and so far I have only managed to find all 10-character strings:

ten_char = re.findall(r"\D(\w{10})\D", pdfdoc)

The question is how to add my other conditions: besides being 10 characters long, the string must start with a letter and end with a digit.

Suggestions appreciated!

1
  • You can use [A-Za-z] and [0-9] to tell it the character at this position should be an alphabetical character or a digit. Commented Sep 9, 2016 at 21:55

4 Answers

2

([a-z].{8}[0-9])

This asks for 1 alphabetical character, then 8 characters of any kind, and finally 1 digit.

JS Demo

var re = /([a-z].{8}[0-9])/gi;
var str = 'Aasdf23423423423423423b423423423423423';
var m;

while ((m = re.exec(str)) !== null) {
    // Guard against an infinite loop on zero-length matches.
    if (m.index === re.lastIndex) {
        re.lastIndex++;
    }
    console.log(m[0]);
}

https://regex101.com/r/gI8jZ4/1
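
Since the question is about Python, here is a rough sketch of the same demo using the `re` module. It assumes `re.IGNORECASE` as the stand-in for the JS `/i` flag and `finditer` as the stand-in for the `exec` loop; the variable names are only illustrative.

```python
import re

# Same pattern as the JS demo; re.IGNORECASE plays the role of the /i flag.
pattern = re.compile(r"[a-z].{8}[0-9]", re.IGNORECASE)

text = "Aasdf23423423423423423b423423423423423"

# finditer walks the non-overlapping matches from left to right,
# much like the exec loop above.
matches = [m.group(0) for m in pattern.finditer(text)]

for match in matches:
    print(match)
```

Note that `.` accepts any character except a newline, so this pattern will also pick up substrings inside longer runs of text.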


3 Comments

Don't use . in the middle. It can match whitespace. Use \w.
You might want to use \S for all non-whitespace characters (\w only covers letters, digits and underscore), or [a-zA-Z] to include capital letters. And don't forget about non-ASCII.
@DawidGrabowski and Christian, I can edit it, but \w would only include letters, digits and underscore, and I don't see that reflecting the question at this moment. Could you please elaborate?
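
To make the difference between the three candidates for the middle eight characters concrete, here is a small sketch. The sample string is made up for illustration, and the patterns omit word boundaries so that only the middle character class differs.

```python
import re

# Hypothetical sample: one candidate with spaces, one with underscores,
# one with hyphens.
sample = "ab cd efg1 a_b_c_d_e1 a-b-c-d-e1"

# "." accepts anything except a newline, including spaces.
any_char = re.findall(r"[A-Za-z].{8}[0-9]", sample)

# "\w" accepts only letters, digits and underscore.
word_chars = re.findall(r"[A-Za-z]\w{8}[0-9]", sample)

# "\S" accepts anything that is not whitespace.
non_space = re.findall(r"[A-Za-z]\S{8}[0-9]", sample)

print(any_char)    # ['ab cd efg1', 'a_b_c_d_e1', 'a-b-c-d-e1']
print(word_chars)  # ['a_b_c_d_e1']
print(non_space)   # ['a_b_c_d_e1', 'a-b-c-d-e1']
```

Which one is right depends on what counts as a "character" in the source text, which the question leaves open.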
1

If I understand it, do:

r'\b([a-zA-Z]\S{8}\d)\b'


Python demo:

>>> import re
>>> txt="""\
... Should match:
... a123456789 aA34567s89 zzzzzzzer9
... 
... Not match:
... 1123456789 aA34567s8a zzzzzzer9 zzzxzzzze99"""
>>> re.findall(r'\b([a-zA-Z]\S{8}\d)\b', txt)
['a123456789', 'aA34567s89', 'zzzzzzzer9']


0

I wouldn't use regex for this. Regular string manipulation is clearer in my opinion (though I haven't tested the following code).

def get_useful_words(filename):
    with open(filename, 'r') as file:
        for line in file:
            for word in line.split():
                if len(word) == 10 and word[0].isalpha() and word[-1].isdigit():
                    yield word


for useful_word in get_useful_words('tmp.txt'):
    print(useful_word)

3 Comments

@DawidGrabowski Could you please explain the inefficiencies? I'm not caching a regular expression, but I'm also not reading the whole file into memory at one time. The question specified a large text file.
Memory-wise I agree, but that's just going to be slower compared to regex.
Personally, I don't find this more clear. This seems like the perfect use for a regular expression.
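
One possible middle ground between the two positions, sketched under the assumption that the pattern from the regex answer is the one wanted: compile the regex once and still read the input line by line. `CODE_RE` and `iter_codes` are hypothetical names, and the sample lines stand in for an open file.

```python
import re

# Compiled once up front; same pattern as the regex answer in this thread.
CODE_RE = re.compile(r"\b([a-zA-Z]\S{8}\d)\b")


def iter_codes(lines):
    """Yield every match from an iterable of lines (for example an open file)."""
    for line in lines:
        for match in CODE_RE.finditer(line):
            yield match.group(1)


# Stand-in for iterating over a real file object.
sample_lines = ["a123456789 1123456789\n", "zzzzzzer9 aA34567s89\n"]
found = list(iter_codes(sample_lines))
print(found)  # ['a123456789', 'aA34567s89']
```

Because `\S` never matches a newline, a match cannot span two lines, so scanning line by line should find the same strings as scanning the whole text at once.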
0

Thank you very much for a great discussion and interesting suggestions. This is my very first post on Stack Overflow, but wow... what a community you are!

In fact, using:

r'\b([a-zA-Z]\S{8}\d)'

solved my problem very nicely. Really appreciated all your comments.

1 Comment

Be sure to use r'\b([a-zA-Z]\S{8}\d)\b' or you will also match words longer than 10 characters that have a matching prefix...
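
A quick sketch of what the trailing `\b` changes, using a made-up input where the second word is 14 characters long:

```python
import re

# Hypothetical input: a valid 10-character word, then a longer word
# whose first 10 characters happen to fit the pattern.
text = "a123456789 b1234567890123"

without_boundary = re.findall(r"\b([a-zA-Z]\S{8}\d)", text)
with_boundary = re.findall(r"\b([a-zA-Z]\S{8}\d)\b", text)

print(without_boundary)  # ['a123456789', 'b123456789']
print(with_boundary)     # ['a123456789']
```

Without the closing boundary, the first ten characters of the longer word are reported as a match.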
