Finding an unknown pattern in a string python

Question

I am well aware of following question which also exists on stack overflow String Unknown pattern Matching but the answer there doesn't really work for me.

My problem is next. I get a string of characters e.g

'1211' and what I need to do is see that 1 is most often repeated and this 2 times in a row.

But it can also be "121212112" where 12 is repeated 3 times in a row.

But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.

here are some results I like to get (the only numbers ever used are 1 and 2's)

>>> counter('1211')
1
>>> counter('1212')
2
>>> counter('21212')
2

the outcome I want is how many times it occurs.

I have no idea how to even start looking for a pattern since it is not known on forehand and I did some research online and don't find anything similar.

Does anyone have any idea how I even start to tackle this problem ? All help is welcome and if you want more information don't hesitate to let me know.

Can you describe the exact behaviour you want? For example, why is “12121212” to be treated as “1212” twice in a row, rather than as “12” four times or “12121212” once? — Ben
– Ben, Commented Sep 1, 2018 at 13:17
Are you always looking to get the longest substring? Let's take 12121212, can that not be 12 four times? — Paolo
– Paolo, Commented Sep 1, 2018 at 13:24
Also, in order to implement something reasonable, you will have to know some contraints, e.g. the length of the longest expected input string, the longest substring looked for. — zvone
– zvone, Commented Sep 1, 2018 at 13:29
so this is a misunderstanding from my side (i am sorry i fix he question right now ) so if i got 12121221221 i need the 221 as sub string so in that sense i need the longest but indeed when it comes to 12121212 4 times 12 is ok (i am working for and exercise I am sorry for the misunderstanding ) — Jirka
– Jirka, Commented Sep 1, 2018 at 13:40

Jirka · Accepted Answer · 2018-09-01 14:16:46Z

2

Really inefficient, but you can

find all substrings (https://stackoverflow.com/a/22470047/264596)
put them into a set to avoid duplicates
for each of the substring, find all its occurrences - and use some function to find the max (I am not sure how you choose between short strings occurring many times and long strings occurring few times)

Obviously you can use some datastructure to pass through the string once and do some counting on the way, but since I am not sure what your constraints and desired output is, I can give you only this.

answered Sep 1, 2018 at 14:16

Jirka

4,21432 silver badges40 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Jirka Over a year ago

what i need if i use the first function you recommended (thanks for that) is that i need the duplicates. because i need to get the longest that is repeated. The idea of a for loops sparks to mind for this but I am not sure if that is a huge detour or not

Jirka Over a year ago

@LibertyVerleysen Yes, you need duplicates, but not in step 2. In step 3, you will get for each unique substring, how many times it repeats and you can choose which one is the best (longer, more common, some balance between those).

Jirka Over a year ago

ok i get that but my only problem now is that i have for example "1211" so 1 is repeated 2 times. but if i count 1 i get 3. is there a elegant way of solving this ?

Schalton · Accepted Answer · 2018-09-02 21:37:14Z

I agree with Jirka, not sure how you score long vs short to select the optimal results but this function will give you the menu:

#Func1
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        for start in range(u_start):
            c_str = string[start:i+start+1]

            if c_str in combos:
                combos[c_str] += 1
            else:
                combos[c_str] = 1
    return combos

sub_string_cts('21212')

{'2': 3,
 '1': 2,
 '21': 2,
 '12': 2,
 '212': 2,
 '121': 1,
 '2121': 1,
 '1212': 1,
 '21212': 1}

After your comment I think this is more what you're looking for:

#Func2
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        substrs = set([string[start:i+start+1] for start in range(u_start)])
        for substring in substrs:
            combos[substring]  = max([len(i) for i in re.findall("((?:{})+)".format(substring), string)])//len(substring)

    return combos

sub_string_cts('21212')

{'2': 1,
 '1': 1,
 '21': 2,
 '12': 2,
 '212': 1,
 '121': 1,
 '2121': 1,
 '1212': 1,
 '21212': 1}

You could narrow that down to the 'best' candidates by collapsing on the highest occuring instance of each string length:

def max_by_len(result_dict):
    results = {}
    for k, v in result_dict.items():
        if len(k) not in results:
            results[len(k)] = {}

    for c_len in [ln for ln in results]:
        len_max_count = max([v for (k, v) in result_dict.items() if len(k) == c_len])
        for k,v in result_dict.items():
            if len(k) == c_len:
                if v == len_max_count:
                    results[c_len][k] = v
    return results

#Func1:
max_by_len(sub_string_cts('21212'))

{1: {'2': 3},
 2: {'21': 2, '12': 2},
 3: {'212': 2},
 4: {'2121': 1, '1212': 1},
 5: {'21212': 1}}

#Func2:
max_by_len(sub_string_cts('21212'))

{1: {'2': 1, '1': 1},
 2: {'21': 2, '12': 2},
 3: {'212': 1, '121': 1},
 4: {'2121': 1, '1212': 1},
 5: {'21212': 1}}

Assuming we wouldn't select '2121' or '1212' because their occurrence matches '21212' and they're shorter in length, and that similarly we wouldn't select '21' or '12' as they occur at the same frequency as '212' we could limit our viable candidates down to '2', '212', and '21212' with the following code:

def remove_lesser_patterns(result_dict):
    len_lst = sorted([k for k in result_dict], reverse=True)
    #len_lst = sorted([k for k in max_len_results])
    len_crosswalk = {i_len: max([v for (k,v) in result_dict[i_len].items()]) for i_len in len_lst}

    for i_len in len_lst[:-1]:
        eval_lst = [i for i in len_lst if i < i_len]

        for i in eval_lst:
            if len_crosswalk[i] <= len_crosswalk[i_len]:
                if i in result_dict:
                    del result_dict[i]
    return result_dict

#Func1
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))

{1: {'2': 3}, 3: {'212': 2}, 5: {'21212': 1}}

#Func2
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))

{2: {'21': 2, '12': 2}, 5: {'21212': 1}}

results:

test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns(max_by_len(sub_string_cts(string)))
    print("<Output: {}\n".format(c_answer))

<Input: '1211'
<Output: {1: {'1': 2}, 4: {'1211': 1}}
# '1' is repeated twice

<Input: '1212'
<Output: {2: {'12': 2}, 4: {'1212': 1}}
# '12' is repeated twice

<Input: '21212'
<Output: {2: {'21': 2, '12': 2}, 5: {'21212': 1}}
# '21' and '12' are both repeated twice

<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice

These functions together give you the highest occurrence of each pattern by length. The key for the dictionary is the length, with a sub-dictionary of the highest (multiple if tied) occuring patterns.

Func2 requires the patterns be sequential, whereas Func1 does not -- it is strictly occurrence based.

Note:

With your example:

3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.

the code solves this ambiguity in your desired output (2 or 3) by giving you both:

<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice

If you're only interested in the 2 char lengths you can easily pull those out of the max_by_len results as follows:

test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns({k:v for (k,v) in max_by_len(sub_string_cts(string)).items() if k == 2})
    print("<Output: {}\n".format(max([v for (k,v) in c_answer[2].items()])))
#Func2
<Input: '1211'
<Output: 1

<Input: '1212'
<Output: 2

<Input: '21212'
<Output: 2

<Input: '12221221'
<Output: 1

I think i fail to express myself. i only need the patterns which follow each other . since is never followed by another this can never be a pattern. As for the function all i got is following assignment "Write a function that takes a string. The function must return the length of the longest doubled suffix of this string."
updated, I think func2 using regex is your desired output @LibertyVerleysen
At this point I've answered ever combination of the ambiguous things in your question. #AcceptIt

Collectives™ on Stack Overflow

Finding an unknown pattern in a string python

2 Answers 2

3 Comments

Note:

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Note:

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related