1

I am well aware of following question which also exists on stack overflow String Unknown pattern Matching but the answer there doesn't really work for me.

My problem is next. I get a string of characters e.g

  1. '1211' and what I need to do is see that 1 is most often repeated and this 2 times in a row.
  2. But it can also be "121212112" where 12 is repeated 3 times in a row.
  3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.

here are some results I like to get (the only numbers ever used are 1 and 2's)

>>> counter('1211')
1
>>> counter('1212')
2
>>> counter('21212')
2

the outcome I want is how many times it occurs.

I have no idea how to even start looking for a pattern since it is not known on forehand and I did some research online and don't find anything similar.

Does anyone have any idea how I even start to tackle this problem ? All help is welcome and if you want more information don't hesitate to let me know.

9
  • 1
    Can you describe the exact behaviour you want? For example, why is “12121212” to be treated as “1212” twice in a row, rather than as “12” four times or “12121212” once? Commented Sep 1, 2018 at 13:17
  • 1
    cs.stackexchange.com/a/79187/62807 might help you? Commented Sep 1, 2018 at 13:19
  • Are you always looking to get the longest substring? Let's take 12121212, can that not be 12 four times? Commented Sep 1, 2018 at 13:24
  • Also, in order to implement something reasonable, you will have to know some contraints, e.g. the length of the longest expected input string, the longest substring looked for. Commented Sep 1, 2018 at 13:29
  • so this is a misunderstanding from my side (i am sorry i fix he question right now ) so if i got 12121221221 i need the 221 as sub string so in that sense i need the longest but indeed when it comes to 12121212 4 times 12 is ok (i am working for and exercise I am sorry for the misunderstanding ) Commented Sep 1, 2018 at 13:40

2 Answers 2

2

Really inefficient, but you can

  1. find all substrings (https://stackoverflow.com/a/22470047/264596)
  2. put them into a set to avoid duplicates
  3. for each of the substring, find all its occurrences - and use some function to find the max (I am not sure how you choose between short strings occurring many times and long strings occurring few times)

Obviously you can use some datastructure to pass through the string once and do some counting on the way, but since I am not sure what your constraints and desired output is, I can give you only this.

Sign up to request clarification or add additional context in comments.

3 Comments

what i need if i use the first function you recommended (thanks for that) is that i need the duplicates. because i need to get the longest that is repeated. The idea of a for loops sparks to mind for this but I am not sure if that is a huge detour or not
@LibertyVerleysen Yes, you need duplicates, but not in step 2. In step 3, you will get for each unique substring, how many times it repeats and you can choose which one is the best (longer, more common, some balance between those).
ok i get that but my only problem now is that i have for example "1211" so 1 is repeated 2 times. but if i count 1 i get 3. is there a elegant way of solving this ?
0

I agree with Jirka, not sure how you score long vs short to select the optimal results but this function will give you the menu:

#Func1
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        for start in range(u_start):
            c_str = string[start:i+start+1]

            if c_str in combos:
                combos[c_str] += 1
            else:
                combos[c_str] = 1
    return combos

sub_string_cts('21212')

{'2': 3,
 '1': 2,
 '21': 2,
 '12': 2,
 '212': 2,
 '121': 1,
 '2121': 1,
 '1212': 1,
 '21212': 1}

After your comment I think this is more what you're looking for:

#Func2
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        substrs = set([string[start:i+start+1] for start in range(u_start)])
        for substring in substrs:
            combos[substring]  = max([len(i) for i in re.findall("((?:{})+)".format(substring), string)])//len(substring)

    return combos

sub_string_cts('21212')

{'2': 1,
 '1': 1,
 '21': 2,
 '12': 2,
 '212': 1,
 '121': 1,
 '2121': 1,
 '1212': 1,
 '21212': 1}

You could narrow that down to the 'best' candidates by collapsing on the highest occuring instance of each string length:

def max_by_len(result_dict):
    results = {}
    for k, v in result_dict.items():
        if len(k) not in results:
            results[len(k)] = {}

    for c_len in [ln for ln in results]:
        len_max_count = max([v for (k, v) in result_dict.items() if len(k) == c_len])
        for k,v in result_dict.items():
            if len(k) == c_len:
                if v == len_max_count:
                    results[c_len][k] = v
    return results

#Func1:
max_by_len(sub_string_cts('21212'))

{1: {'2': 3},
 2: {'21': 2, '12': 2},
 3: {'212': 2},
 4: {'2121': 1, '1212': 1},
 5: {'21212': 1}}

#Func2:
max_by_len(sub_string_cts('21212'))

{1: {'2': 1, '1': 1},
 2: {'21': 2, '12': 2},
 3: {'212': 1, '121': 1},
 4: {'2121': 1, '1212': 1},
 5: {'21212': 1}}

Assuming we wouldn't select '2121' or '1212' because their occurrence matches '21212' and they're shorter in length, and that similarly we wouldn't select '21' or '12' as they occur at the same frequency as '212' we could limit our viable candidates down to '2', '212', and '21212' with the following code:

def remove_lesser_patterns(result_dict):
    len_lst = sorted([k for k in result_dict], reverse=True)
    #len_lst = sorted([k for k in max_len_results])
    len_crosswalk = {i_len: max([v for (k,v) in result_dict[i_len].items()]) for i_len in len_lst}

    for i_len in len_lst[:-1]:
        eval_lst = [i for i in len_lst if i < i_len]

        for i in eval_lst:
            if len_crosswalk[i] <= len_crosswalk[i_len]:
                if i in result_dict:
                    del result_dict[i]
    return result_dict

#Func1
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))

{1: {'2': 3}, 3: {'212': 2}, 5: {'21212': 1}}

#Func2
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))

{2: {'21': 2, '12': 2}, 5: {'21212': 1}}

results:

test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns(max_by_len(sub_string_cts(string)))
    print("<Output: {}\n".format(c_answer))

<Input: '1211'
<Output: {1: {'1': 2}, 4: {'1211': 1}}
# '1' is repeated twice

<Input: '1212'
<Output: {2: {'12': 2}, 4: {'1212': 1}}
# '12' is repeated twice

<Input: '21212'
<Output: {2: {'21': 2, '12': 2}, 5: {'21212': 1}}
# '21' and '12' are both repeated twice

<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice

These functions together give you the highest occurrence of each pattern by length. The key for the dictionary is the length, with a sub-dictionary of the highest (multiple if tied) occuring patterns.

Func2 requires the patterns be sequential, whereas Func1 does not -- it is strictly occurrence based.


Note:

With your example:

3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.

the code solves this ambiguity in your desired output (2 or 3) by giving you both:

<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice

If you're only interested in the 2 char lengths you can easily pull those out of the max_by_len results as follows:

test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns({k:v for (k,v) in max_by_len(sub_string_cts(string)).items() if k == 2})
    print("<Output: {}\n".format(max([v for (k,v) in c_answer[2].items()])))
#Func2
<Input: '1211'
<Output: 1

<Input: '1212'
<Output: 2

<Input: '21212'
<Output: 2

<Input: '12221221'
<Output: 1

5 Comments

I think i fail to express myself. i only need the patterns which follow each other . since is never followed by another this can never be a pattern. As for the function all i got is following assignment "Write a function that takes a string. The function must return the length of the longest doubled suffix of this string."
So it must be consecutive instances?
updated, I think func2 using regex is your desired output @LibertyVerleysen
does "double suffix" mean 2 characters?
At this point I've answered ever combination of the ambiguous things in your question. #AcceptIt

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.