[excellent research question] [rmq] [suffix array] Maximum repetition substring POJ3693

Keywords: Algorithm string RMQ

The repetition number of a string is defined as the maximum number R such that the string can be partitioned into R same consecutive substrings. For example, the repetition number of "ababab" is 3 and "ababa" is 1.
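
As a quick illustration of the definition, the repetition number of a single string can be derived from its smallest period. The sketch below is not part of the solution in this post: it swaps in the KMP prefix function, and the name repetitionNumber is made up for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): repetition number of s,
// computed from the smallest period via the KMP prefix function.
int repetitionNumber(const std::string& s) {
    int n = static_cast<int>(s.size());
    if (n == 0) return 0;
    std::vector<int> pi(n, 0);  // pi[i] = length of the longest proper border of s[0..i]
    for (int i = 1; i < n; ++i) {
        int k = pi[i - 1];
        while (k > 0 && s[i] != s[k]) k = pi[k - 1];
        if (s[i] == s[k]) ++k;
        pi[i] = k;
    }
    int period = n - pi[n - 1];  // smallest period of s
    // s splits into equal blocks only if the smallest period divides n
    return (n % period == 0) ? n / period : 1;
}
```

With this helper, repetitionNumber("ababab") gives 3 and repetitionNumber("ababa") gives 1, matching the two examples in the statement.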

Given a string containing lowercase letters, you are to find a substring of it with maximum repetition number.

Input

The input consists of multiple test cases. Each test case contains exactly one line, which gives a non-empty string consisting of lowercase letters. The length of the string will not be greater than 100,000.

The last test case is followed by a line containing a '#'.

Output

For each test case, print a line containing the test case number (beginning with 1) followed by the substring of maximum repetition number. If there are multiple substrings of maximum repetition number, print the lexicographically smallest one.

Sample Input

ccabababc
daabbccaa
#

Sample Output

Case 1: ababab
Case 2: aa

Meaning:   Find the substring whose repeating unit repeats consecutively the most times. In sample 1, ab repeats three times in ababab, so ababab has the maximum repetition number, 3. In sample 2, a repeats twice in aa, so aa has the maximum repetition number, 2. If several substrings tie for the maximum repetition number, output the lexicographically smallest one.
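
For very short strings the task can be checked by brute force: try every substring, compute its repetition number, and keep the best one, breaking ties lexicographically. The following is only a reference sketch for cross-checking small cases (it is far too slow for a length of 100,000), and the names repetitionNumberNaive and bruteForceAnswer are hypothetical.

```cpp
#include <string>

// Hypothetical helper: repetition number of t, found by trying every
// block length p that divides the length of t, smallest first.
static int repetitionNumberNaive(const std::string& t) {
    int n = static_cast<int>(t.size());
    for (int p = 1; p <= n; ++p) {
        if (n % p != 0) continue;
        bool ok = true;
        for (int i = p; i < n && ok; ++i) ok = (t[i] == t[i - p]);
        if (ok) return n / p;  // smallest valid block length gives the most blocks
    }
    return 1;
}

// Hypothetical reference solver: examines every substring, so it is only
// meant for short test strings.
std::string bruteForceAnswer(const std::string& s) {
    int n = static_cast<int>(s.size());
    int best = 0;
    std::string answer;
    for (int i = 0; i < n; ++i) {
        for (int len = 1; i + len <= n; ++len) {
            std::string t = s.substr(i, len);
            int r = repetitionNumberNaive(t);
            // Prefer a higher repetition number, then the lexicographically smaller string.
            if (r > best || (r == best && t < answer)) {
                best = r;
                answer = t;
            }
        }
    }
    return answer;
}
```

Comparing its output with the suffix array solution on random short strings is one way to catch mistakes in the tie-breaking step.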

Analysis:   This is a classic application of the suffix array. The main algorithm only looks for substrings whose repeating unit occurs more than once; a repetition number of 1 is handled separately. Enumerate the length len of the repeating unit. For each len, also enumerate a position i, but not every position from 1 to n: only the positions 1, 1+len, 1+2*len, ..., otherwise the solution times out. Sampling is enough because a substring made of at least two copies of a unit of length len is at least 2*len long, so one of the sampled positions falls inside its first len characters. With this sampling the total number of (len, i) pairs is n + n/2 + n/3 + ... = O(n log n). Let t be the longest common prefix (LCP) of suffix i and suffix i+len. Then t/len + 1 (integer division) is the number of times a unit of length len repeats starting at i.

This alone does not always find the maximum, because the best substring may start slightly before a sampled position. The sign of that is a nonzero remainder t % len. Take xbcabcab with len = 3. The first sampled position is 1: comparing the suffixes xbcabcab and abcab gives an LCP of 0, so there is no remainder and the count is 0/3 + 1 = 1. The second sampled position is 4: comparing the suffixes abcab and ab gives an LCP of 2, so the remainder is 2. The start can then be moved back by len - t % len = 1 position, to 3, where the repeating unit is cab. Comparing the suffixes cabcab and cab gives an LCP of 3 >= len, so the count rises to 2 and the answer is updated.
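
The enumeration described above can be sketched without a suffix array by computing each LCP with a direct character comparison. This is a simplified illustration rather than the submitted solution: it is 0-indexed, naiveLcp and maxRepetitions are made-up names, and the naive LCP makes it too slow for large inputs.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: LCP of the suffixes starting at a and b (0-indexed),
// by direct comparison. The real solution answers this with a suffix array + RMQ.
static int naiveLcp(const std::string& s, int a, int b) {
    int n = static_cast<int>(s.size());
    int k = 0;
    while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
    return k;
}

// Maximum repetition number over all substrings of s (returns 1 if nothing repeats).
int maxRepetitions(const std::string& s) {
    int n = static_cast<int>(s.size());
    int best = 1;
    for (int len = 1; len <= n; ++len) {           // length of the repeating unit
        for (int j = 0; j + len < n; j += len) {   // sampled start positions
            int t = naiveLcp(s, j, j + len);
            int reps = t / len + 1;
            int rem = t % len;
            if (rem != 0) {
                // Move the start back so the partial unit at the end is completed.
                int pos = j - (len - rem);
                if (pos >= 0 && naiveLcp(s, pos, pos + len) >= len) ++reps;
            }
            best = std::max(best, reps);
        }
    }
    return best;
}
```

On the example from the analysis, maxRepetitions("xbcabcab") gives 2: the sampled position holding abcab is pushed back one character, to the start of cabcab.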

After the maximum repetition count is known, the answer must be the lexicographically smallest substring that reaches it. During the enumeration above, record every unit length that reaches the maximum count, then scan the sa array in order. Because sa lists the suffixes in lexicographic order, the first starting position at which some recorded length yields the maximum count gives the answer, and the scan can stop there.
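
The selection step can be sketched as follows, assuming the maximum count and the unit lengths that reach it are already known. This is an illustration only: a naive suffix sort stands in for the sa array, the periodicity check is done character by character instead of through an LCP query, indices are 0-based, and smallestWithReps is a hypothetical name.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper: lexicographically smallest substring of s made of
// `reps` copies of a unit whose length is one of `lens`.
std::string smallestWithReps(const std::string& s, int reps, std::vector<int> lens) {
    int n = static_cast<int>(s.size());
    // Try shorter units first: at a fixed start, the shorter candidate is a
    // prefix of the longer one and therefore sorts before it.
    std::sort(lens.begin(), lens.end());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Naive suffix sort (slow); it plays the role of the sa array.
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    for (int start : order) {
        for (int len : lens) {
            int total = len * reps;
            if (start + total > n) continue;
            // The window has period len iff s[k] == s[k + len] throughout it.
            bool ok = true;
            for (int k = start; k + len < start + total; ++k) {
                if (s[k] != s[k + len]) { ok = false; break; }
            }
            if (ok) return s.substr(start, total);  // first hit in suffix order wins
        }
    }
    return "";
}
```

For the two samples, smallestWithReps("ccabababc", 3, {2}) gives ababab and smallestWithReps("daabbccaa", 2, {1}) gives aa.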

The code is as follows:

#include <iostream>
#include <cstdio>
#include <cstring>
#include <utility>
#include <algorithm>
#include <cmath>
using namespace std;
//The enumeration makes n+n/2+n/3+...+1 = O(n log n) LCP queries
const int maxn = 1e6+10;
int n, m;
char s[maxn];
int sa[maxn], height[maxn], x[maxn], y[maxn], rk[maxn], tong[maxn], a[maxn][20], stack[maxn], top;
//Build the suffix array sa[1..n] by prefix doubling with radix sort
void get_sa()
{
	for(int i = 0; i <= m; i++) tong[i] = 0;
	for(int i = 0; i <= 2*n; i++) y[i] = x[i] = 0;
    for(int i = 1; i <= n; i++) tong[x[i] = s[i]] ++;
    for(int i = 2; i <= m; i++) tong[i] += tong[i-1];
    for(int i = n; i; i--) sa[tong[x[i]]--] = i;
    for(int k = 1; k <= n; k <<= 1) 
	{
        int num = 0;
        for(int i = n-k+1; i <= n; i++) y[++num] = i;
        for(int i = 1; i <= n; i++) 
		{
            if(sa[i] <= k) continue;
            y[++num] = sa[i] - k;
        }
        for(int i = 0; i <= m; i++) tong[i] = 0;
        for(int i = 1; i <= n; i++) tong[x[i]]++;
        for(int i = 2; i <= m; i++) tong[i] += tong[i-1];
        for(int i = n; i; i--) sa[tong[x[y[i]]]--] = y[i], y[i] = 0;
    	for(int i = 0; i <= 2*num; i++)
    	{
    		int temp = x[i];
    		x[i] = y[i];
    		y[i] = temp;
		}
        x[sa[1]] = 1, num = 1;
        for(int i = 2; i <= n; i++) 
            x[sa[i]] = (y[sa[i]] == y[sa[i-1]] && y[sa[i] + k] == y[sa[i-1] + k]) == 1 ? num : ++ num;
        if(n == num) return;
        m = num;
    }
}

//height[i] = LCP of the suffixes ranked i-1 and i
void get_height()
{
    for(int i = 1; i <= n; i++) rk[sa[i]] = i;
    for(int i = 1, k = 0; i <= n; i++) 
	{
        if(rk[i] == 1) continue;
        if(k) k--;
        int j = sa[rk[i]-1];
        while(i + k <= n && j + k <= n && s[i+k] == s[j+k]) k++;
        height[rk[i]] = k;
    }
}

//Sparse table over height: a[i][j] = min of height[i .. i+2^j-1]
void st()
{
	for(int i = 1; i <= n; i++)
		a[i][0] = height[i];
	for(int j = 1; j < 20; j++)
		for(int i = 1; i+(1<<j)-1 <= n; i++)
			a[i][j] = min(a[i][j-1], a[i+(1<<(j-1))][j-1]);
}

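//LCP of the suffixes ranked l and r (l != r): min of height over (min(l,r), max(l,r)]
int query(int l, int r)
{
	if(l > r)
		swap(l, r);
	l++;
	int t = 0;//floor(log2(r-l+1)), computed with integers to avoid floating-point rounding
	while((1 << (t+1)) <= r-l+1)
		t++;
	return min(a[l][t], a[r-(1<<t)+1][t]);
}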

void solve(int &res) 
{ 
    get_sa();
    get_height();
    st();
//    for(int i = 1; i <= n; i++)
//    	cout << height[i] << ' ';
//    putchar('\n');
	//This pass finds only the maximum repetition count; the answer string is selected afterwards
    int times = 0;//Maximum of (repetition count - 1) found so far
    top = 0;
    for(int i = 1; i <= n; i++)//Enumerate the length of the repeating unit
	{
		for(int j = 1; j+i <= n; j+=i)//Enumerate sampled start positions 1, 1+i, 1+2*i, ...
		{
			int t = query(rk[j], rk[j+i]);//LCP of suffix j and suffix j+i
			int ans = t/i;
			if(t%i)//Nonzero remainder: a start slightly before j may give one more repetition
			{
				int pos = j-(i-t%i);
				if(pos >= 1 && query(rk[pos], rk[pos+i]) >= i)
					ans++;
			}
			if(ans > times)
			{
				times = ans;
				top = 0;
				stack[++top] = i;
			}
			else if(ans == times && (top == 0 || stack[top] != i))//Record each length once, so stack holds at most n entries
				stack[++top] = i;
		}
	} 
	int len = -1;
	for(int i = 1; i <= n; i++)//Scan the suffixes in lexicographic order
	{
		for(int j = 1; j <= top; j++)
		{
			if(sa[i]+stack[j]*(times+1)-1 <= n)
			{
				if(query(i, rk[sa[i]+stack[j]]) >= stack[j]*times)
				{
					res = sa[i];
					len = stack[j]; 
					break;
				}
			}
		}
		if(len != -1)
			break;
	}
	times++;//Convert from (count - 1) to the actual repetition count
	if(times == 1)//The scan above only handles counts >= 2; for a count of 1, output the smallest single character
	{
		char ch = 'z';
		for(int i = 1; i <= n; i++)
		{
			if(ch >= s[i])
			{
				ch = s[i];
				res = i;
			}
		}
		s[res+1] = '\0';
	}
	else
		s[res+len*times] = '\0';
}

signed main()
{
	int cnt = 0;
	while(~scanf("%s", s+1)) 
	{
		if(s[1] == '#')
			break;
		n = strlen(s+1); 
		m = 200;//m is the initial alphabet bound for the radix sort; get_sa overwrites it, so reset it for every test case
		int res;
		solve(res);
		printf("Case %d: %s\n", ++cnt, s+res); 
	}
    return 0;
}
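
The st and query routines above rely on the fact that the LCP of two suffixes equals the minimum of height over the ranks between them, which is a range-minimum query. As a standalone sketch of that data structure (0-indexed, closed ranges, with a precomputed integer log table; SparseMin is a made-up name, not something from the code above):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical standalone sparse table for range-minimum queries.
struct SparseMin {
    std::vector<std::vector<int>> table;  // table[j][i] = min of v[i .. i + 2^j - 1]
    std::vector<int> lg;                  // lg[x] = floor(log2(x)) for x >= 1

    explicit SparseMin(const std::vector<int>& v) {
        int n = static_cast<int>(v.size());
        lg.assign(n + 1, 0);
        for (int i = 2; i <= n; ++i) lg[i] = lg[i / 2] + 1;
        table.push_back(v);
        for (int j = 1; (1 << j) <= n; ++j) {
            table.push_back(std::vector<int>(n - (1 << j) + 1));
            for (int i = 0; i + (1 << j) <= n; ++i)
                table[j][i] = std::min(table[j - 1][i],
                                       table[j - 1][i + (1 << (j - 1))]);
        }
    }

    // Minimum over the closed range [l, r]; requires 0 <= l <= r < v.size().
    int query(int l, int r) const {
        int j = lg[r - l + 1];
        // Two overlapping blocks of length 2^j cover the whole range.
        return std::min(table[j][l], table[j][r - (1 << j) + 1]);
    }
};
```

Building the table takes O(n log n) time and each query is O(1), which is what keeps the O(n log n) enumeration of LCP queries within the time limit.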

Posted by ereptur on Fri, 22 Oct 2021 01:05:13 -0700