Substring search -- Rabin-Karp algorithm

The Rabin-Karp algorithm is a hash-based substring search algorithm. It first computes the hash value of the pattern string, then uses the same hash function to compute the hash value of every M-character substring of the text and compares each one with the pattern's hash. When the two hash values are equal, it goes on to verify that the substrings actually match.

Basic idea: a string of length M corresponds to an M-digit number in base R, and that number is hashed by taking its remainder modulo Q.

An example illustrates the Rabin-Karp algorithm.

To find the pattern 26535 in the text 3141592653589793, first choose the hash table size Q (997 here) and hash by taking the remainder after division. The pattern's hash value is 26535 % 997 = 613. Then compute the hash value of every 5-digit substring of the text and look for one whose hash is also 613.
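The example above can be sketched directly in code. This is only an illustration under stated assumptions: the class name ModularHashExample and its methods are made up for this sketch, the characters are treated as decimal digits (R = 10), and every window is rehashed from scratch, so it is the slow brute-force version rather than the constant-time shift described next.

```java
class ModularHashExample {
    static final int R = 10;   // assumption: the text consists of decimal digits
    static final int Q = 997;  // hash table size used in the example

    // Hash of the m digits starting at position start, reduced modulo Q at every step.
    static int hash(String s, int start, int m) {
        int h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + (s.charAt(start + j) - '0')) % Q;
        return h;
    }

    // Index of the first window whose hash equals the pattern's hash
    // and whose characters also match, or -1 if there is none.
    static int find(String pat, String txt) {
        int m = pat.length();
        int patHash = hash(pat, 0, m);
        for (int i = 0; i + m <= txt.length(); i++) {
            if (hash(txt, i, m) == patHash && txt.startsWith(pat, i))
                return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String txt = "3141592653589793";
        String pat = "26535";
        System.out.println(hash(pat, 0, pat.length())); // prints 613
        System.out.println(find(pat, txt));             // prints 6
    }
}
```

The first window, 31415, hashes to 508, so it is skipped without comparing any characters; the window at index 6 is the first one that hashes to 613.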

Key idea: the Rabin-Karp algorithm relies on efficiently computing the hash of the substring at position i+1 from the hash of the substring at position i, for every position i in the text. Let t_i be the character at position i and x_i the M-digit base-R number formed by t_i ... t_(i+M-1), and assume h(x_i) = x_i mod Q is known. Moving the window one position to the right replaces x_i with x_(i+1) = (x_i - t_i * R^(M-1)) * R + t_(i+M): subtract the contribution of the leading digit, multiply by R, and add the new trailing digit. Because every step can be carried out modulo Q, each shift takes constant time whether M is 5, 100, or 1000.
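Written out as equations, the shift might look as follows. The symbol t_i for the text character at position i is notation assumed here for illustration. The last three lines work through one step of the example text, from the window 41592 to the window 15926, with R = 10 and Q = 997.

```latex
\begin{align*}
x_i &= t_i R^{M-1} + t_{i+1} R^{M-2} + \cdots + t_{i+M-1} R^{0} \\
x_{i+1} &= \left(x_i - t_i R^{M-1}\right) R + t_{i+M} \\
h(x_{i+1}) &= \Bigl(\bigl(h(x_i) + Q - (t_i R^{M-1} \bmod Q)\bigr) R + t_{i+M}\Bigr) \bmod Q \\[4pt]
R^{4} \bmod 997 &= 10000 \bmod 997 = 30 \\
h(41592) &= 41592 \bmod 997 = 715 \\
h(15926) &= \bigl((715 + 997 - 4 \cdot 30) \cdot 10 + 6\bigr) \bmod 997 = 971
\end{align*}
```

Adding Q before subtracting keeps the intermediate value non-negative, which matters in Java because the % operator can return a negative result for a negative left operand.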

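Computing the hash function: a 5-digit number fits in an int and can be hashed directly, but when M is 100 or 1000 the number is far too large for that. Horner's method handles this case by taking the remainder after processing each digit, so the intermediate result never grows beyond a small multiple of Q. The calculation is as follows: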

private long hash(String key, int m) {   
    long h = 0; 
    for (int j = 0; j < m; j++) 
        h = (R * h + key.charAt(j)) % q;
    return h;
}

Search implementations: there are two representative variants, the Monte Carlo method and the Las Vegas method.

The Monte Carlo method chooses a very large Q so that hash collisions are extremely unlikely, and treats equal hash values as a successful match. A false match is still possible, but its probability is tiny.

The Las Vegas method compares the characters whenever the hash values are equal. It is slightly less efficient than the Monte Carlo method, but it guarantees correctness.
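The difference between the two variants is easiest to see when Q is deliberately small. The sketch below is an assumption-laden illustration, not the implementation that follows: the class VerificationModes and the lasVegas flag are hypothetical names, the characters are treated as decimal digits, Q is fixed at 997 so that a collision is easy to construct, and "not found" is reported as -1.

```java
class VerificationModes {
    static final int R = 10;    // assumption: decimal digits only
    static final long Q = 997;  // deliberately tiny so that collisions occur

    // Horner's method over the first m digits of s.
    static long hash(String s, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;
        return h;
    }

    // lasVegas == true:  confirm every hash hit character by character.
    // lasVegas == false: Monte Carlo, trust the hash hit as a match.
    static int search(String pat, String txt, boolean lasVegas) {
        int m = pat.length(), n = txt.length();
        if (n < m) return -1;
        long rm = 1;                                 // R^(m-1) mod Q
        for (int i = 1; i < m; i++) rm = (rm * R) % Q;
        long patHash = hash(pat, m);
        long txtHash = hash(txt, m);
        for (int i = 0; ; i++) {
            if (txtHash == patHash && (!lasVegas || txt.startsWith(pat, i)))
                return i;
            if (i + m >= n) return -1;               // no further window to examine
            // Drop the leading digit, then append the next digit.
            txtHash = (txtHash + Q - rm * (txt.charAt(i) - '0') % Q) % Q;
            txtHash = (txtHash * R + (txt.charAt(i + m) - '0')) % Q;
        }
    }

    public static void main(String[] args) {
        // 27532 = 26535 + 997, so it collides with the pattern modulo 997.
        System.out.println(search("26535", "3141527532", false)); // prints 5 (false match)
        System.out.println(search("26535", "3141527532", true));  // prints -1
    }
}
```

With a random 31-bit prime in place of 997, such a collision becomes very unlikely, which is why the Monte Carlo variant is usable in practice.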

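import java.math.BigInteger;
import java.util.Random;

public class RabinKarp {
    private String pat;      // the pattern (needed only for the Las Vegas check)
    private long patHash;    // hash value of the pattern
    private int m;           // pattern length
    private long q;          // a large prime, small enough to avoid long overflow
    private int R;           // radix (alphabet size)
    private long RM;         // R^(m-1) % q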

    public RabinKarp(String pat) {
        this.pat = pat;
        R = 256;
        m = pat.length();
        q = longRandomPrime();

        RM = 1;
        for (int i = 1; i <= m-1; i++)
            RM = (R * RM) % q;
        patHash = hash(pat, m);
    } 

    private long hash(String key, int m) { 
        long h = 0; 
        for (int j = 0; j < m; j++) 
            h = (R * h + key.charAt(j)) % q;
        return h;
    }

    private boolean check(String txt, int i) {
        for (int j = 0; j < m; j++) 
            if (pat.charAt(j) != txt.charAt(i + j)) 
                return false; 
        return true;
    }

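    // Returns the index of the first occurrence of pat in txt,
    // or n (the length of txt) if there is no match.
    public int search(String txt) {
        int n = txt.length(); 
        if (n < m) return n;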
        long txtHash = hash(txt, m); 

        if ((patHash == txtHash) && check(txt, 0))
            return 0;

        for (int i = m; i < n; i++) {
            // Remove the leading character, then append the trailing character.
            txtHash = (txtHash + q - RM*txt.charAt(i-m) % q) % q; 
            txtHash = (txtHash*R + txt.charAt(i)) % q; 

            int offset = i - m + 1;
            if ((patHash == txtHash) && check(txt, offset))
                return offset;
        }
        return n;
    }
    
    // Returns a random 31-bit prime (prime with high probability).
    private static long longRandomPrime() {
        BigInteger prime = BigInteger.probablePrime(31, new Random());
        return prime.longValue();
    }
}

Posted by Naki-BoT on Thu, 30 Apr 2020 01:42:41 -0700