Algorithms Note 13 - String Sorting

Key Index Count
- frequency count
- Convert Frequency to Index
- data classification
- Writeback
Low-bit-first string ordering
High-bit-first string ordering

Many important and familiar problems are string-based: information processing (searching web pages and documents for a given keyword), communication systems (sending text messages and e-mails, downloading e-books), programming systems (programs are made of strings, which compilers or interpreters convert into machine instructions), and genomics (biologists convert DNA into strings composed of the four characters A, C, T, and G, and string processing has become a cornerstone of computational biology research). In many sorting applications the keys that determine the order are strings, and string sorting methods that exploit the special properties of strings can be more efficient than the general-purpose sorting methods covered earlier.

Key Index Count

Key index counting is a simple sorting method for small integer keys and is the basis for the next two string sorting methods. Small integer keys come up often. For example, when a teacher records student scores, the class may be divided into groups that are numbered with small integers, and key index counting can then sort the students by group. Using the students as the example, the method has four steps. The input array (group number, then name) is as follows:

2 Anderson
3 Brown
3 Davis
4 Garcia
1 Harris
3 Jackson
4 Johnson
3 Jones
1 Martin
2 Martinez
2 Miller
1 Moore
2 Robinson
4 Smith
4 Taylor
4 Thomas
2 Thompson
3 White
4 Williams
4 Wilson

frequency count

The first step uses an int array, count, to record how often each key occurs. For each element in the array, use its key to find the corresponding entry in count and add 1 to it. To simplify the later steps, a key r increments the entry at position r+1. For example, Anderson is in group 2, so count[3] is incremented; Brown is in group 3, so count[4] is incremented. The entry count[0] is never incremented, so its value is always 0. In this example count[1] is also 0, because there is no group 0.
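As a rough sketch of this step, the counting loop can be run on the group numbers of the 20 students listed above. The class and method names here are made up for illustration, and R = 5 is an assumption chosen so that the keys 0 to 4 are covered.

```java
import java.util.Arrays;

class FrequencyCount {
    // Count how often each key occurs; keys are assumed to lie in 0..R-1.
    static int[] frequencies(int[] keys, int R) {
        int[] count = new int[R + 1];
        for (int key : keys)
            count[key + 1]++;   // key r is tallied at position r + 1
        return count;
    }

    public static void main(String[] args) {
        // Group numbers of the 20 students, in the order listed above.
        int[] groups = {2, 3, 3, 4, 1, 3, 4, 3, 1, 2, 2, 1, 2, 4, 4, 4, 2, 3, 4, 4};
        int[] count = frequencies(groups, 5);
        System.out.println(Arrays.toString(count));   // [0, 0, 3, 5, 5, 7]
    }
}
```

The three students in group 1 show up in count[2], the five in group 2 in count[3], and so on.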

Convert Frequency to Index

The result of the counting step is then used to calculate the starting index of each key in the sorted result. There are three people in the first group and five in the second group, so the starting index of the second group is 3 and the starting index of the third group is 8, and so on. The following loop converts the frequencies to indexes, after which the starting index of group r can be read from count[r]. This is why position r+1 was incremented in the previous step.

for(int r=0;r<R;r++){
    count[r+1]+=count[r];
}
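To make the arithmetic concrete, here is a small sketch that applies the same loop to the frequencies of the student example, which are 0, 0, 3, 5, 5, 7. The class and method names are hypothetical.

```java
import java.util.Arrays;

class FrequencyToIndex {
    // Turn frequencies (key r tallied at count[r + 1]) into starting indexes, in place.
    static void toIndices(int[] count) {
        for (int r = 0; r + 1 < count.length; r++)
            count[r + 1] += count[r];
    }

    public static void main(String[] args) {
        int[] count = {0, 0, 3, 5, 5, 7};   // frequencies from the student example
        toIndices(count);
        System.out.println(Arrays.toString(count));   // [0, 0, 3, 8, 13, 20]
    }
}
```

After the loop, count[1] = 0, count[2] = 3, count[3] = 8 and count[4] = 13 are the starting positions of groups 1 to 4, and the last entry is the total number of students.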

data classification

After the count array has been converted to an index table, every element is moved to an auxiliary array, aux. The position of each element in aux is given by the count entry for its key, and each time an element is moved, that count entry is incremented by 1 so that the next element with the same key lands in the next slot. Once all elements have been moved, aux holds the sorted result.

Writeback

Finally, the sorted results in aux are copied back to the original array. This method never changes the relative order of elements with equal keys, so it is a stable sort.
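Putting the four steps together, a minimal sketch might look like the following. It keeps the names and their group keys in two parallel arrays, which is an assumption made for brevity, and it uses only the first six students from the list above.

```java
import java.util.Arrays;

class KeyIndexedCounting {
    // Sort names by their group keys (assumed to lie in 0..R-1), stably.
    static void sort(String[] names, int[] keys, int R) {
        int N = names.length;
        String[] auxNames = new String[N];
        int[] auxKeys = new int[N];
        int[] count = new int[R + 1];

        for (int i = 0; i < N; i++)      // 1. frequency count
            count[keys[i] + 1]++;
        for (int r = 0; r < R; r++)      // 2. frequencies to starting indexes
            count[r + 1] += count[r];
        for (int i = 0; i < N; i++) {    // 3. distribute into the auxiliary arrays
            int pos = count[keys[i]]++;
            auxNames[pos] = names[i];
            auxKeys[pos] = keys[i];
        }
        for (int i = 0; i < N; i++) {    // 4. write back
            names[i] = auxNames[i];
            keys[i] = auxKeys[i];
        }
    }

    public static void main(String[] args) {
        String[] names = {"Anderson", "Brown", "Davis", "Garcia", "Harris", "Jackson"};
        int[] keys = {2, 3, 3, 4, 1, 3};
        sort(names, keys, 5);
        System.out.println(Arrays.toString(keys));    // [1, 2, 3, 3, 3, 4]
        System.out.println(Arrays.toString(names));   // [Harris, Anderson, Brown, Davis, Jackson, Garcia]
    }
}
```

Brown, Davis and Jackson all have key 3 and come out in the same order they went in, which is the stability property described above.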

Low-bit-first string ordering

Low-bit-first (LSD, least-significant-digit-first) string sorting is built on key index counting and is suitable for fixed-length strings such as phone numbers, bank account numbers, IP addresses, and license plate numbers. If every string has length W, the method makes W passes of key index counting, using the character at each position as the key and moving from right to left. Because key index counting is stable, the array is fully sorted after the W passes.

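public class LSD {
    // Sort an array of strings that all have exactly w characters.
    public static void sort(String[] a, int w) {
        int N = a.length;
        int R = 256;   // alphabet size (extended ASCII)
        String[] aux = new String[N];

        // One stable key index counting pass per position, right to left.
        for (int d = w - 1; d >= 0; d--) {
            int[] count = new int[R + 1];

            // Frequency count.
            for (int i = 0; i < N; i++)
                count[a[i].charAt(d) + 1]++;

            // Convert frequencies to starting indexes.
            for (int r = 0; r < R; r++)
                count[r + 1] += count[r];

            // Distribute into aux.
            for (int i = 0; i < N; i++)
                aux[count[a[i].charAt(d)]++] = a[i];

            // Write back.
            for (int i = 0; i < N; i++)
                a[i] = aux[i];
        }
    }
}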
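The reason the right-to-left passes work is stability, not key index counting in particular. As an illustration only, the sketch below swaps the counting pass for Java's Arrays.sort on objects, which is documented to be stable, and sorts on one character position per pass. The class name and the sample license plates are made up for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

class LsdByStableSort {
    // Sort strings of length w with w stable passes, rightmost character first.
    static void sort(String[] a, int w) {
        for (int d = w - 1; d >= 0; d--) {
            final int pos = d;   // lambdas can only capture effectively final variables
            Arrays.sort(a, Comparator.comparingInt((String s) -> s.charAt(pos)));
        }
    }

    public static void main(String[] args) {
        String[] plates = {"4PGC938", "2IYE230", "3CIO720", "1ICK750", "1OHV845", "4JZY524"};
        sort(plates, 7);
        // [1ICK750, 1OHV845, 2IYE230, 3CIO720, 4JZY524, 4PGC938]
        System.out.println(Arrays.toString(plates));
    }
}
```

Each pass leaves strings that agree on the current character in the order produced by the earlier passes, so after the pass on position 0 the whole array is in order. The counting version does the same thing with linear-time passes.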

High-bit-first string ordering

The low-bit-first string sorting method only works when all strings have the same length, but a general-purpose string sort must also handle strings of different lengths. High-bit-first (MSD, most-significant-digit-first) string sorting is one such method. It examines the characters from left to right, and because the strings need not have the same length, a string whose characters have all been examined (a shorter string) must be placed before the longer strings that share its prefix. To support this, the code wraps charAt in a method that returns -1 when the given position is past the end of the string. Adding 1 to every returned value then gives a non-negative integer that can be used as an index into the count array. Since key index counting itself needs one extra position, the count array is created with size R+2.

public class MSD {
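    private static final int R = 256;   // alphabet size (extended ASCII)
    private static final int M = 0;     // cutoff: subarrays of at most M + 1 elements use insertion sort
    private static String[] aux;        // auxiliary array for distribution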

    private static int charAt(String s, int d) {
        if (d < s.length())
            return s.charAt(d);
        else
            return -1;
    }

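    // Sort the whole array; aux is shared by all recursive calls.
    public static void sort(String[] a) {
        int N = a.length;
        aux = new String[N];
        // Debug trace of the top-level call.
        System.out.println(String.format("sort(a, %s, %s, %s)", 0, N - 1, 0));
        sort(a, 0, N - 1, 0);
    }

    // Number of calls made to the recursive sort (for inspection only).
    public static int index;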

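    // Sort a[lo..hi], whose strings share their first d characters.
    public static void sort(String[] a, int lo, int hi, int d) {
        index++;
        // Subarrays of at most M + 1 elements go to insertion sort.
        if (hi <= lo + M) {
            insertion(a, lo, hi, d);
            return;
        }

        // Frequency count; charAt + 2 leaves room for the end-of-string key -1.
        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++)
            count[charAt(a[i], d) + 2]++;

        // Convert frequencies to starting indexes.
        for (int r = 0; r < R + 1; r++)
            count[r + 1] += count[r];

        // Distribute into aux.
        for (int i = lo; i <= hi; i++)
            aux[count[charAt(a[i], d) + 1]++] = a[i];

        // Write back.
        for (int i = lo; i <= hi; i++)
            a[i] = aux[i - lo];

        // Recursively sort the subarray for each character value.
        for (int r = 0; r < R; r++) {
            sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1);
        }
    }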

    // insertion sort a[lo..hi], starting at dth character
    private static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--)
                exch(a, j, j - 1);
    }

    // exchange a[i] and a[j]
    private static void exch(String[] a, int i, int j) {
        String temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }

    // is v less than w, starting at character d
    private static boolean less(String v, String w, int d) {
        // assert v.substring(0, d).equals(w.substring(0, d));
        for (int i = d; i < Math.min(v.length(), w.length()); i++) {
            if (v.charAt(i) < w.charAt(i))
                return true;
            if (v.charAt(i) > w.charAt(i))
                return false;
        }
        return v.length() < w.length();
    }
}

As with quicksort, high-bit-first string sorting partitions the array into subarrays that can be sorted independently, but each partitioning pass yields one subarray per leading character rather than the fixed two or three of quicksort. In addition, to avoid excessive recursion on small subarrays, the method switches to insertion sort when a subarray is small enough.
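To see the "one subarray per leading character" idea in isolation, the sketch below runs a single partitioning pass on position 0 and prints the resulting groups. It is not the full recursive sort; the class name, the partition method and the sample words are assumptions made for the illustration.

```java
import java.util.Arrays;

class MsdPartitionPass {
    static final int R = 256;   // alphabet size (extended ASCII)

    // Character at position d, or -1 past the end of the string.
    static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    // Stably partition a[] on the character at position d.
    // In the returned array, a[count[r]..count[r + 1] - 1] holds the strings
    // whose character at position d is r.
    static int[] partition(String[] a, int d) {
        int N = a.length;
        String[] aux = new String[N];
        int[] count = new int[R + 2];
        for (String s : a)                    // frequency count
            count[charAt(s, d) + 2]++;
        for (int r = 0; r < R + 1; r++)       // frequencies to starting indexes
            count[r + 1] += count[r];
        for (String s : a)                    // distribute
            aux[count[charAt(s, d) + 1]++] = s;
        System.arraycopy(aux, 0, a, 0, N);    // write back
        return count;
    }

    public static void main(String[] args) {
        String[] a = {"she", "sells", "seashells", "by", "the", "sea", "shore"};
        int[] count = partition(a, 0);
        for (int r = 0; r < R; r++)
            if (count[r + 1] > count[r])
                System.out.println((char) r + ": "
                        + Arrays.toString(Arrays.copyOfRange(a, count[r], count[r + 1])));
        // b: [by]
        // s: [she, sells, seashells, sea, shore]
        // t: [the]
    }
}
```

The full method would now recurse into each printed group using position 1 as the key. The groups for b and t already hold a single string, which is the kind of tiny subarray the insertion sort cutoff is meant to handle.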

Posted by JayBlake on Sat, 04 Jan 2020 06:47:02 -0800

Programmer Group