PKU data structure and algorithm -- external sorting

Keywords: Algorithm data structure

Computer memory

External memory characteristics

Advantages: permanent storage capacity, portability

Disadvantages: long access time

Principle: minimize the number of accesses to external memory

External memory data access mode

Each access has two stages: positioning (locating the data) and the actual read or write

External memory is divided into fixed-length storage blocks

Data in external memory is accessed one block at a time, which reduces the number of positioning operations and the time spent on external memory reads and writes

File organization and management

A file is a data structure stored in external memory: a collection of a large number of records of the same kind. A record is a block of data with independent logical meaning and is the basic unit of data

  • By record type

    • Operating system files

      A continuous sequence of characters with no obvious structure

    • Database files

      A structured set of records. Each record consists of one or more data items, and each data item is a basic unit of data that cannot be subdivided

  • By record length

    • Fixed-length files

    • Variable-length files

File organization

The operating system organizes data as files and maps the logical structure of each file onto the physical structure of external memory

  • Logical file organization

    Fixed-length records, variable-length records, and key-accessed records

  • Physical file structure

    Sequential files, hash files, index files, inverted files

C++ file streams

A file stream is a data stream whose input source or output destination is a file in external storage

The stream classes include istream, ostream, and iostream, and the file-stream classes fstream, ifstream, and ofstream

External sorting

The data file in external memory is divided into several segments. Each segment is read into memory in turn and sorted with an internal sorting algorithm

These sorted segments (sub-files) are called runs, or merge segments

Each run is written back to external memory to free memory for the next unsorted segment

Once all segments have been processed, the runs are merged

  • Replacement selection sort: turn the external file into a set of initial runs that are as long as possible
  • Merge sort: merge the runs pass by pass until a single globally ordered external file remains

Time composition

  • The time for the internal sorts that generate the initial runs
  • The read and write time for creating the initial runs and for the merge passes
  • The time for merging in memory

Reducing the number of external memory reads and writes is the key to improving the efficiency of external sorting

Replacement selection sort


Generate the initial runs (sorted sequences) from the file. The longer each run is, and the fewer runs there are, the better


A min-heap held in an in-memory array RAM does the work, which keeps the in-memory sorting efficient

  1. Initialize the min-heap
    • Read M records from the input buffer into the array RAM
    • Set the tail marker: LAST = M-1
    • Build a min-heap
  2. Repeat the following steps until the heap is empty (i.e. LAST < 0)
    • Send the record with the minimum key (the root node) to the output buffer
    • Let R be the next record in the input buffer
      • If the key of R is not less than the key just output, put R in the root node
      • Otherwise, replace the root with the record in the last position of the heap array, put R in that last position (where it waits for the next run), and set LAST = LAST-1
    • Restore the heap by sifting down the root node

With a heap of size M, every run has length at least M

  • At minimum, the records in the initial heap all become part of the run
  • At best, the entire file is turned into a single run
  • The average run length is 2M


// A is the array that holds the n records read from external memory
template <class Elem>
void ReplacementSelection(Elem * A, int n, const char * in, const char * out) {
    Elem mval;              // Minimum value taken from the min-heap
    Elem r;                 // Record read from the input buffer
    FILE * inputFile;       // Input file handle
    FILE * outputFile;      // Output file handle
    Buffer<Elem> input;     // Input buffer
    Buffer<Elem> output;    // Output buffer
    initFiles(inputFile, outputFile, in, out);   // Open the input and output files
    initMinHeapArry(inputFile, n, A);            // Read n records into the heap array
    MinHeap<Elem> H(A, n, n);                    // Build the min-heap
    initInputBuffer(input, inputFile);           // Fill the input buffer with some data
    for (int last = n-1; last >= 0;) {
        mval = H.heapArray[0];                   // Minimum value of the heap
        sendToOutputBuffer(input, output, inputFile, outputFile, mval);  // Send it to the output buffer
        input.read(r);                           // Read the next record (read() is an assumed Buffer method)
        if (!less(r, mval))
            H.heapArray[0] = r;                  // r can still join the current run
        else {  // Otherwise move the last record to the root and park r in the last position
            H.heapArray[0] = H.heapArray[last];
            H.heapArray[last] = r;
            H.setSize(last);                     // Shrink the heap (setSize() is an assumed MinHeap method)
            last--;
        }
        if (last != 0)
            H.SiftDown(0);                       // Restore the heap property
    }
    endUp(output, inputFile, outputFile);        // Flush the output buffer and finish
}

Merge sort

Merging property

Two-way merging

For m initial runs, the merge tree has height $\lceil\log_{2}m\rceil+1$ and $\lceil\log_{2}m\rceil$ passes over the data are required

Two input buffers and one output buffer are required

k-way merging

k-way merging combines k runs at a time, and the number of merge passes is $\lceil\log_{k}m\rceil$

k input buffers and one output buffer are required
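
To make the effect of k concrete, here is a small sketch that counts merge passes by repeatedly grouping runs k at a time, which gives the same value as the ceiling of log base k of m. The function name mergePasses is an assumption for this example.

```cpp
#include <cassert>

// Number of passes needed to reduce m runs to one with k-way merging.
int mergePasses(long long m, int k) {
    assert(m >= 1 && k >= 2);
    int passes = 0;
    while (m > 1) {
        m = (m + k - 1) / k;  // each pass merges groups of up to k runs
        ++passes;
    }
    return passes;
}
```

With 100 initial runs, mergePasses(100, 2) gives 7 while mergePasses(100, 10) gives 2, so a larger k sharply reduces how often the whole data set is read and written.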

Best merge tree

The order in which runs are merged affects the number of reads and writes. With the length of each initial run taken as its weight, choosing the order is essentially the Huffman tree optimization problem


Take the number of blocks in each initial run as the weight of a leaf node. For k-way merging, build a k-ary Huffman tree; that Huffman tree is the best merge tree
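
A hedged sketch of how the cost of the best merge tree could be computed: repeatedly merge the k shortest runs, as in Huffman's algorithm. When the number of runs does not fit a full k-ary tree, the sketch pads with zero-length virtual runs, which is the usual way to keep every merge k-way. The function name bestMergeCost is an assumption; the returned value is the weighted path length, i.e. the number of blocks read (and equally the number written) over all merges.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Weighted path length of the k-ary Huffman tree built over the run lengths.
long long bestMergeCost(const std::vector<long long>& runBlocks, int k) {
    assert(k >= 2);
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>> pq(
        runBlocks.begin(), runBlocks.end());
    const std::size_t step = static_cast<std::size_t>(k - 1);
    // Pad with zero-length virtual runs so that every merge takes exactly k inputs.
    while (pq.size() > 1 && (pq.size() - 1) % step != 0) pq.push(0);

    long long cost = 0;
    while (pq.size() > 1) {
        long long merged = 0;
        for (int i = 0; i < k; ++i) {  // take the k shortest runs
            merged += pq.top();
            pq.pop();
        }
        cost += merged;   // every block of the merged run is read and written once
        pq.push(merged);
    }
    return cost;
}
```

For run lengths 2, 3, 6, 9, 12, 17, 18, 24 and k = 3, one virtual run is added and the merges cost 5 + 20 + 47 + 91 = 163 blocks.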

Multiway merging tree

When doing k-way merging, selecting each output record by direct comparison takes k-1 comparisons, which is expensive. The goal is to find the minimum among the current records of the k runs more efficiently

Winner tree

A complete binary tree is used as the storage structure

Leaf nodes are stored in the array L[1... n], and internal nodes in the array B[1... n-1]

Array B actually stores indexes into array L

Each internal node records the winner of its match

Relationship of nodes

For n-way merging, the winner tree has 2n-1 nodes

  • The number of external nodes is n, the number of internal nodes is n-1, and the depth of the tournament tree is $s=\lceil\log_{2}n\rceil-1$

  • The index of the leftmost internal node on the lowest level is $2^s$

  • The number of internal nodes on the lowest level is $n-2^s$

  • The number of external nodes on the lowest level is twice the number of internal nodes on the lowest level, i.e. $LowExt=2(n-2^s)$

  • The number of nodes above the lowest-level external nodes (the offset) is $offset=2^{s+1}-1$

  • The external node L[i] and its internal parent node B[p] are related as follows (integer division)
    $p=\begin{cases}(i+offset)/2 & i\le LowExt\\(i-LowExt+n-1)/2 & i>LowExt\end{cases}$
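
The node relationships can be checked with a few lines of code. The sketch below computes s, LowExt, and offset for a given n and maps an external node index to its parent; the TreeShape struct and both function names are assumptions made for this example.

```cpp
#include <cassert>

// Shape parameters of a tournament tree with n external nodes (1-based indexes).
struct TreeShape {
    int s;       // depth of the lowest internal level
    int lowExt;  // number of external nodes on the lowest level
    int offset;  // number of nodes above the lowest-level external nodes
};

TreeShape treeShape(int n) {
    assert(n >= 2);
    int pow2s = 1;  // 2^s
    int s = 0;
    while (2 * pow2s <= n - 1) {
        pow2s *= 2;
        ++s;
    }
    return {s, 2 * (n - pow2s), 2 * pow2s - 1};
}

// Index p of the internal node B[p] that is the parent of external node L[i].
int parentOf(int i, int n) {
    TreeShape t = treeShape(n);
    return i <= t.lowExt ? (i + t.offset) / 2 : (i - t.lowExt + n - 1) / 2;
}
```

For n = 5 this gives s = 2, LowExt = 2, and offset = 7: L[1] and L[2] hang under B[4], L[3] under B[2], and L[4] and L[5] under B[3].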

Characteristics of the winner tree
  • The winner of each match is decided by comparing the scores of two players

    Starting at the bottom of the tree, every pair of leaves plays a match. The loser is eliminated and the winner goes on to play at the next level up. The root records the winner of the whole tournament

  • If the score of player L[i] changes, the winner tree can be repaired locally

    Along the path from L[i] to the root, replay each match against the sibling node and update the nodes on that path according to the results. Matches in other parts of the tree are not affected

Loser tree

Each internal node records the loser of its match. An extra node B[0] is added above the root to record the final winner

When rebuilding a loser tree, the new value only needs to be compared with the nodes on its path to the root, not with sibling nodes, which simplifies reconstruction

Competition process
  • The new node entering the tree plays its parent node

    The loser is stored in the parent node, and the winner goes on to play the parent node one level up

  • Matches continue up to node B[1]

    The loser's index is stored in B[1] and the winner's index in B[0]

template<class T>
class LoserTree {
private:
    int MaxSize;    // Maximum number of players
    int n;          // Current number of players
    int LowExt;     // Number of external nodes on the lowest level
    int offset;     // Total number of nodes above the lowest-level external nodes
    int *B;         // Loser tree array; it stores indexes into L
    T *L;           // Element array
    void Play(int p, int lc, int rc, int(*winner)(T A[], int b, int c), int(*loser)(T A[], int b, int c));    // Play the match at internal node p, continuing upward while p is a right child
public:
    LoserTree(int Treesize = MAX);
    ~LoserTree() { delete [] B; }
    void Initialize(T A[], int size, int(*winner)(T A[], int b, int c), int(*loser)(T A[], int b, int c));    // Initialize the loser tree
    int Winner();    // Return the index of the final winner
    void RePlay(int i, int(*winner)(T A[], int b, int c), int(*loser)(T A[], int b, int c));    // Rebuild the loser tree after L[i] changes
};
Initialize the loser tree
template<class T>
void LoserTree<T>::Initialize(T A[], int size, int(*winner)(T A[], int b, int c), int(*loser)(T A[], int b, int c)) {
    int i, s;
    n = size;    // Initialize member variables
    L = A;
    for (s = 1; 2*s <= n-1; s += s);    // s becomes the number of nodes on the second-to-last level
    LowExt = 2*(n-s);
    offset = 2*s-1;
    for (i = 2; i <= LowExt; i += 2)    // Matches between the lowest-level external nodes
        Play((offset+i)/2, i-1, i, winner, loser);
    // Process the remaining external nodes
    if (n%2) {    // n is odd: one match pairs an internal node with an external node
        // The left winner temporarily stored in the parent node plays the external right child
        Play(n/2, B[(n-1)/2], LowExt+1, winner, loser);
        i = LowExt+3;
    }
    else i = LowExt+2;
    for (; i <= n; i += 2)    // Matches between the remaining external nodes
        Play((i-LowExt+n-1)/2, i-1, i, winner, loser);
}
Play matches to build the loser tree
template<class T>
void LoserTree<T>::Play(int p, int lc, int rc, int(*winner)(T A[], int b, int c), int(*loser)(T A[], int b, int c)) {
    B[p] = loser(L, lc, rc);    // The loser's index is stored in B[p]
    int temp1, temp2;
    temp1 = winner(L, lc, rc);
    while (p > 1 && p % 2) {    // p is a right child: keep playing upward
        temp2 = winner(L, temp1, B[p/2]);    // The winner plays the index stored in the parent
        B[p/2] = loser(L, temp1, B[p/2]);    // The loser stays in the parent
        temp1 = temp2;    // Keep the winner in temp1
        p /= 2;    // Move p up to the parent node
    }
    B[p/2] = temp1;    // p is a left child or p == 1: store the winner in the parent
}
Reconstruct the loser tree
template<class T>
void LoserTree<T>::RePlay(int i, int (*winner)(T A[], int b, int c), int (*loser)(T A[], int b, int c)) {
    int p;    // Index of the parent node
    if (i <= LowExt)    // Determine the position of the parent node
        p = (i+offset)/2;
    else p = (i-LowExt+n-1)/2;
    B[0] = winner(L, i, B[p]);    // Save the winner's index in B[0]
    B[p] = loser(L, i, B[p]);     // Save the loser's index in B[p]
    for (; (p/2) >= 1; p /= 2) {  // Replay matches along the path to the root
        int temp;    // Temporarily holds the winner's index
        temp = winner(L, B[p/2], B[0]);
        B[p/2] = loser(L, B[p/2], B[0]);
        B[0] = temp;
    }
}

Efficiency of multiway merging

Merging k runs

  • Naive method: finding each minimum takes O(k) time, so producing an output sequence of n records takes O(kn) in total
  • Loser tree method: initializing a loser tree with k players takes O(k), and reading in a new value and rebuilding the tree takes O(log k), so producing an output sequence of n records takes O(k + n log k) in total

Posted by philwong on Mon, 29 Nov 2021 22:00:39 -0800