# PKU Data Structures and Algorithms: External Sorting

Keywords: algorithms, data structures

# Computer memory

## External memory characteristics

Principle: minimize the number of accesses to external memory

## External memory data access mode

Accessing external memory takes two stages: positioning (locating the data on the device) and the data transfer itself

External memory is divided into fixed-length blocks

Data is read and written one block at a time, which reduces the number of positioning operations and therefore the total time spent on external reads and writes

# File organization and management

A file is a data structure stored in external memory: a collection of a large number of records of the same kind. A record is a block of data with independent logical meaning and is the basic unit of data

• By record type
  • Operating system file: a continuous sequence of characters with no obvious structure
  • Database file: a structured set of records. Each record consists of one or more data items, and each data item is a basic data unit that cannot be subdivided

• By record length
  • Fixed-length file
  • Variable-length file

## File organization

The operating system organizes data in the form of files and completes the mapping from the logical structure of files to the physical structure of external memory

• Logical organization of a file: fixed-length records, variable-length records, and records accessed by key

• Physical structure of a file: sequential file, hash file, index file, inverted file

## C++ file streams

A file stream is a data stream whose input source or output target is a file in external storage

The relevant classes include istream, ostream, iostream, fstream, ifstream and ofstream
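As a small illustration of these classes, the sketch below writes integers to a binary file with ofstream and reads them back with ifstream a fixed number of values at a time, to mimic block-wise access. The function names `writeInts` and `readIntsInBlocks` and the raw-`int` file layout are assumptions made for this example, not part of the course material.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: write `values` to `path` as raw binary ints.
// Returns false if the file cannot be opened or written.
bool writeInts(const std::string& path, const std::vector<int>& values) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(int)));
    return static_cast<bool>(out);
}

// Hypothetical helper: read every int from `path`, `blockInts` values per
// read() call, so each call stands in for one block transfer.
std::vector<int> readIntsInBlocks(const std::string& path, std::size_t blockInts) {
    assert(blockInts > 0);
    std::vector<int> result;
    std::vector<int> block(blockInts);
    std::ifstream in(path, std::ios::binary);
    while (in) {
        in.read(reinterpret_cast<char*>(block.data()),
                static_cast<std::streamsize>(blockInts * sizeof(int)));
        // gcount() reports how many bytes the last read() actually delivered,
        // which may be less than a full block at the end of the file.
        std::size_t got = static_cast<std::size_t>(in.gcount()) / sizeof(int);
        result.insert(result.end(), block.begin(),
                      block.begin() + static_cast<std::ptrdiff_t>(got));
    }
    return result;
}
```

Calling `writeInts("data.bin", v)` followed by `readIntsInBlocks("data.bin", 2)` should return the same values as `v`. A larger block size means fewer `read()` calls for the same file.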

# External sorting

The data file in external memory is divided into several segments. Each segment is read into memory in turn and sorted with an internal sort

The sorted segments (sub-files) are called runs, or merge segments

Each run is written back to external memory to free memory for the remaining unsorted segments

Once all segments have been processed, the runs are merged

• Replacement selection: turn the external file into an initial set of runs that are as long as possible
• Merge sort: merge the runs, pass by pass, into a single globally ordered external file

Time cost

• Internal sorting time needed to generate the initial runs
• Read and write time for generating the initial runs and for the merge passes
• Internal merging time

Reducing the number of external reads and writes is the key to improving the efficiency of external sorting

## Replacement selection sort

### Objective

Generate the initial runs from the file. The longer each run is and the fewer runs there are, the better

### Implementation

Replacement selection uses a min-heap held in RAM

1. Initialize the min-heap, which makes selecting the minimum in RAM efficient
   • Read M records from the input buffer into the array RAM
   • Set the tail marker: LAST = M-1
   • Build a min-heap
2. Repeat the following steps until the heap is empty (i.e. LAST < 0)
   • Send the record with the smallest key (the root) to the output buffer
   • Let R be the next record in the input buffer
   • If the key of R is not less than the key just output, put R at the root
   • Otherwise, move the record at position LAST to the root, put R at position LAST (where it waits for the next run), and set LAST = LAST-1
   • Sift the root down to restore the heap

With a heap of size M, a run is at least M records long (except possibly the final run, when the input is exhausted)

• Worst case: only the records in the initial heap become part of the run
• Best case: the entire file becomes a single run
• Average case: the run length is 2M
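The run-length behaviour described above can be checked with a small in-memory simulation of this run-generation procedure (commonly called replacement selection). This is only a sketch: it uses `std::priority_queue` with a (run number, key) pair instead of the LAST marker, and the name `replacementSelectionRuns` is made up for this example. Tagging each record with its run number has the same effect as parking it behind LAST, since held-back records sort after every record of the current run.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Split `input` into sorted runs using a min-heap of at most M records.
std::vector<std::vector<int>> replacementSelectionRuns(const std::vector<int>& input,
                                                       std::size_t M) {
    assert(M > 0);
    using Entry = std::pair<std::size_t, int>;  // (run number, key)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // Step 1: fill the heap with the first M records, all tagged for run 0.
    std::size_t next = 0;
    while (next < input.size() && heap.size() < M)
        heap.push(Entry(0, input[next++]));

    // Step 2: output the minimum, then replace it with the next input record.
    std::vector<std::vector<int>> runs;
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        if (top.first == runs.size()) runs.emplace_back();  // first record of a new run
        runs[top.first].push_back(top.second);
        if (next < input.size()) {
            int r = input[next++];
            // r joins the current run only if it is not less than the key just output.
            heap.push(Entry(r >= top.second ? top.first : top.first + 1, r));
        }
    }
    return runs;
}
```

With M = 3 and the input 5, 1, 9, 3, 7, 2, 8, 4, 6, this produces the two runs 1 3 5 7 8 9 and 2 4 6. An already sorted input yields a single run, and a reverse-sorted input yields runs of exactly M records.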

### Algorithm

    // A is the array that holds the n elements read from external memory
    template <class Elem>
    void ReplacementSelection(Elem * A, int n, const char * in, const char * out)
    {
        Elem mval;              // Minimum value taken from the min-heap
        Elem r;                 // Element read from the input buffer
        FILE * inputFile;       // Input file handle
        FILE * outputFile;      // Output file handle
        Buffer<Elem> input;     // Input buffer
        Buffer<Elem> output;    // Output buffer
        initFiles(inputFile, outputFile, in, out);  // Initialize the I/O files
        initMinHeapArry(inputFile, n, A);           // Read the first n elements into A
        MinHeap<Elem> H(A, n, n);                   // Build the min-heap
        initInputBuffer(input, inputFile);          // Fill the input buffer with some data
        for (int last = n-1; last >= 0;)
        {
            mval = H.heapArray[0];                  // Minimum value of the heap
            sendToOutputBuffer(input, output, inputFile, outputFile, mval);
            input.read(r);      // Next input element (assumes Buffer provides read())
            if (!less(r, mval))
                H.heapArray[0] = r;                 // r can still join the current run
            else    // Otherwise move the last element to the root and put r in the last position
            {
                H.heapArray[0] = H.heapArray[last];
                H.heapArray[last] = r;
                H.setSize(last);
                last--;
            }
            if (last != 0)
                H.SiftDown(0);  // Restore the heap (assumes MinHeap provides SiftDown())
        }
        endUp(output, inputFile, outputFile);       // Flush the output buffer
    }


## Merge sort

### Merging property

#### Two-way merging

With m initial runs (sorted sequences), the merge tree has height $\lceil\log_{2}m\rceil+1$, and merging takes $\lceil\log_{2}m\rceil$ passes over the data

Two input buffers and one output buffer are required

#### k-way merging

k-way merging merges k runs at a time, so the number of merge passes is $\lceil\log_{k}m\rceil$

k input buffers and one output buffer are required
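The pass count can be computed without floating-point logarithms by repeatedly dividing the number of runs by k and rounding up. The sketch below does this; `mergePasses` is a name chosen for this example.

```cpp
#include <cassert>

// Number of passes needed to merge m runs, k runs at a time.
// Each pass turns m runs into ceil(m / k) runs, so the loop runs
// ceil(log_k m) times before a single run remains.
int mergePasses(long long m, long long k) {
    assert(m >= 1 && k >= 2);
    int passes = 0;
    while (m > 1) {
        m = (m + k - 1) / k;  // ceil(m / k) in integer arithmetic
        ++passes;
    }
    return passes;
}
```

For example, 9 runs need 4 passes with two-way merging but only 2 passes with three-way merging, at the cost of one more input buffer.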

### Best merge tree

The order in which runs are merged affects the number of reads and writes. Taking the length of each initial run as a weight, choosing the best merge order is essentially the Huffman tree optimization problem

#### Process

Take the block counts of all initial runs as the weights of the leaf nodes of the tree. For k-way merging, build a k-ary Huffman tree. This Huffman tree is the best merge tree
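A minimal sketch of this construction is shown below. It computes only the cost of the best merge tree: the total number of blocks read (and, equally, written) across all merges. The function name `bestMergeCost` is hypothetical. The padding step is a standard detail of k-ary Huffman trees that the text above does not spell out: when the number of runs does not fit a full k-ary tree, empty runs of length 0 are added so that every merge takes exactly k inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Total blocks read over the whole merge when runs are merged
// k at a time in Huffman order (smallest runs first).
long long bestMergeCost(const std::vector<long long>& runBlocks, int k) {
    assert(k >= 2);
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>>
        pq(runBlocks.begin(), runBlocks.end());

    // Pad with empty runs until (count - 1) is a multiple of (k - 1).
    const std::size_t step = static_cast<std::size_t>(k - 1);
    while (pq.size() > 1 && (pq.size() - 1) % step != 0) pq.push(0);

    long long cost = 0;
    while (pq.size() > 1) {
        long long merged = 0;
        for (int i = 0; i < k; ++i) {  // take the k shortest runs
            merged += pq.top();
            pq.pop();
        }
        cost += merged;                // every block of the merged run is read once
        pq.push(merged);
    }
    return cost;
}
```

For runs of 2, 3, 6, 9, 12, 17, 18 and 24 blocks with k = 3, one empty run is added and the merges produce runs of 5, 20, 47 and 91 blocks, for a cost of 163 block reads.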

### Multiway merging tree

In k-way merging, selecting each output record takes k-1 comparisons, which is expensive. The goal is to find the minimum among the current values of the k runs more efficiently

#### Winner tree

A complete binary tree is used as the storage structure

Leaf nodes are stored in the array L[1..n] and internal nodes in the array B[1..n-1]

Array B stores indices into array L

Each internal node records the winner of its match

##### Relationship of nodes

For n-way merging, the winner tree has 2n-1 nodes

• The number of external nodes is n, the number of internal nodes is n-1, and the depth of the tournament tree (internal nodes only, counting the root as level 0) is $s=\lceil\log_{2}n\rceil-1$

• The leftmost internal node on the lowest level has index $2^s$

• The number of internal nodes on the lowest level is $n-2^s$

• The number of external nodes on the lowest level is twice the number of internal nodes on the lowest level, i.e. $LowExt=2(n-2^s)$

• The number of nodes above the lowest level of external nodes (the offset) is $offset=2^{s+1}-1$

• The parent B[p] of the external node L[i] is given by the following formula, where / denotes integer division

$$p=\begin{cases}(i+offset)/2 & i\le LowExt\\(i-LowExt+n-1)/2 & i>LowExt\end{cases}$$
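To make the index arithmetic concrete, the sketch below computes s, LowExt and offset for a given n and maps an external node L[i] to its parent B[p]. The struct `TreeLayout` and the functions `layoutFor` and `parentOf` are names invented for this example. The loop computes s as the largest value with $2^s \le n-1$, which for n ≥ 2 gives the same result as $\lceil\log_{2}n\rceil-1$.

```cpp
#include <cassert>

struct TreeLayout {
    int s;       // level of the lowest internal nodes (root is level 0)
    int lowExt;  // number of external nodes on the lowest level
    int offset;  // number of nodes above the lowest external level
};

TreeLayout layoutFor(int n) {
    assert(n >= 2);
    int s = 0;
    while ((2 << s) <= n - 1) ++s;  // largest s with 2^s <= n - 1
    int pow = 1 << s;               // 2^s, index of the leftmost lowest internal node
    return {s, 2 * (n - pow), 2 * pow - 1};
}

// Index p of the internal node B[p] that is the parent of external node L[i].
int parentOf(int i, int n) {
    assert(i >= 1 && i <= n);
    TreeLayout t = layoutFor(n);
    return i <= t.lowExt ? (i + t.offset) / 2
                         : (i - t.lowExt + n - 1) / 2;
}
```

For n = 5 this gives s = 2, LowExt = 2 and offset = 7: L[1] and L[2] hang under B[4], L[3] under B[2], and L[4] and L[5] under B[3].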

##### Characteristics of winner tree
• The winner of a match is decided by comparing the scores of the two players

Starting from the bottom of the tree, every pair of leaves plays a match. The loser is eliminated and the winner goes on to play at the next level up. The root records the winner of the whole tournament

• If the score of player L[i] changes, the winner tree is easy to update

Follow the path from L[i] to the root, compare with the sibling node at each level, and update the nodes on that path according to the match results. The results in the rest of the tree do not change

#### Loser tree

Each internal node records the loser of its match. An extra node B[0] is added above the root to record the overall winner

When the loser tree is rebuilt, the new value only has to be compared with the nodes on its path to the root, not with sibling nodes, which simplifies reconstruction

##### Competition process
• The node entering the tree plays against its parent node

The loser is stored in the parent node, and the winner goes on to play against the parent node one level up

• Matches continue up to node B[1]

The loser's index is stored in B[1] and the winner's index in B[0]

##### Algorithm

    template<class T>
    class LoserTree
    {
    private:
        int MaxSize;    // Maximum number of players
        int n;          // Current number of players
        int LowExt;     // Number of external nodes on the lowest level
        int offset;     // Total number of nodes above the lowest external nodes
        int *B;         // Loser tree array; it stores indices into L
        T *L;           // Element array
        // Play a match at internal node p, then keep playing upward while p is a right child
        void Play(int p, int lc, int rc, int (*winner)(T A[], int b, int c), int (*loser)(T A[], int b, int c));
    public:
        LoserTree(int Treesize = MAX);
        ~LoserTree() { delete [] B; }
        // Initialize the loser tree
        void Initialize(T A[], int size, int (*winner)(T A[], int b, int c), int (*loser)(T A[], int b, int c));
        int Winner();   // Returns the index of the winner
        // Rebuild the loser tree after L[i] has changed
        void RePlay(int i, int (*winner)(T A[], int b, int c), int (*loser)(T A[], int b, int c));
    };

###### Initialize loser tree

    template<class T>
    void LoserTree<T>::Initialize(T A[], int size, int (*winner)(T A[], int b, int c), int (*loser)(T A[], int b, int c))
    {
        int i, s;
        n = size;   // Initialize member variables
        L = A;
        // s becomes the index of the leftmost internal node on the lowest level
        for (s = 1; 2*s <= n-1; s += s);
        LowExt = 2*(n-s);
        offset = 2*s-1;
        for (i = 2; i <= LowExt; i += 2)    // Matches between the lowest external nodes
            Play((offset+i)/2, i-1, i, winner, loser);
        // Process the remaining external nodes
        if (n%2)    // n is odd: one match pairs an internal node with an external node
        {
            // The left child's winner was stored temporarily in the parent, B[(n-1)/2];
            // it plays against the external right child
            Play(n/2, B[(n-1)/2], LowExt+1, winner, loser);
            i = LowExt+3;
        }
        else i = LowExt+2;
        for (; i <= n; i += 2)  // Matches between the remaining external nodes
            Play((i-LowExt+n-1)/2, i-1, i, winner, loser);
    }

###### Generate loser tree

    template<class T>
    void LoserTree<T>::Play(int p, int lc, int rc, int (*winner)(T A[], int b, int c), int (*loser)(T A[], int b, int c))
    {
        B[p] = loser(L, lc, rc);    // The loser's index is stored in B[p]
        int temp1, temp2;
        temp1 = winner(L, lc, rc);
        while (p > 1 && p%2)
        {
            // p is a right child, so the match one level up can be played now
            temp2 = winner(L, temp1, B[p/2]);   // Winner vs. the value held in the parent
            B[p/2] = loser(L, temp1, B[p/2]);   // The loser stays in the parent
            temp1 = temp2;                      // Keep the winner in temp1
            p /= 2;                             // Move p up to the parent node
        }
        // p is a left child or the root: store the winner in the parent (B[0] when p == 1)
        B[p/2] = temp1;
    }

###### Reconstruct the loser tree

    template<class T>
    void LoserTree<T>::RePlay(int i, int (*winner)(T A[], int b, int c), int (*loser)(T A[], int b, int c))
    {
        int p;  // Index of the parent node
        if (i <= LowExt)    // Determine the location of the parent node
            p = (i+offset)/2;
        else p = (i-LowExt+n-1)/2;
        B[0] = winner(L, i, B[p]);  // Save the winner's index in B[0]
        B[p] = loser(L, i, B[p]);   // Save the loser's index in B[p]
        for (; (p/2) >= 1; p /= 2)  // Play matches along the path to the root
        {
            int temp;   // Temporarily holds the winner's index
            temp = winner(L, B[p/2], B[0]);
            B[p/2] = loser(L, B[p/2], B[0]);
            B[0] = temp;
        }
    }


#### Efficiency of multiway merging

Merging k runs

• Naive method: finding each minimum takes O(k) time, so producing a merged sequence of size n takes O(kn) in total
• Loser tree method: initializing a loser tree with k players takes O(k), and reading in a new value and rebuilding the loser tree takes O(log k), so producing a merged sequence of size n takes O(k + n log k) in total
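The O(log k) replay step can be seen in a compact k-way merge over in-memory runs. Note that this sketch does not use the LoserTree class above. It swaps in a simpler and widely used array layout in which leaf i sits at position k+i, so its parent is (i+k)/2, and it relies on two sentinel keys. The function name `loserTreeMerge` and the use of `long long` sentinels are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Merge k sorted runs. tree[0] holds the overall winner (index of the run
// with the smallest current key); tree[1..k-1] hold the losers of each match.
std::vector<int> loserTreeMerge(const std::vector<std::vector<int>>& runs) {
    const int k = static_cast<int>(runs.size());
    std::vector<int> out;
    if (k == 0) return out;

    const long long kMin = std::numeric_limits<long long>::min();  // beats every real key
    const long long kMax = std::numeric_limits<long long>::max();  // marks an exhausted run
    std::vector<std::size_t> pos(runs.size(), 0);   // read position in each run
    std::vector<long long> key(runs.size() + 1);    // current key of each run
    for (int i = 0; i < k; ++i) key[i] = runs[i].empty() ? kMax : runs[i][0];
    key[k] = kMin;                                  // dummy player used during setup

    std::vector<int> tree(runs.size(), k);          // every node starts as the dummy
    auto adjust = [&](int s) {
        // Replay the matches on the path from leaf s to the root: O(log k).
        for (int t = (s + k) / 2; t > 0; t /= 2)
            if (key[s] > key[tree[t]]) std::swap(s, tree[t]);  // loser stays, winner moves up
        tree[0] = s;
    };
    for (int i = k - 1; i >= 0; --i) adjust(i);     // build the tree

    while (key[tree[0]] != kMax) {                  // stop when every run is exhausted
        int w = tree[0];
        out.push_back(static_cast<int>(key[w]));
        ++pos[w];
        key[w] = pos[w] < runs[w].size() ? runs[w][pos[w]] : kMax;
        adjust(w);                                  // only the winner's path is replayed
    }
    return out;
}
```

Merging the runs {1, 4}, {2, 3} and {0, 5} this way yields 0 1 2 3 4 5. Each output record costs one call to `adjust`, which touches only the nodes between one leaf and the root.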

Posted by philwong on Mon, 29 Nov 2021 22:00:39 -0800