Computer memory
External memory characteristics
Advantages: persistent (non-volatile) storage, portability
Disadvantages: long access time
Principle: minimize the number of accesses to external memory
External memory data access
An access has two stages: positioning and data transfer
External memory is divided into fixed-length blocks
Data is read and written one block at a time, which reduces the number of positioning operations and therefore the total time spent reading and writing external memory
File organization and management
A file is a data structure stored in external memory: a collection of a large number of records of the same kind. A record is a block of data with independent logical meaning and is the basic unit of data
By record type
- Operating system file: a continuous sequence of characters with no apparent structure
- Database file: a structured collection of records. Each record consists of one or more data items, and each data item is a basic unit of data that cannot be subdivided
By record length
- Fixed-length file
- Variable-length file
File organization
The operating system organizes data as files and maps the logical structure of each file onto the physical structure of external memory
- Logical file organization: fixed-length records, variable-length records, and records accessed by key
- Physical file structure: sequential file, hash file, index file, inverted file
C++ file streams
A file stream is a data stream whose input source or output destination is a file in external storage
The relevant classes are istream, ostream and iostream, together with the file-specific ifstream, ofstream and fstream
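As a rough illustration of block-wise access through file streams, the sketch below writes fixed-length records with std::ofstream and reads them back one block at a time with std::ifstream. The function names writeRecords and readInBlocks, the use of int as the record type, and the blockReads counter are assumptions made for this example only.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write fixed-length records (plain ints in this sketch) to a binary file.
bool writeRecords(const std::string& path, const std::vector<int>& records) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(records.data()),
              static_cast<std::streamsize>(records.size() * sizeof(int)));
    return static_cast<bool>(out);
}

// Read the file back one block at a time.
// blockReads counts how many non-empty blocks were fetched.
std::vector<int> readInBlocks(const std::string& path,
                              std::size_t recordsPerBlock,
                              std::size_t& blockReads) {
    std::vector<int> result;
    blockReads = 0;
    std::ifstream in(path, std::ios::binary);
    if (!in || recordsPerBlock == 0) return result;

    std::vector<int> block(recordsPerBlock);
    while (true) {
        in.read(reinterpret_cast<char*>(block.data()),
                static_cast<std::streamsize>(block.size() * sizeof(int)));
        // gcount() reports how many bytes the last read actually delivered.
        const std::size_t got =
            static_cast<std::size_t>(in.gcount()) / sizeof(int);
        if (got == 0) break;
        ++blockReads;
        result.insert(result.end(), block.begin(),
                      block.begin() + static_cast<std::ptrdiff_t>(got));
        if (got < recordsPerBlock) break;  // short block: end of file
    }
    return result;
}
```

A larger recordsPerBlock means fewer read calls for the same file, which is the effect the block-based access principle aims for.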
External sorting
The data file in external memory is divided into several segments. Each segment is read into memory in turn and sorted with an internal sorting algorithm
The sorted segments (sub-files) are called runs, or merge segments
Each run is written back to external memory to free memory for the remaining unsorted segments
After all segments have been processed, the runs are merged
- Replacement selection sort: turn the external file into a set of initial runs that are as long as possible
- Merge sort: merge the runs pass by pass until a single, globally ordered external file remains
Time composition
- Internal sorting time needed to generate the initial runs
- Read and write time for generating the initial runs and for the merge passes
- Time for internal merging
Reducing the number of external memory reads and writes is the key to improving the efficiency of external sorting
Replacement selection sort
Objective
Generate the initial sorted runs from the file: the longer each run, and therefore the fewer runs, the better
Implementation
The runs are produced with the help of a min-heap in RAM
- Initialize the min-heap, which makes selecting the minimum record in RAM efficient
  - Read M records from the input buffer into the array RAM
  - Set the tail marker: LAST = M-1
  - Build a min-heap
- Repeat the following steps until the heap is empty (i.e. LAST < 0)
  - Send the record with the minimum key (the root) to the output buffer
  - Let R be the next record in the input buffer
  - If the key of R is not less than the key just output, put R in the root
  - Otherwise, replace the root with the record in position LAST, put R in position LAST (where it waits for the next run), and set LAST = LAST-1
  - Sift the root down to restore the heap
With a heap of size M, a run has length at least M (the final run may be shorter)
- At the very least, the records in the initial heap become part of the run
- In the best case, the entire file is produced as a single run
- The average run length is 2M
Algorithm
    // A is the array that holds the n elements read from external memory.
    // Buffer, MinHeap, less and the init*/sendToOutputBuffer/endUp helpers are defined elsewhere.
    template <class Elem>
    void ReplacementSelection(Elem *A, int n, const char *in, const char *out) {
        Elem mval;              // Minimum value of the min-heap
        Elem r;                 // Element read from the input buffer
        FILE *iptF;             // Input file handle
        FILE *optF;             // Output file handle
        Buffer<Elem> input;     // Input buffer
        Buffer<Elem> output;    // Output buffer
        initFiles(iptF, optF, in, out);     // Initialize the input and output files
        initMinHeapArry(iptF, n, A);        // Read n records into the heap array
        MinHeap<Elem> H(A, n, n);           // Build the min-heap
        initInputBuffer(input, iptF);       // Initialize the input buffer and read in some data
        for (int last = n - 1; last >= 0;) {
            mval = H.heapArray[0];          // Smallest record in the heap
            sendToOutputBuffer(input, output, iptF, optF, mval);
            input.read(r);                  // Read the next record from the input buffer
            if (!less(r, mval)) {
                H.heapArray[0] = r;         // r can still join the current run
            } else {
                // r must wait for the next run: move the last record to the root
                // and park r in the last position, outside the shrunken heap
                H.heapArray[0] = H.heapArray[last];
                H.heapArray[last] = r;
                H.setSize(last);
                last--;
            }
            if (last > 0)                   // Nothing to sift with fewer than two records
                H.SiftDown(0);              // Restore the heap property
        }
        endUp(output, iptF, optF);          // Process the remaining output buffer
    }
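The routine above depends on buffer and heap helpers that are not shown. As a self-contained sketch of the same idea, the version below works entirely in memory: it takes the input as a std::vector<int> and returns the runs as vectors. The function name replacementSelection, the int record type, and the handling of an exhausted input (shrinking the heap) are assumptions of this example, and the standard heap algorithms stand in for the MinHeap class.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Split `input` into sorted runs using a min-heap that holds at most M records.
// Layout of `heap`: [0, active) is the live min-heap for the current run,
// [active, heap.size()) holds records parked for the next run.
std::vector<std::vector<int>> replacementSelection(const std::vector<int>& input,
                                                   std::size_t M) {
    std::vector<std::vector<int>> runs;
    if (M == 0 || input.empty()) return runs;

    std::size_t next = std::min(M, input.size());
    std::vector<int> heap(input.begin(),
                          input.begin() + static_cast<std::ptrdiff_t>(next));
    std::size_t active = heap.size();
    const std::greater<int> cmp;  // std::greater turns the std heap into a min-heap

    while (!heap.empty()) {
        auto first = heap.begin();
        std::make_heap(first, first + static_cast<std::ptrdiff_t>(active), cmp);
        runs.emplace_back();
        while (active > 0) {
            auto last = heap.begin() + static_cast<std::ptrdiff_t>(active);
            std::pop_heap(heap.begin(), last, cmp);  // minimum moves to heap[active-1]
            const int out = heap[active - 1];
            runs.back().push_back(out);
            if (next < input.size()) {
                const int r = input[next++];
                heap[active - 1] = r;
                if (r >= out) {
                    std::push_heap(heap.begin(), last, cmp);  // r joins the current run
                } else {
                    --active;                                 // r waits for the next run
                }
            } else {
                // Input exhausted: fill the freed slot with the last stored record.
                heap[active - 1] = heap.back();
                heap.pop_back();
                --active;
            }
        }
        active = heap.size();  // parked records form the heap of the next run
    }
    return runs;
}
```

For example, with M = 3 the input 5 1 9 3 7 2 8 6 4 0 is split into the runs 1 3 5 7 8 9, then 2 4 6, then 0, while an already sorted input comes out as a single run regardless of M.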
Merge sort
Properties of merging
Two-way merging
For m initial runs, the merge tree has height $\lceil\log_{2}m\rceil+1$, and $\lceil\log_{2}m\rceil$ passes over the data are required
Two input buffers and one output buffer are required
k-way merging
Each merge step combines k runs, so the number of merge passes is $\lceil\log_{k}m\rceil$
k input buffers and one output buffer are required
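The pass counts above can be checked with a few lines of integer arithmetic. The helper below is a sketch; the name mergePasses is an assumption of this example. It repeatedly replaces the number of runs m by $\lceil m/k\rceil$, which takes exactly $\lceil\log_{k}m\rceil$ steps to reach a single run.

```cpp
// Number of passes needed to merge m runs k at a time, i.e. ceil(log_k(m)).
// Returns -1 for an invalid merge order (k < 2).
int mergePasses(long long m, long long k) {
    if (k < 2) return -1;
    int passes = 0;
    while (m > 1) {
        m = (m + k - 1) / k;  // one pass turns m runs into ceil(m / k) runs
        ++passes;
    }
    return passes;
}
```

With 9 initial runs, two-way merging needs 4 passes while three-way merging needs only 2, which is why a larger k reduces the number of times the data is read and written.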
Best merge tree
The order in which runs are merged affects the number of reads and writes. With the lengths of the initial runs as weights, finding the best order is essentially the problem of building an optimal (Huffman) tree
Process
Take the initial runs as the leaf nodes of the tree, each weighted by its number of blocks. For k-way merging, build a k-ary Huffman tree; this Huffman tree is the best merge tree
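The construction can be sketched with a priority queue: repeatedly merge the k shortest runs and put the merged run back. The function name optimalMergeCost is an assumption of this example. The sketch also pads the input with zero-length dummy runs whenever the number of runs does not fit a full k-ary tree, a standard step for k-ary Huffman trees that the text above does not spell out. The returned value is the weighted path length, i.e. the total number of blocks read over all merges (the same number of blocks is written).

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Total number of blocks read when the runs are merged in the order given
// by a k-ary Huffman tree built over their lengths (in blocks).
long long optimalMergeCost(const std::vector<long long>& runLengths, int k) {
    if (k < 2 || runLengths.size() < 2) return 0;

    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>> pq;  // min-heap of run lengths
    for (long long len : runLengths) pq.push(len);

    // A full k-ary tree needs (leaves - 1) to be a multiple of (k - 1);
    // add zero-length dummy runs until that holds.
    const std::size_t step = static_cast<std::size_t>(k - 1);
    while ((pq.size() - 1) % step != 0) pq.push(0);

    long long cost = 0;
    while (pq.size() > 1) {
        long long merged = 0;
        for (int i = 0; i < k; ++i) {  // take the k shortest runs
            merged += pq.top();
            pq.pop();
        }
        cost += merged;                // every block of the merged run is read once
        pq.push(merged);
    }
    return cost;
}
```

For nine runs of 9, 30, 12, 18, 3, 17, 2, 6 and 24 blocks merged three at a time, the sketch returns 223: the merges produce runs of 11, 32, 59 and finally 121 blocks.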
Multiway merge trees
In a k-way merge, selecting each output record by direct comparison takes k-1 comparisons, which is expensive. The goal is to find the minimum among the current records of the k runs more efficiently
Winner tree
A complete binary tree is used as the storage structure
The leaf (external) nodes are stored in array L[1..n] and the internal nodes in array B[1..n-1]
Array B stores indices into array L
Each internal node records the winner of its match
Relationship of nodes
In an n-way merge, the winner tree has 2n-1 nodes
- There are n external nodes and n-1 internal nodes. With the root on level 0, the lowest level of internal nodes is level $s=\lceil\log_{2}n\rceil-1$
- The leftmost internal node on the lowest level has number $2^s$
- The lowest level contains $n-2^s$ internal nodes
- The number of external nodes on the lowest level is twice the number of internal nodes on that level, i.e. $LowExt=2(n-2^s)$
- The number of nodes above the lowest-level external nodes (the offset) is $offset=2^{s+1}-1$
- The parent B[p] of external node L[i] is given by (using integer division)
$$p=\begin{cases}(i+offset)/2 & i\le LowExt\\(i-LowExt+n-1)/2 & i>LowExt\end{cases}$$
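The index relationships above can be checked with a small sketch. The names TournamentLayout, layoutFor and parentOf are assumptions of this example; pow2s holds the value $2^s$ and is computed by the same doubling loop that the Initialize routine later in this section uses.

```cpp
// Layout parameters of a tournament tree with n external nodes (n >= 2).
struct TournamentLayout {
    int pow2s;   // 2^s: number of the leftmost internal node on the lowest level
    int lowExt;  // number of external nodes on the lowest level
    int offset;  // number of nodes above the lowest-level external nodes
};

TournamentLayout layoutFor(int n) {
    int pow2s = 1;
    while (2 * pow2s <= n - 1) pow2s *= 2;  // largest power of two <= n - 1
    return {pow2s, 2 * (n - pow2s), 2 * pow2s - 1};
}

// Index p of the internal node B[p] that is the parent of external node L[i],
// for 1 <= i <= n. Integer division is intended.
int parentOf(int i, int n) {
    const TournamentLayout t = layoutFor(n);
    return i <= t.lowExt ? (i + t.offset) / 2
                         : (i - t.lowExt + n - 1) / 2;
}
```

For n = 5 this gives LowExt = 2 and offset = 7: L[1] and L[2] hang under B[4], L[3] is the right child of B[2], and L[4] and L[5] hang under B[3].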
Characteristics of the winner tree
- The winner of a match is determined by comparing the scores of the two players
  Starting from the bottom of the tree, every pair of leaves plays a match. The loser is eliminated and the winner goes on to compete at the next level up. The root records the winner of the whole tournament
- If the score of player L[i] changes, the winner tree is easy to update
  Replay the matches along the path from L[i] to the root, comparing with the sibling node at each level and updating the nodes according to the new results. The results in the rest of the tree are unaffected
Loser tree
Each internal node records the loser of its match. An extra node B[0] is added above the root to record the overall winner
When the loser tree is rebuilt, the new record only needs to be compared with the nodes on its path to the root, not with sibling nodes, which simplifies reconstruction
Competition process
- The new node entering the tree plays against its parent node
  The loser is stored in the parent node, and the winner goes on to play against the parent node one level up
- The matches continue up to node B[1]
  The loser's index is stored in B[1] and the winner's index in B[0]
Algorithm
    template <class T>
    class LoserTree {
    private:
        int MaxSize;    // Maximum number of players
        int n;          // Current number of players
        int LowExt;     // Number of external nodes on the lowest level
        int offset;     // Number of nodes above the lowest-level external nodes
        int *B;         // Loser tree array; stores indices into L
        T *L;           // Array of elements (players)
        // Play the match at internal node p, then continue upward from right children
        void Play(int p, int lc, int rc,
                  int (*winner)(T A[], int b, int c),
                  int (*loser)(T A[], int b, int c));
    public:
        LoserTree(int Treesize = MAX);
        ~LoserTree() { delete[] B; }
        // Initialize the loser tree
        void Initialize(T A[], int size,
                        int (*winner)(T A[], int b, int c),
                        int (*loser)(T A[], int b, int c));
        int Winner();   // Return the index of the overall winner
        // Rebuild the loser tree after L[i] has changed
        void RePlay(int i,
                    int (*winner)(T A[], int b, int c),
                    int (*loser)(T A[], int b, int c));
    };
Initialize loser tree
    template <class T>
    void LoserTree<T>::Initialize(T A[], int size,
                                  int (*winner)(T A[], int b, int c),
                                  int (*loser)(T A[], int b, int c)) {
        int i, s;
        n = size;       // Initialize member variables
        L = A;
        // s becomes the number of the leftmost internal node on the lowest level
        for (s = 1; 2 * s <= n - 1; s += s);
        LowExt = 2 * (n - s);
        offset = 2 * s - 1;
        // Matches between the lowest-level external nodes
        for (i = 2; i <= LowExt; i += 2)
            Play((offset + i) / 2, i - 1, i, winner, loser);
        // Process the remaining external nodes
        if (n % 2) {
            // n is odd: one match pairs an internal node with an external node.
            // Play parks the left subtree's winner in the parent, B[(n-1)/2],
            // so that winner plays the external right child
            Play(n / 2, B[(n - 1) / 2], LowExt + 1, winner, loser);
            i = LowExt + 3;
        } else
            i = LowExt + 2;
        // Matches between the remaining external nodes
        for (; i <= n; i += 2)
            Play((i - LowExt + n - 1) / 2, i - 1, i, winner, loser);
    }
Generate loser tree
    template <class T>
    void LoserTree<T>::Play(int p, int lc, int rc,
                            int (*winner)(T A[], int b, int c),
                            int (*loser)(T A[], int b, int c)) {
        B[p] = loser(L, lc, rc);        // The loser's index stays in B[p]
        int temp1, temp2;
        temp1 = winner(L, lc, rc);
        while (p > 1 && p % 2) {
            // p is a right child: the left sibling's winner is parked in the parent
            temp2 = winner(L, temp1, B[p / 2]);     // Winner vs. the record in the parent
            B[p / 2] = loser(L, temp1, B[p / 2]);   // The loser stays in the parent
            temp1 = temp2;                          // Keep the winner in temp1
            p /= 2;                                 // Move up to the parent node
        }
        // p is a left child or the root (p == 1): park the winner in B[p/2]
        B[p / 2] = temp1;
    }
Reconstruct the loser tree
    template <class T>
    void LoserTree<T>::RePlay(int i,
                              int (*winner)(T A[], int b, int c),
                              int (*loser)(T A[], int b, int c)) {
        int p;      // Index of the parent node
        if (i <= LowExt)                // Locate the parent of L[i]
            p = (i + offset) / 2;
        else
            p = (i - LowExt + n - 1) / 2;
        B[0] = winner(L, i, B[p]);      // Save the winner's index in B[0]
        B[p] = loser(L, i, B[p]);       // Save the loser's index in B[p]
        for (; (p / 2) >= 1; p /= 2) {  // Replay the matches up the path to the root
            int temp;                   // Temporarily holds the winner's index
            temp = winner(L, B[p / 2], B[0]);
            B[p / 2] = loser(L, B[p / 2], B[0]);
            B[0] = temp;
        }
    }
Efficiency of multiway merging
Merging k sorted runs
- Direct method: finding each minimum takes O(k) time, so producing a merged run of size n takes O(kn) in total
- Loser tree method: initializing a loser tree with k players takes O(k), and reading in a new value and rebuilding the loser tree takes O(log k), so producing a merged run of size n takes O(k + n log k) in total
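To make the O(log k) replay concrete, here is a compact, self-contained sketch of a k-way merge driven by a loser tree. It is not the LoserTree class above: it assumes a simpler layout in which run i has parent (i + k) / 2, uses index -1 as a sentinel that beats every run during construction, and treats an exhausted run as larger than any key. The name loserTreeMerge and the in-memory std::vector runs are likewise assumptions of this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Merge k sorted runs with a loser tree.
// ls[0] holds the current winner, ls[1..k-1] hold the losers of each match.
std::vector<int> loserTreeMerge(const std::vector<std::vector<int>>& runs) {
    const int k = static_cast<int>(runs.size());
    std::vector<int> out;
    if (k == 0) return out;

    std::vector<std::size_t> pos(runs.size(), 0);  // read position in each run
    std::vector<int> ls(runs.size(), -1);          // -1 is the build-time sentinel

    // True if run a's current key is strictly smaller than run b's.
    auto beats = [&](int a, int b) -> bool {
        if (a == -1) return true;                     // sentinel: minus infinity
        if (b == -1) return false;
        if (pos[a] >= runs[a].size()) return false;   // exhausted: plus infinity
        if (pos[b] >= runs[b].size()) return true;
        return runs[a][pos[a]] < runs[b][pos[b]];
    };

    // Replay the matches on the path from run s to the root: O(log k).
    auto adjust = [&](int s) {
        for (int t = (s + k) / 2; t > 0; t /= 2) {
            if (beats(ls[t], s)) std::swap(s, ls[t]);  // loser stays, winner moves up
        }
        ls[0] = s;
    };

    for (int i = k - 1; i >= 0; --i) adjust(i);        // build the tree

    // The winner is exhausted only when every run is exhausted.
    while (pos[ls[0]] < runs[ls[0]].size()) {
        const int w = ls[0];
        out.push_back(runs[w][pos[w]++]);  // output the minimum, advance that run
        adjust(w);                         // only w's path needs to be replayed
    }
    return out;
}
```

Each output record costs one replay along a single leaf-to-root path, so a merge of n records performs on the order of n log k comparisons instead of the n(k-1) comparisons of the direct method.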