Python data structures and algorithms -- search and sort

Keywords: Python Algorithm data structure

Search

1. Sequential search

  • Using subscripts, we can access and inspect data items one by one, in order. This technique is called "sequential search"
def sequentialSearch(alist, item):
    pos = 0
    found = False

    while pos < len(alist) and not found:
        if alist[pos] == item:
            found = True
        else:
            pos = pos + 1

    return found

testlist = [1, 2, 32, 8, 17, 19, 42, 13, 0]
print(sequentialSearch(testlist, 3))
print(sequentialSearch(testlist, 13))

# result 
"""
False
True
"""
  • To analyze the general case of sequential search, we must assume that the data items are not arranged by value but are placed at random positions in the list (in other words, an item is equally likely to appear at any position in the list)
  • Analysis of sequential search on an unordered list
If the item is not in the list, we must compare against every item to find out; the number of comparisons is n


If the item is in the list, the analysis is subtler:
 since the item is equally likely to be at any position in the list, the average number of comparisons is n/2

Either way, the algorithmic complexity of sequential search is O(n)
  • Sequential search on an ordered list: almost the same as on an unordered list, but the ordered list lets us stop early. For example, if we are searching for 50 and encounter 54 (with the data in ascending order), every later item must also be greater than 50, so 50 cannot be present and we can exit early
  • Ordered list sequential search: code
def orderedSequentialSearch(alist, item):
    pos = 0
    found = False
    stop = False
    while pos < len(alist) and not found and not stop:
        if alist[pos] == item:
            found = True
        else:
            if alist[pos] > item:
                stop = True
            else:
                pos = pos + 1
    return found

testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
print(orderedSequentialSearch(testlist, 3))
print(orderedSequentialSearch(testlist, 13))
  • Analysis of sequential search on an ordered list
If the item is not in the list, the average number of comparisons is n/2


If the item is in the list, the analysis is the same as for the unordered case:
 since the item is equally likely to be at any position, the average number of comparisons is n/2

So even in terms of time complexity, it is still O(n)

Only when the item is absent does the ordered list save some comparisons; it does not change the order of magnitude

2. Binary search

  • Start by comparing against the middle item of the list
    If the middle item matches the search item, the search ends
    If not, there are two cases:
    • if the middle item is larger than the search item, the search item can only be in the first half
    • if the middle item is smaller than the search item, the search item can only be in the second half
    Either way, the range to compare shrinks to half of the original: n/2
    Searching continues the same way, halving the comparison range each time
    This search method is binary search

  • Binary search: Code

def binarySearch(alist, item):
    first = 0
    last = len(alist) - 1
    found = False
    while first <= last and not found:
        midpoint = (first + last) // 2
        if alist[midpoint] == item:  # compare with the middle item
            found = True
        else:
            if item < alist[midpoint]:  # narrow the search range
                last = midpoint - 1
            else:
                first = midpoint + 1
    return found

testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
print(binarySearch(testlist, 3))
print(binarySearch(testlist, 13))

# result 
"""
False
True
"""
# Binary search embodies a typical problem-solving strategy: divide and conquer
# The problem is divided into smaller parts; solving each part and combining the results solves the original problem
# Recursion is a natural fit for divide-and-conquer algorithms, so binary search can also be written recursively

def binarySearch(alist, item):
    if len(alist) == 0:
        return False
    else:
        midpoint = len(alist) // 2
        if alist[midpoint] == item:
            return True
        else:
            if item < alist[midpoint]:
                return binarySearch(alist[:midpoint], item)
            else:
                return binarySearch(alist[midpoint+1:], item)

testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
print(binarySearch(testlist, 3))
print(binarySearch(testlist, 13))

# result 
"""
False
True
"""
  • Analysis of the binary search algorithm
In binary search, each comparison halves the range for the next step
 Once enough comparisons have been made, only one item remains in the range,
 and whether or not it matches the search item, the comparison ends there
The number of comparisons i therefore satisfies n/2^i = 1, giving i = log2(n)
So the complexity of binary search is O(log n)
Although counting comparisons gives binary search a complexity of O(log n),
there is one more factor to note in the recursive version:
binarySearch(alist[:midpoint], item)
This recursive call uses list slicing, and slicing costs O(k), which slightly increases the real running time of the algorithm;
the slicing is there for readability. In fact we can avoid it entirely:
just pass the start and end indices instead, removing the slicing overhead, as in the sketch below.
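A minimal sketch of that index-based variant (binarySearchIdx is an illustrative name, not part of the original code):

def binarySearchIdx(alist, item, first=0, last=None):
    # Pass index bounds instead of slicing, so each call adds only O(1) overhead
    if last is None:
        last = len(alist) - 1
    if first > last:  # empty range: item not found
        return False
    midpoint = (first + last) // 2
    if alist[midpoint] == item:
        return True
    elif item < alist[midpoint]:
        return binarySearchIdx(alist, item, first, midpoint - 1)
    else:
        return binarySearchIdx(alist, item, midpoint + 1, last)

testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
print(binarySearchIdx(testlist, 3))   # False
print(binarySearchIdx(testlist, 13))  # True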

In addition, although binary search beats sequential search in time complexity,
 we must also consider the cost of sorting the data items first.
If one sort enables many subsequent searches, the sorting cost is amortized across them.
 But if the data set changes frequently and searches are relatively rare, an unordered list with sequential search may be more economical.

So when choosing an algorithm, the relative time complexities alone are not enough; the actual application must also be considered.

3. Hash

  • Some basic concepts of hashing
Previously, we exploited knowledge of how items are arranged in the data set to improve search:
 if the items are sorted by value, binary search reduces the complexity of the algorithm.


Now we go further and construct a new data structure that can reduce the complexity of search to O(1).
This concept is called "hashing".

To bring the number of comparisons down to constant level, we need more prior knowledge of where items are located.


If we know in advance where an item should appear in the data set, we can go straight to that location and check whether it is there.

A hash algorithm maps a binary value of arbitrary length to a short, fixed-length binary value called the hash value. The hash value is a unique
 and extremely compact numerical representation of a piece of data: if you hash a plaintext and then change even one letter, the next hash will produce a different value. Finding
 two different inputs that hash to the same value is computationally infeasible, so hash values can be used to check data integrity. Hashing is generally used for fast search and in
 cryptography.

A hash table is a data collection whose items are stored in a way that makes them especially quick to search and locate later.

Each storage location in a hash table is called a slot; it can hold a data item, and every slot has a unique name.

The function that maps a data item to the name of its storage slot is called the hash function.

The proportion of slots occupied by data items is called the "load factor" of the hash table.
  • Example (for a deeper understanding of hashing)
For example, consider a hash table with 11 slots, named 0 to 10.
 Before any item is inserted, every slot holds None, indicating an empty slot.
In the example below, the hash function takes a data item as its parameter and returns an integer from 0 to 10, the number (name) of the slot where the item is stored.
 To store the data items in the hash table, we design our first hash function.
    Data items: 54, 26, 93, 17, 77, 31
 A common hashing method is the "remainder" method: divide the item by the size of the hash table and use the remainder as the slot number.

In fact, the remainder method appears in some form in virtually all hash functions,
because the slot number returned by the hash function must lie within the table's size range; hash functions therefore generally take a remainder by the table size.
 In this example our hash function is the simplest remainder: h(item) = item % 11
 After computing the storage location of each item with h(item), the items can be stored in the corresponding slots:

Item    HashValue
54      10
26      4
93      5
17      6
77      0
31      9
 After the 6 items in the example are inserted, they occupy 6 of the 11 slots in the hash table
 The proportion of occupied slots is the table's "load factor", here 6/11

After all the items are stored in the hash table, searching is very simple:
 to find out whether an item is in the table, we apply the same hash function to the search item and test whether the slot with the returned number
 holds it. This achieves a search algorithm with O(1) time complexity.

However, you may already see the problem with this scheme: this group of items just happens to occupy distinct slots.

If we want to store 44, then h(44)=0, assigning it to the same slot 0 as 77. This situation is called a
"collision"; we will discuss how to resolve it later.
  • Perfect hash functions
❖ Given a set of data items, a hash function that maps every item to a different slot is called a "perfect hash function"
❖ For a fixed set of data, we can always find a way to design a perfect hash function
❖ But if the data items change frequently, there is no systematic method for designing a corresponding perfect hash function
❖ One way to obtain a perfect hash function is to enlarge the hash table until every possible item gets its own slot
❖ However, this is impractical when the range of possible items is too large
(To store 11-digit mobile phone numbers, a perfect hash function would require a table with 100 billion slots! That wastes far too much storage space)
❖ Short of perfection, a good hash function should minimize collisions (be approximately perfect), be cheap to compute (low overhead), and spread items evenly (use space efficiently)
  • More uses for perfect hash functions
❖ Besides arranging the storage locations of items in a hash table, hashing is used in many areas of information processing
❖ Because a perfect hash function yields a different hash value for any different data, the hash value can be treated as a "fingerprint" or "digest" of the data; this property is widely used for data consistency verification
    Generating a fixed-length "fingerprint" from arbitrary-length data while guaranteeing uniqueness is mathematically impossible, but ingeniously designed "quasi-perfect"
    hash functions achieve it in a practical sense

❖ A data-fingerprint function used for consistency verification needs the following properties:
    Compression: for data of any length, the fingerprint length is fixed;
    Computability: the fingerprint is easy to compute from the original data (while computing the original data from the fingerprint is infeasible);
    Modification resistance: small changes to the original data cause large changes in the fingerprint;
    Collision resistance: given original data and its fingerprint, it is very difficult to forge other data with the same fingerprint

  • Python hash library hashlib
# Python's built-in library of MD5 and SHA series hash functions: hashlib
# It includes six hash functions: md5 / sha1 / sha224 / sha256 / sha384 / sha512

import hashlib
print(hashlib.md5("hello world!".encode("utf8")).hexdigest())
print(hashlib.md5("hello world".encode("utf8")).hexdigest())
print(hashlib.sha1("hello world!".encode("utf8")).hexdigest())

m = hashlib.md5()
m.update("hello world!".encode("utf8"))

print(m.hexdigest())



# result
"""
fc3ff98e8c6a0d3087d515c0473f8677
5eb63bbbe01eeed093cb22bb8f5acdc3
430ce34d020724ed75a196dfc2ad67c77772d169
fc3ff98e8c6a0d3087d515c0473f8677
"""
  • Quasi-perfect hash functions for data consistency verification
❖ Judging whether data files are identical
❖ Compute a hash value for each file; comparing only the hash values tells whether the file contents are the same
❖ Used to verify the integrity of network file downloads
❖ Used in file-sharing systems: identical files (especially movies) on a network drive need not be stored multiple times
❖ Saving passwords in encrypted form
❖ Only the hash value of the password is saved; after the user enters a password, its hash value is computed and compared
❖ There is no need to save the plaintext of the password to judge whether the user entered the correct one
(Computing the hash value from data is simple, but the reverse is infeasible)
  • Hash function design: the folding method
❖ The basic steps of designing a hash function by folding are:
    split the data item into segments of equal digit length,
    add the segments together,
    and finally take the remainder by the hash table size to obtain the hash value
❖ For example, for the phone number 62767255:
    split it into 4 two-digit segments (62, 76, 72, 55)
    add them: 62 + 76 + 72 + 55 = 265
    with a table of 11 slots: 265 % 11 = 1
    therefore h(62767255) = 1
❖ Sometimes the folding method also includes a step that reverses alternate segments
    For example, (62, 76, 72, 55) with alternate segments reversed becomes (62, 67, 72, 55)
    Re-accumulate: 62 + 67 + 72 + 55 = 256
    Remainder by 11: 256 % 11 = 3, therefore h'(62767255) = 3
❖ Although reversing alternate segments is not theoretically necessary, it gives the folding method a fine-tuning knob for better hash behavior; a sketch follows
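A minimal sketch of the folding method for this example (fold_hash is an illustrative name):

def fold_hash(number, tablesize=11, seglen=2, reverse_alternate=False):
    digits = str(number)
    # split the digits into fixed-length segments
    segments = [digits[i:i+seglen] for i in range(0, len(digits), seglen)]
    if reverse_alternate:  # the optional reversal of alternate segments
        segments = [s[::-1] if i % 2 == 1 else s
                    for i, s in enumerate(segments)]
    # sum the segments and take the remainder by the table size
    return sum(int(s) for s in segments) % tablesize

print(fold_hash(62767255))                          # (62+76+72+55) % 11 == 1
print(fold_hash(62767255, reverse_alternate=True))  # (62+67+72+55) % 11 == 3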

  • Hash function design: the mid-square method
❖ The mid-square method first squares the item, then takes the middle digits of the square, and finally takes the remainder by the hash table size
❖ For example, to hash 44 (see the sketch below):
    first 44 * 44 = 1936
    then take the middle two digits: 93
    remainder by the table size 11: 93 % 11 = 5
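A sketch of the mid-square method (mid_square_hash is an illustrative name; the middle-two-digits extraction assumes the square has an even number of digits, as in this example):

def mid_square_hash(item, tablesize=11):
    squared = str(item * item)              # 44 * 44 = 1936
    mid = len(squared) // 2
    middle_two = int(squared[mid-1:mid+1])  # '93' for 1936
    return middle_two % tablesize

print(mid_square_hash(44))  # 93 % 11 == 5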

  • Hash function design: non-numeric items
"""
❖ We can also hash non-numeric items by treating each character of a string as its ASCII code
    e.g. for 'cat': ord('c') == 99, ord('a') == 97, ord('t') == 116
❖ Then accumulate these integers and take the remainder by the hash table size
"""
def hash(astring, tablesize):
    total = 0
    for pos in range(len(astring)):
        total = total + ord(astring[pos])

    return total % tablesize

# The ord function converts a character to its ASCII code
# Note that such a hash function returns the same value for all anagrams of a word
# To prevent this, multiply each character's ord value by its position in the string as a weighting factor, as sketched below
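A sketch of that position-weighted variant (weighted_hash is an illustrative name):

def weighted_hash(astring, tablesize):
    total = 0
    for pos in range(len(astring)):
        # weight each character's ASCII value by its (1-based) position
        total = total + (pos + 1) * ord(astring[pos])
    return total % tablesize

print(weighted_hash('cat', 11), weighted_hash('act', 11))  # anagrams now hash differently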
  • Hash function design: general principles
❖ We could devise many more hashing methods, but one basic principle to stick to: the hash function must not become a computational burden on the storage and search procedures
❖ If the hash function is so complex that computing the slot number costs substantial computing resources, hashing loses its point
    It might then be better to simply do a sequential or binary search

  • Conflict resolution
❖ If two items hash to the same slot, a systematic method is needed to store the second item in the table; this process is called "conflict resolution"
❖ As noted earlier, a perfect hash function would have no collisions, but perfection is usually unrealistic
❖ Resolving collisions is therefore a very important part of hashing
❖ One way to resolve a collision is to find another open slot for the conflicting item
    The simplest approach scans forward from the conflicting slot until an empty slot is found
    If the end of the table is reached without success, scanning continues from the head
❖ This technique of looking for empty slots is called "open addressing"
❖ The slot-by-slot forward search is the open-addressing variant known as "linear probing"
  • Conflict resolution: linear probing
We insert 44, 55 and 20 into the hash table one after another:
    h(44)=0, but slot 0 is occupied by 77; scanning forward finds the first empty slot, 1, where 44 is saved
    h(55)=0, and slot 0 is still occupied; scanning forward finds the first empty slot, 2, where 55 is saved
    h(20)=9, but slot 9 is occupied by 31; scanning wraps around to the head and finds slot 3, where 20 is saved

If linear probing is used to resolve collisions, hash search must follow the same rule:
    if the search item is not found at its hash position, search forward in order
    until the item is found or an empty slot is reached (search fails).
  • Conflict resolution: improving linear probing
❖ One drawback of linear probing is a tendency toward clustering
❖ That is, if many items collide at the same slot, they pile up in the slots near it
❖ which in turn disrupts the insertion of other items in a chain reaction
❖ One way to avoid clustering is to generalize linear probing from slot-by-slot to skip probing
    For example, "+3" probing inserts 44, 55, 20 by stepping 3 slots at a time

  • Conflict resolution: rehashing
❖ The process of finding another empty slot can be generalized as "rehashing":
    newhashvalue = rehash(oldhashvalue)
    For linear probing: rehash(pos) = (pos + 1) % sizeoftable
    For "+3" skip probing: rehash(pos) = (pos + 3) % sizeoftable
    In general, for skip probing: rehash(pos) = (pos + skip) % sizeoftable

❖ With skip probing, note that the skip value must not evenly divide the hash table size, otherwise the probe sequence cycles and many empty slots are never examined
    One trick is to make the table size a prime number, as with 11 in the example

❖ Linear probing can also be changed to "quadratic probing"
❖ Instead of a fixed skip, the skip grows: 1, 3, 5, 7, 9, ...
❖ The slot numbers probed are then the original hash value plus successive squares: h, h+1, h+4, h+9, h+16, ... (a small sketch of these variants follows)
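The rehash variants above, as a small sketch (the function names are illustrative):

def rehash_linear(pos, sizeoftable):
    return (pos + 1) % sizeoftable     # linear probing

def rehash_skip(pos, sizeoftable, skip=3):
    return (pos + skip) % sizeoftable  # "+3" (or general skip) probing

def probe_quadratic(h, i, sizeoftable):
    return (h + i * i) % sizeoftable   # quadratic probing: h, h+1, h+4, h+9, ...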

  • Conflict resolution: chaining
❖ Besides the open-addressing technique of finding empty slots, the other approach to hash collisions is to let each slot hold not a single item but a collection of items (or a reference to a chain, i.e. a linked list, of items)
❖ Each slot in the hash table can then hold several items; on a collision, the new item is simply added to the slot's collection
❖ When searching for an item, the whole collection in its slot must be examined; naturally, as collisions increase, search time grows with them. A minimal sketch follows.
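A minimal sketch of chaining (ChainedHashTable is an illustrative class; the ADT Map implementation later in this article uses open addressing instead), with each slot holding a Python list of (key, data) pairs:

class ChainedHashTable:
    def __init__(self, size=11):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one chain per slot

    def put(self, key, data):
        chain = self.slots[key % self.size]
        for i, (k, _) in enumerate(chain):
            if k == key:               # key already present: replace the value
                chain[i] = (key, data)
                return
        chain.append((key, data))      # empty slot or collision: extend the chain

    def get(self, key):
        for k, d in self.slots[key % self.size]:
            if k == key:
                return d
        return None                    # key absent

t = ChainedHashTable()
t.put(77, 'bird')
t.put(44, 'goat')  # 77 and 44 both hash to slot 0; the chain holds both
print(t.get(44))   # goat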

Mapping abstract data types and Python implementation

  • Abstract data type mapping: ADT Map
❖ One of Python's most useful data types is the dictionary
❖ A dictionary is a data type that stores key-data pairs
    where the key can be used to look up the associated data value
❖ This method of key-value association is called a "map"
❖ The structure of ADT Map is an unordered collection of key-value associations
    Keys are unique
    A key uniquely determines a data value

Implementing ADT Map
❖ The advantage of a dictionary is that, given a key, the associated data value can be obtained quickly
❖ To achieve fast lookup, the ADT implementation needs to support efficient search
    A list plus sequential or binary search would work, but the hash table described above is more suitable, achieving O(1) search performance
  • Operations defined by ADT Map
Map(): create an empty map and return the empty map object;
put(key, val): add the key-val pair to the map; if key already exists, replace the old value with val;
get(key): given key, return the associated data value, or None if key does not exist;
del: delete the key-val pair with the statement form del map[key];
len(): return the number of key-val pairs in the map;
in: the statement form key in map returns a Boolean, whether key exists in the map
  • Implementation example of ADT Map
# We implement ADT Map with a HashTable class containing two lists as members:
#     a slots list to hold the keys
#     a parallel data list to hold the data items
# After locating a key in the slots list, the item at the same position in the data list is the associated data

class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size
        self.data = [None] * self.size

    def hashfunction(self, key):
        return key % self.size

    def rehash(self, oldhash):
        return (oldhash + 1) % self.size

    def put(self, key, data):
        hashvalue = self.hashfunction(key)

        if self.slots[hashvalue] is None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  # replace
            else:
                nextslot = self.rehash(hashvalue)
                # linear probing: advance until an empty slot or the key itself
                while self.slots[nextslot] is not None and self.slots[nextslot] != key:
                    nextslot = self.rehash(nextslot)
                if self.slots[nextslot] is None:
                    self.slots[nextslot] = key
                    self.data[nextslot] = data
                else:
                    self.data[nextslot] = data  # replace

    def get(self, key):
        startslot = self.hashfunction(key)  # the hash value is the starting point of the search
        data = None
        stop = False
        found = False
        position = startslot
        while self.slots[position] is not None and not found and not stop:
            # look for the key until an empty slot or back at the starting point
            if self.slots[position] == key:
                found = True
                data = self.data[position]
            else:
                position = self.rehash(position)  # key not found here: rehash and keep looking
            if position == startslot:  # back at the starting point: stop
                stop = True
        return data

    # [] access through special methods
    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)


H = HashTable()
H[54] = "cat"
H[26] = "dog"
H[93] = "lion"
H[17] = "tiger"
H[77] = "bird"
H[31] = "cow"
H[44] = "goat"
H[55] = "pig"
H[20] = "chicken"
print(H.slots)
print(H.data)
print(H[20])
print(H[17])
H[20] = 'duck'
print(H[20])
print(H[99])

# result
"""
[77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
['bird', 'goat', 'pig', 'chicken', 'dog', 'lion', 'tiger', None, None, 'cow', 'cat']
chicken
tiger
duck
None
"""
  • Hash algorithm analysis
❖ In the best case, hashing provides O(1), constant-time search performance
    Because of collisions, the number of comparisons is usually not that simple
❖ The most important factor in evaluating collisions is the load factor λ
    In general:
    if λ is small, the probability of collision is low, and items are usually stored in their own hash slots
    if λ is large, the table is filling up and collisions multiply, so conflict resolution gets more involved and more comparisons are needed to find an empty slot; with chaining, it means more items on each chain
❖ With open addressing and linear probing (λ between 0 and 1):
    successful search, average comparisons: 0.5 * (1 + 1/(1-λ))
    unsuccessful search, average comparisons: 0.5 * (1 + (1/(1-λ))^2)
❖ With chaining (λ can exceed 1):
    successful search, average comparisons: 1 + λ/2
    unsuccessful search, average comparisons: λ
A small computation of these estimates follows.
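As a small illustration (a sketch; the numbers follow directly from the formulas above):

# Average comparisons predicted by the formulas above, for a few load factors
for lam in (0.25, 0.5, 0.75, 0.9):
    open_ok = 0.5 * (1 + 1 / (1 - lam))         # open addressing, successful
    open_fail = 0.5 * (1 + 1 / (1 - lam) ** 2)  # open addressing, unsuccessful
    chain_ok = 1 + lam / 2                      # chaining, successful
    print(f"λ={lam}: open {open_ok:.2f} / {open_fail:.2f}, chaining {chain_ok:.2f}")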

Sort

1. Bubble Sort

# code
def bubbleSort(alist):

    for passnum in range(len(alist)-1, 0, -1):
        for i in range(passnum):
            if alist[i] > alist[i+1]:
                temp = alist[i]
                alist[i] = alist[i+1]
                alist[i+1] = temp

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
bubbleSort(alist)
print(alist)

# result
# [17, 20, 26, 31, 44, 54, 55, 77, 93]

# Bubble sort: algorithm analysis
# The number of comparisons is the sum 1 + 2 + ... + (n-1) = 1/2 * n * (n-1)
# so the comparison count has time complexity O(n^2)

"""
❖ The time complexity also depends on the number of exchanges, likewise O(n^2); each exchange normally takes 3 assignments
❖ The best case is a list already in order before sorting: 0 exchanges
❖ The worst case is an exchange on every comparison: as many exchanges as comparisons
❖ The average is half the worst case
❖ Bubble sort, with its poor time efficiency, is often used as a benchmark against which other algorithms are measured.
❖ Its efficiency is poor mainly because every item must be compared and exchanged many times before reaching its final position, and most of those operations are wasted.
❖ One advantage, though: it requires no additional storage.
"""

# Performance improvement

"""
❖ By monitoring whether any exchange occurs during a pass, we can determine early that the sort is complete
❖ This is something most other sorting algorithms cannot do
❖ If a pass makes no exchanges, the list is already in order and the algorithm can end early
"""
def shortBubbleSort(alist):
    exchanges = True
    passnum = len(alist) - 1
    while passnum > 0 and exchanges:
        exchanges = False
        for i in range(passnum):
            if alist[i] > alist[i+1]:
                exchanges = True
                temp = alist[i]
                alist[i] = alist[i+1]
                alist[i+1] = temp
        passnum = passnum - 1

alist = [20, 30, 40, 90, 50, 60, 70, 80, 100, 110]
shortBubbleSort(alist)
print(alist)

# result
# [20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

2. Selection Sort

"""
❖ Selection sort improves on bubble sort: it keeps the basic multi-pass comparison idea and moves the current largest item into place on each pass.
❖ But selection sort cuts down on exchanges: where bubble sort exchanges many times per pass, selection sort exchanges only once, recording the position of the largest item and swapping it with the last item of the pass at the end
❖ The time complexity of selection sort is slightly better than bubble sort's:
    the number of comparisons is unchanged, still O(n^2)
    the number of exchanges is reduced to O(n)
"""
def selectionSort(alist):
    for fillslot in range(len(alist)-1, 0, -1):
        positionOfMax = 0
        for location in range(1, fillslot+1):
            if alist[location] > alist[positionOfMax]:
                positionOfMax = location

        temp = alist[fillslot]
        alist[fillslot] = alist[positionOfMax]
        alist[positionOfMax] = temp

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
selectionSort(alist)
print(alist)
print(alist)

# result
# [17, 20, 26, 31, 44, 54, 55, 77, 93]

3. Insertion Sort

"""
❖ On the first pass, the sublist contains only the first item; the second item is inserted into its proper position as a "new item", so the sorted sublist then holds two items
❖ On the second pass, the third item is compared with the first two; items larger than it are shifted to make room for it in the sublist
❖ After n-1 passes of comparing and inserting, the sublist has grown to the whole list and the sort is complete

❖ The comparisons in insertion sort are mainly used to find the insertion position of the "new item"
❖ The worst case is when every pass compares against all items in the sublist; the total comparison count matches bubble sort, still O(n^2)
❖ The best case is a list already in order: each pass needs only one comparison, for a total of O(n)
"""
def insertionSort(alist):
    for index in range(1, len(alist)):

        currentvalue = alist[index]  # the new item to insert
        position = index

        while position > 0 and alist[position-1] > currentvalue:
            alist[position] = alist[position-1]
            position = position - 1  # compare and shift

        alist[position] = currentvalue  # insert the new item


alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
insertionSort(alist)
print(alist)

# result
# [17, 20, 26, 31, 44, 54, 55, 77, 93]

4. Shell Sort

  • We noted that insertion sort makes O(n) comparisons in the best case, which happens when the list is already ordered. In fact, the closer the list is to ordered, the fewer comparisons insertion sort makes

  • Building on this observation, Shell sort uses insertion sort as its base: it divides the unordered list into sublists at some gap and insertion-sorts each sublist

  • As the gap shrinks and the number of sublists decreases, the list as a whole gets closer and closer to ordered, reducing the total number of comparisons

  • For example, with sublists at a gap of 3, the list as a whole is much closer to ordered after the sublists are insertion-sorted

  • The final pass is a standard insertion sort, but because the earlier passes have brought the list near order, this pass needs only a few shifts to finish

  • The gap typically starts at n/2 and is halved on each pass: n/4, n/8, ... down to 1

def shellSort(alist):
    sublistcount = len(alist) // 2  # initial gap
    while sublistcount > 0:

        for startposition in range(sublistcount):  # sort each sublist
            gapInsertionSort(alist, startposition, sublistcount)

        print("After increments of size", sublistcount, "The list is", alist)
        sublistcount = sublistcount // 2  # shrink the gap

def gapInsertionSort(alist, start, gap):
    for i in range(start+gap, len(alist), gap):

        currentvalue = alist[i]
        position = i

        while position >= gap and alist[position-gap] > currentvalue:
            alist[position] = alist[position-gap]
            position = position - gap

        alist[position] = currentvalue

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
shellSort(alist)

# result
"""
After increments of size 4 The list is [20, 26, 44, 17, 54, 31, 93, 55, 77]
After increments of size 2 The list is [20, 17, 44, 26, 54, 31, 77, 55, 93]
After increments of size 1 The list is [17, 20, 26, 31, 44, 54, 55, 77, 93]
"""
  • Roughly speaking, Shell sort is built on insertion sort and might not seem any better than it
  • But because each pass brings the list closer to order, the process eliminates many of the "wasted" comparisons that would otherwise be needed
    A detailed analysis of Shell sort is fairly involved; its complexity falls roughly between O(n) and O(n^2)
  • If the gaps are kept to the form 2^k - 1 (1, 3, 7, 15, 31, and so on), the time complexity of Shell sort is about O(n^(3/2))

5. Merge Sort

  • An application of the divide-and-conquer strategy to sorting
  • Merge sort is a recursive algorithm: keep splitting the list in half, sort each half, and merge the sorted halves
    Basic end condition of the recursion: the list has only one data item, which is sorted by definition;
    Reduce the scale: split the list into two equal halves, halving the problem size;
    Call itself: sort the two halves separately, then merge the two sorted halves
def mergeSort(alist):
    if len(alist) > 1:  # basic end condition
        mid = len(alist) // 2
        lefthalf = alist[:mid]
        righthalf = alist[mid:]

        mergeSort(lefthalf)  # recursive calls
        mergeSort(righthalf)

        i = j = k = 0
        while i < len(lefthalf) and j < len(righthalf):  # zipper-merge the two halves into the result, smallest first
            if lefthalf[i] < righthalf[j]:
                alist[k] = lefthalf[i]
                i = i + 1
            else:
                alist[k] = righthalf[j]
                j = j + 1
            k = k + 1

        while i < len(lefthalf):  # copy over what remains of the left half
            alist[k] = lefthalf[i]
            i = i + 1
            k = k + 1
        while j < len(righthalf):  # copy over what remains of the right half
            alist[k] = righthalf[j]
            j = j + 1
            k = k + 1


alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
mergeSort(alist)
print(alist)


# result
# [17, 20, 26, 31, 44, 54, 55, 77, 93]

# Another merge sort implementation
def merge_sort(lst):
    # recursion end condition
    if len(lst) <= 1:
        return lst

    # decompose the problem and recurse
    middle = len(lst) // 2
    left = merge_sort(lst[:middle])   # sort the left half
    right = merge_sort(lst[middle:])  # sort the right half

    # merge the two sorted halves to finish
    merged = []
    while left and right:
        if left[0] <= right[0]:
            merged.append(left.pop(0))
        else:
            merged.append(right.pop(0))
    merged.extend(right if right else left)
    return merged


alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
print(merge_sort(alist))

# result
# [17, 20, 26, 31, 44, 54, 55, 77, 93]
  • Merge sort: algorithm analysis
❖ Merge sort has two processes: splitting and merging
❖ The splitting process, by the same analysis as binary search, is logarithmic: O(log n) levels
❖ The merging process places every item once at each level of splitting, so each level is linear: O(n)
    Combined, with O(n) merging done at each of the O(log n) levels, the total time complexity is O(n log n)

❖ Finally, note that for a strictly accurate time-complexity analysis, the two slicing operations can be eliminated:
    passing the start and end indices of the two halves works just as well, at a slight cost in readability (see the sketch below).
❖ Note also that merge sort uses extra storage proportional to the list size for merging, which must be considered when sorting large data sets
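A sketch of the index-based variant described above (merge_sort_idx is an illustrative name): the recursion passes (left, right) bounds instead of slicing, though merging still uses a temporary buffer, so the O(n) extra space remains.

def merge_sort_idx(alist, left=0, right=None):
    if right is None:
        right = len(alist) - 1
    if left >= right:  # zero or one item: already sorted
        return
    mid = (left + right) // 2
    merge_sort_idx(alist, left, mid)       # sort the left run in place
    merge_sort_idx(alist, mid + 1, right)  # sort the right run in place

    merged = []
    i, j = left, mid + 1
    while i <= mid and j <= right:  # zipper-merge the two sorted runs
        if alist[i] <= alist[j]:
            merged.append(alist[i]); i += 1
        else:
            merged.append(alist[j]); j += 1
    merged.extend(alist[i:mid + 1])    # leftovers of the left run
    merged.extend(alist[j:right + 1])  # leftovers of the right run
    alist[left:right + 1] = merged     # copy the merged run back in place

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
merge_sort_idx(alist)
print(alist)  # [17, 20, 26, 31, 44, 54, 55, 77, 93]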

6. Quick Sort

  • The idea of quick sort is to partition the list into two halves around a "pivot" value: the half less than the pivot and the half greater than it, and then quick-sort each part recursively
    For the two halves to hold equal numbers of items, the pivot would have to be the median of the list
    But finding the true median itself costs computation! To pick one at no cost, we can only grab some value as the pivot,
    for example the first item.
  • The "three elements" of the quick sort recursion:
❖ Basic end condition: the list has only one data item, which is sorted by definition
❖ Reduce the scale: partition the list into two halves around the pivot, preferably two halves of equal size
❖ Call itself: sort the two halves separately
  (partitioning is the basic operation of the sort)
  • code
"""
❖ Goal of partitioning the list: find the position where the pivot value belongs
❖ How the list is partitioned:
    set left and right marks (leftmark/rightmark)
    the left mark moves right and the right mark moves left
        • the left mark moves right until it meets an item larger than the pivot, then stops
        • the right mark moves left until it meets an item smaller than the pivot, then stops
        • then the items at the left and right marks are exchanged
    movement continues until the left mark passes to the right of the right mark, then stops
    at that point the right mark's position is where the pivot belongs
    swap the pivot into that position
    after partitioning, the left half is below the pivot and the right half is above it
"""
def quickSort(alist):
    quickSortHelper(alist, 0, len(alist)-1)

def quickSortHelper(alist, first, last):
    if first < last:  # basic end condition

        splitpoint = partition(alist, first, last)  # partition

        quickSortHelper(alist, first, splitpoint-1)  # recursive calls
        quickSortHelper(alist, splitpoint+1, last)

def partition(alist, first, last):
    pivotvalue = alist[first]  # choose the pivot

    leftmark = first + 1  # initial mark positions
    rightmark = last

    done = False
    while not done:
        while leftmark <= rightmark and alist[leftmark] <= pivotvalue:
            leftmark = leftmark + 1  # move the left mark right

        while alist[rightmark] >= pivotvalue and rightmark >= leftmark:
            rightmark = rightmark - 1  # move the right mark left

        if rightmark < leftmark:  # the marks have crossed: movement ends
            done = True
        else:  # exchange the items at the two marks
            temp = alist[leftmark]
            alist[leftmark] = alist[rightmark]
            alist[rightmark] = temp

    temp = alist[first]  # move the pivot into place
    alist[first] = alist[rightmark]
    alist[rightmark] = temp

    return rightmark  # the pivot position is also the split point


alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
quickSort(alist)
print(alist)

# result
# [17, 20, 26, 31, 44, 54, 55, 77, 93]
  • Quick sort: algorithm analysis

❖ The quick sort process has two parts: partitioning and moving
    If partitioning always splits the list into two equal halves, there are O(log n) levels of splitting;
    and moving compares every item with the pivot, which is O(n)
❖ Combined: O(n log n);
❖ Moreover, the algorithm needs no additional storage while it runs.

❖ But if we are not so lucky, the split point may land far from the middle, leaving the two parts badly unbalanced
❖ In the extreme case one part always receives no data at all, and the time complexity degrades to O(n^2)
    plus the overhead of the recursive calls (worse than bubble sort)

❖ The pivot selection method can be improved to make the pivot more representative
    for example "median of three": take the median of the first, middle, and last items of the list, as sketched below
    This adds some computational overhead, and extreme cases still cannot be ruled out
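A minimal sketch of median-of-three pivot selection (median_of_three is an illustrative helper, not part of the code above); it swaps the chosen pivot to the front so a partition routine like the one above can run unchanged:

def median_of_three(alist, first, last):
    # pick the median of the first, middle, and last items as the pivot
    # and swap it into alist[first]
    mid = (first + last) // 2
    candidates = sorted([(alist[first], first),
                         (alist[mid], mid),
                         (alist[last], last)])
    _, pivot_index = candidates[1]  # index of the median value
    alist[first], alist[pivot_index] = alist[pivot_index], alist[first]

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
median_of_three(alist, 0, len(alist) - 1)
print(alist[0])  # 54, the median of 54 (first), 77 (middle), 20 (last)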

Algorithm complexity summary

  • Sequential search on an unordered or ordered list has time complexity O(n)

  • Binary search on an ordered list has worst-case complexity O(log n)

  • Hash tables can achieve constant-time search

  • Quasi-perfect hash functions are widely used for data consistency verification

  • Bubble, selection, and insertion sort are O(n^2) algorithms

  • Shell sort improves on insertion sort by sorting incrementally gapped sublists; its time complexity falls between O(n) and O(n^2)

  • Merge sort has time complexity O(n log n), but the merging process requires additional storage

  • Quick sort has best-case time complexity O(n log n) and needs no additional storage, but if the split points deviate from the center of the list it degrades to O(n^2) in the worst case
