Python 3 implementation and improvement of Apriori algorithm

Keywords: Python Database Algorithm Data Mining

The code is based on *Machine Learning in Action*. The improvements draw partly on *Data Mining: Concepts and Techniques* and partly on
https://blog.csdn.net/weixin_30702887/article/details/98992919
Here I summarize and implement both, as a record of my study of the Apriori algorithm.

First, generate the candidate 1-itemsets from the transaction database:

def createCDDSet(dataSet):  # build the candidate 1-itemsets C1
    C = []
    for tid in dataSet:
        for item in tid:
            if [item] not in C:
                C.append([item])
    C.sort()
    return list(map(frozenset, C))

frozenset is used here so that candidate and frequent itemsets are hashable (usable as dictionary keys) and cannot be modified.
Then, the frequent itemsets LK are repeatedly filtered out of the transaction database:
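As a quick sanity check, here is what C1 looks like for a few of the sample transactions used later in the post (createCDDSet is restated so the snippet runs on its own):

```python
# Restatement of createCDDSet from above so this demo is self-contained
def createCDDSet(dataSet):
    C = []
    for tid in dataSet:
        for item in tid:
            if [item] not in C:
                C.append([item])
    C.sort()
    return list(map(frozenset, C))

transactions = [['a', 'c', 'e'], ['b', 'd'], ['b', 'c']]
C1 = createCDDSet(transactions)
# C1 holds one frozenset per distinct item:
# [frozenset({'a'}), frozenset({'b'}), frozenset({'c'}), frozenset({'d'}), frozenset({'e'})]
```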

def scanD(dataSet, CK, minsupport):  # accept candidate k-itemsets, output frequent k-itemsets
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                if can not in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItem = float(len(dataSet))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItem
        if support >= minsupport:
            retList.insert(0, key)
        supportData[key] = support  # update the support dictionary
    return retList, supportData
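A small run of scanD on the same three transactions shows the filtering: with minsupport = 0.5, only 'b' and 'c' survive (scanD is restated so the snippet is self-contained):

```python
def scanD(dataSet, CK, minsupport):  # restated from above for a standalone demo
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItem = float(len(dataSet))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItem
        if support >= minsupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

D = [frozenset(t) for t in [['a', 'c', 'e'], ['b', 'd'], ['b', 'c']]]
C1 = [frozenset(i) for i in ('a', 'b', 'c', 'd', 'e')]
L1, supportData = scanD(D, C1, 0.5)
# 'b' and 'c' each appear in 2 of 3 transactions (support 0.67); the rest fall below 0.5
```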

Next, create the candidate itemsets (candidate (k+1)-itemsets are generated by joining frequent k-itemsets):

def aprioriGen(LK, k):  # create candidate k-itemsets CK from frequent (k-1)-itemsets LK
    retList = []
    lenLK = len(LK)
    for i in range(lenLK):
        for j in range(i + 1, lenLK):
            # join two itemsets only if their first k-2 sorted elements match;
            # sorting before slicing avoids the arbitrary iteration order of frozenset
            L1 = sorted(LK[i])[:k - 2]
            L2 = sorted(LK[j])[:k - 2]
            if L1 == L2:
                retList.append(LK[i] | LK[j])
    return retList
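To see why comparing the first k-2 elements works: two frequent (k-1)-itemsets are joined only when they agree everywhere except the last sorted element, so each candidate is generated exactly once. A standalone restatement with a tiny example:

```python
def aprioriGen(LK, k):  # restated from above for a standalone demo
    retList = []
    for i in range(len(LK)):
        for j in range(i + 1, len(LK)):
            # join only when the first k-2 sorted elements agree
            if sorted(LK[i])[:k - 2] == sorted(LK[j])[:k - 2]:
                retList.append(LK[i] | LK[j])
    return retList

L2 = [frozenset({'a', 'b'}), frozenset({'a', 'c'}), frozenset({'b', 'c'})]
C3 = aprioriGen(L2, 3)
# only {'a','b'} and {'a','c'} share the sorted prefix ['a'],
# so the single candidate is {'a','b','c'}
```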

Finally, the main body of the Apriori algorithm:

def apriori(dataSet, minsupport):
    C1 = createCDDSet(dataSet)
    D = set()
    for tid in dataSet:
        D.add(frozenset(tid))  # note: a set collapses duplicate transactions
    L1, supportData = scanD(D, C1, minsupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        CK = aprioriGen(L[k - 2], k)
        LK, supK = scanD(D, CK, minsupport)
        supportData.update(supK)
        L.append(LK)
        k += 1
    L = [i for i in L if i]  # drop empty lists
    return L, supportData
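Putting the three functions together, here is a full run on the sample transactions introduced in the next section (since D is built as a set, duplicate transactions collapse and supports are relative to the number of distinct transactions):

```python
def createCDDSet(dataSet):
    C = []
    for tid in dataSet:
        for item in tid:
            if [item] not in C:
                C.append([item])
    C.sort()
    return list(map(frozenset, C))

def scanD(dataSet, CK, minsupport):
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItem = float(len(dataSet))
    retList, supportData = [], {}
    for key in ssCnt:
        support = ssCnt[key] / numItem
        if support >= minsupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

def aprioriGen(LK, k):
    retList = []
    for i in range(len(LK)):
        for j in range(i + 1, len(LK)):
            if sorted(LK[i])[:k - 2] == sorted(LK[j])[:k - 2]:
                retList.append(LK[i] | LK[j])
    return retList

def apriori(dataSet, minsupport):
    C1 = createCDDSet(dataSet)
    D = {frozenset(tid) for tid in dataSet}  # duplicates collapse here
    L1, supportData = scanD(D, C1, minsupport)
    L, k = [L1], 2
    while len(L[k - 2]) > 0:
        CK = aprioriGen(L[k - 2], k)
        LK, supK = scanD(D, CK, minsupport)
        supportData.update(supK)
        L.append(LK)
        k += 1
    return [lv for lv in L if lv], supportData

dataSet = [['a', 'c', 'e'], ['b', 'd'], ['b', 'c'], ['a', 'b', 'c', 'd'], ['a', 'b'],
           ['b', 'c'], ['a', 'b'], ['a', 'b', 'c', 'e'], ['a', 'b', 'c'], ['a', 'c', 'e']]
L, supportData = apriori(dataSet, 0.2)
# the frequent 3-itemsets are {'a','b','c'} and {'a','c','e'}
```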

At this point, the main body of our Apriori algorithm is complete.
The next step is to mine the association rules.
First comes the data import. The sample I use here is

a,c,e
b,d
b,c
a,b,c,d
a,b
b,c
a,b
a,b,c,e
a,b,c
a,c,e

stored as a txt file, which I read with pandas:

import pandas as pd

name = 'menu_orders.txt'
minsupport = 0.2
minconfidence = 0.5

def createData(name):  # preprocess the data and output the dataset
    # the longest transaction has 4 items, so four column names are needed
    D = pd.read_csv(name, header=None, index_col=False, names=['1', '2', '3', '4'])
    D = D.fillna(0)
    D = D.values.tolist()
    for i in range(len(D)):
        D[i] = [j for j in D[i] if j != 0]  # drop the NaN padding filled with 0
    return D
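To try createData without writing a file, pandas can read from an in-memory buffer as well; here io.StringIO stands in for menu_orders.txt and runs the same preprocessing on the sample above:

```python
import io
import pandas as pd

sample = "a,c,e\nb,d\nb,c\na,b,c,d\na,b\nb,c\na,b\na,b,c,e\na,b,c\na,c,e\n"

def createData(src):  # same preprocessing as above; src may be a path or a buffer
    D = pd.read_csv(src, header=None, index_col=False, names=['1', '2', '3', '4'])
    D = D.fillna(0)
    D = D.values.tolist()
    return [[j for j in row if j != 0] for row in D]

dataset = createData(io.StringIO(sample))
# ten transactions, with the NaN padding removed
```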

Then compute the confidence and build the association rules; a rule is output only when its confidence reaches the threshold.

def calculate(dataset):  # compute support and confidence for each rule
    # Apriori_self is the module containing the apriori() implementation above
    dataset, dic = Apriori_self.apriori(dataset, minsupport)

    Rname = []
    Rsupport = []
    Rconfidence = []
    seen = []  # smaller frequent itemsets seen so far, used as rule consequents
    for i in range(len(dataset)):
        for AB in dataset[i]:
            for A in seen:
                if A.issubset(AB):
                    conf = dic.get(AB) / dic.get(AB - A)  # conf(AB-A => A)
                    if conf >= minconfidence:
                        Rname.append(str(AB - A) + '-->' + str(A))
                        Rconfidence.append(conf)
                        Rsupport.append(dic.get(AB))
            seen.append(AB)
    return Rname, Rsupport, Rconfidence

Here, set subtraction is used to form the association rules, which avoids the trouble of enumerating every free combination of items from the frequent itemsets.
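The set-subtraction trick in one line: for a frequent itemset AB and a previously seen subset A, the rule is (AB - A) --> A with confidence support(AB) / support(AB - A). A tiny worked example, with made-up support values chosen only for illustration:

```python
# support values here are illustrative, not taken from the article's run
sup = {
    frozenset({'a'}): 0.5,
    frozenset({'e'}): 0.2,
    frozenset({'a', 'e'}): 0.2,
}
AB = frozenset({'a', 'e'})
A = frozenset({'e'})
antecedent = AB - A               # frozenset({'a'})
conf = sup[AB] / sup[antecedent]  # 0.2 / 0.5 = 0.4, i.e. the rule {'a'} --> {'e'}
```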
Finally, the output is assembled and written to a txt file:

def outputdata(Rname, Rsupport, Rconfidence):
    data = {
        "Association rules": Rname,
        "Support": Rsupport,
        "Confidence": Rconfidence
    }
    df = pd.DataFrame(data, columns=['Association rules', 'Support', 'Confidence'])
    return df

dataset = createData(name)
R1, R2, R3 = calculate(dataset)
df = outputdata(R1, R2, R3)
df.to_csv('Report.txt')

Let's look at the final result

,Association rules,Support,Confidence
0,frozenset({'e'})-->frozenset({'a'}),0.2,1.0
1,frozenset({'e'})-->frozenset({'c'}),0.2,1.0
2,frozenset({'d'})-->frozenset({'b'}),0.2,1.0
3,frozenset({'a'})-->frozenset({'b'}),0.4,0.6666666666666667
4,frozenset({'b'})-->frozenset({'a'}),0.4,0.5
5,frozenset({'a'})-->frozenset({'c'}),0.4,0.6666666666666667
6,frozenset({'c'})-->frozenset({'a'}),0.4,0.6666666666666667
7,frozenset({'b'})-->frozenset({'c'}),0.4,0.5
8,frozenset({'c'})-->frozenset({'b'}),0.4,0.6666666666666667
9,"frozenset({'a', 'c'})-->frozenset({'e'})",0.2,0.5
10,"frozenset({'a', 'e'})-->frozenset({'c'})",0.2,1.0
11,"frozenset({'c', 'e'})-->frozenset({'a'})",0.2,1.0
12,"frozenset({'e'})-->frozenset({'a', 'c'})",0.2,1.0
13,"frozenset({'a', 'b'})-->frozenset({'c'})",0.2,0.5
14,"frozenset({'a', 'c'})-->frozenset({'b'})",0.2,0.5
15,"frozenset({'b', 'c'})-->frozenset({'a'})",0.2,0.5

Optimization of the Apriori algorithm
1: Itemsets already known to be infrequent can be deleted directly from the transaction database, which shrinks later scans and reduces I/O overhead.
2: Within the frequent k-itemsets, if a single item i appears in fewer than k of them, then no itemset containing i can appear in the frequent (k+1)-itemsets; itemsets containing such an i can be removed before the join step.
The optimized complete code is

def createCDDSet(dataSet):
    C=[ ]
    for tid in dataSet:
        for item in tid:
            if not [item] in C:
                C.append([item])
    C.sort()
    return list(map(frozenset,C))

def scanD(dataSet, CK, minsupport, numItem, k=0):  # accept candidate k-itemsets, output frequent k-itemsets
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                if can not in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    # numItem is now computed once in apriori(): recomputing len(dataSet) here
    # would change as transactions are deleted below
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItem
        if support >= minsupport:
            retList.insert(0, key)
        else:
            # Improvement 1: delete transactions equal to an infrequent itemset,
            # compressing the transaction database for later scans. discard()
            # avoids mutating the set while iterating over it.
            dataSet.discard(key)
        supportData[key] = support

    R_List = []
    # Improvement 2: within the frequent k-itemsets, if a single item i appears
    # fewer than k times, no frequent (k+1)-itemset can contain i, so itemsets
    # containing i are dropped before the join step (compressing candidate set CK)
    if k > 1:
        ssCnt = {}
        for tid in retList:
            for key in tid:
                ssCnt[key] = ssCnt.get(key, 0) + 1
        tids = []
        for tid in retList:
            for item in tid:
                if ssCnt[item] < k:
                    tids.append(tid)
        R_List = list(set(retList) - set(tids))

    print('Frequent itemsets before optimization ' + str(retList) +
          '    Optimized frequent itemsets ' + str(R_List))
    return retList, supportData, R_List

def aprioriGen(LK, k, RK):  # create candidate itemsets CK, where k is the size of the output itemsets
    if RK:  # use the pruned list from improvement 2 when it is non-empty
        LK = RK
    retList = []
    lenLK = len(LK)
    for i in range(lenLK):
        for j in range(i + 1, lenLK):
            # join only when the first k-2 sorted elements match
            L1 = sorted(LK[i])[:k - 2]
            L2 = sorted(LK[j])[:k - 2]
            if L1 == L2:
                retList.append(LK[i] | LK[j])
    return retList

def apriori(dataSet, minsupport):
    C1 = createCDDSet(dataSet)
    D = set()
    for tid in dataSet:
        D.add(frozenset(tid))
    # numItem is computed exactly once here: scanD deletes elements of D,
    # so len(D) would otherwise shrink between passes
    numItem = float(len(D))
    L1, supportData, R1 = scanD(D, C1, minsupport, numItem)
    L = [L1]
    R = [R1]
    k = 2
    while len(L[k - 2]) > 0:
        CK = aprioriGen(L[k - 2], k, R[k - 2])
        LK, supK, RK = scanD(D, CK, minsupport, numItem, k)
        supportData.update(supK)
        L.append(LK)
        R.append(RK)
        k += 1
    L = [i for i in L if i]  # drop empty lists
    return L, supportData

Time comparison before and after optimization: (the timing screenshots from the original post are not reproduced here)
The frequent 2-itemsets passed into aprioriGen before optimization:

[frozenset({'e', 'a'}), frozenset({'c', 'e'}), frozenset({'c', 'a'}), frozenset({'c', 'b'}), frozenset({'b', 'd'}), frozenset({'b', 'a'})]

and after optimization:

[frozenset({'b', 'a'}), frozenset({'c', 'a'}), frozenset({'c', 'e'}), frozenset({'e', 'a'}), frozenset({'c', 'b'})]

One itemset fewer is passed after optimization: {'b', 'd'} is pruned because 'd' appears in fewer than k = 2 of the frequent 2-itemsets.

Summary

There are many ways to optimize the Apriori algorithm:

  • Use a hash table to store and count itemsets
  • Reduce the number of database scans
  • Use the partition method: first find the locally frequent itemsets in each partition, then verify the global frequent itemsets among them; only two scans of the transaction database are needed
  • Sampling (at the cost of accuracy)
  • Within the frequent k-itemsets, if a single item i appears fewer than k times, no itemset containing i can appear in the frequent (k+1)-itemsets, so such itemsets can be deleted before the join step
  • Dynamic itemset counting
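As an illustration of the partition method from the list above, here is a toy sketch restricted to 2-itemsets (the function names and structure are my own, not from the article). It relies on the fact that any globally frequent itemset must be locally frequent in at least one partition, so the union of the local results is a complete candidate set, and one more full scan verifies it:

```python
from itertools import combinations

def frequent_pairs(transactions, minsupport):
    # one pass: count all 2-itemsets and keep those meeting minsupport
    counts = {}
    for tid in transactions:
        for pair in combinations(sorted(tid), 2):
            key = frozenset(pair)
            counts[key] = counts.get(key, 0) + 1
    n = len(transactions)
    return {p for p, c in counts.items() if c / n >= minsupport}

def partition_frequent_pairs(transactions, minsupport, nparts=2):
    # Scan 1: locally frequent pairs per partition; their union is a
    # complete global candidate set
    size = max(1, len(transactions) // nparts)
    candidates = set()
    for start in range(0, len(transactions), size):
        candidates |= frequent_pairs(transactions[start:start + size], minsupport)
    # Scan 2: count the candidates once over the whole database
    n = len(transactions)
    result = set()
    for cand in candidates:
        cnt = sum(1 for tid in transactions if cand.issubset(tid))
        if cnt / n >= minsupport:
            result.add(cand)
    return result

transactions = [{'a', 'b'}, {'a', 'b'}, {'b', 'c'}, {'a', 'b', 'c'}]
globally_frequent = partition_frequent_pairs(transactions, 0.5)
# {'a','c'} is locally frequent in the second partition but fails the
# global check; {'a','b'} and {'b','c'} survive both scans
```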

Posted by narch31 on Wed, 13 Oct 2021 21:22:17 -0700