Python 3 implementation and improvement of the Apriori algorithm
The code is based on Machine Learning in Action.
The improved methods come partly from Data Mining: Concepts and Techniques, and partly from
https://blog.csdn.net/weixin_30702887/article/details/98992919
I summarize and implement them here as a record of my study of the Apriori algorithm.
First, select the candidate 1-itemsets from the transaction database:
def createCDDSet(dataSet):
    C = []
    for tid in dataSet:
        for item in tid:
            if not [item] in C:
                C.append([item])
    C.sort()
    return list(map(frozenset, C))
Here, frozenset is used so that candidate itemsets and frequent itemsets cannot be modified; being hashable, they can also serve as dictionary keys when counting support.
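For example, a frozenset behaves like a read-only set (a small check I added for illustration, not from the original code):

s = frozenset(['a', 'c'])
# s.add('e') would raise AttributeError: frozensets are immutable
counts = {s: 1}  # hashable, so usable as a dictionary key, unlike a plain set
print(frozenset(['c', 'a']) == s)  # True: equality ignores element order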
Then, the frequent itemsets LK are filtered out of the transaction database, level by level:
def scanD(dataSet, CK, minsupport):  # Accept candidate k-itemsets and output frequent k-itemsets
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                if not can in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItem = float(len(dataSet))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = float(ssCnt[key] / numItem)
        if support >= minsupport:
            retList.insert(0, key)
        supportData[key] = support  # Update the support dictionary
    return retList, supportData
Next, create the candidate itemsets (the candidate (k+1)-itemsets are generated by joining the frequent k-itemsets):
def aprioriGen(LK, k):  # Create the candidate itemset CK, where k is the size of the output itemsets
    retList = []
    lenLK = len(LK)
    for i in range(lenLK):
        for j in range(i + 1, lenLK):
            # If the first k-2 items are equal, the union of the two sets has exactly k items
            L1 = list(LK[i])[:k - 2]
            L2 = list(LK[j])[:k - 2]
            L1.sort()
            L2.sort()
            if L1 == L2:
                retList.append(LK[i] | LK[j])
    return retList
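For instance (a small illustrative call of my own, not from the original post), joining frequent 1-itemsets with k = 2 compares empty prefixes, so every pair is merged:

L1_sets = [frozenset({'a'}), frozenset({'b'}), frozenset({'c'})]
print(aprioriGen(L1_sets, 2))
# prints something like [frozenset({'a', 'b'}), frozenset({'a', 'c'}), frozenset({'b', 'c'})]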
Finally, the main body of Apriori algorithm
def apriori(dataSet, minsupport):
    C1 = createCDDSet(dataSet)
    D = set()
    for tid in dataSet:
        tid = frozenset(tid)
        D.add(tid)
    L1, supportData = scanD(D, C1, minsupport)
    L = [L1]
    k = 2
    while (len(L[k - 2]) > 0):
        CK = aprioriGen(L[k - 2], k)
        LK, supK = scanD(D, CK, minsupport)
        supportData.update(supK)
        L.append(LK)
        k += 1
    L = [i for i in L if i]  # Delete empty lists
    return L, supportData
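As a quick sanity check, here is an illustrative run I added, using the sample transactions introduced below:

dataSet = [['a', 'c', 'e'], ['b', 'd'], ['b', 'c'], ['a', 'b', 'c', 'd'],
           ['a', 'b'], ['b', 'c'], ['a', 'b'], ['a', 'b', 'c', 'e'],
           ['a', 'b', 'c'], ['a', 'c', 'e']]
L, supportData = apriori(dataSet, 0.2)
print(L)  # the frequent itemsets, level by level
# Note that D is built as a set, so duplicate transactions are collapsed before counting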
At this point, the main body of our Apriori algorithm is complete.
The next step is to mine the association rules.
First, the import of the data.
The sample I use here is:
a,c,e
b,d
b,c
a,b,c,d
a,b
b,c
a,b
a,b,c,e
a,b,c
a,c,e
saved as a .txt file.
I read it with pandas:
import pandas as pd

name = 'menu_orders.txt'
minsupport = 0.2
minconfidence = 0.5

def createData(name):  # Preprocess the data and output the dataset
    # The longest transaction has four items, so four column names are needed,
    # otherwise the fourth item of a row would be lost
    D = pd.read_csv(name, header=None, index_col=False, names=['1', '2', '3', '4'])
    D = D.fillna(0)
    D = D.values.tolist()
    for i in range(len(D)):
        D[i] = [j for j in D[i] if j != 0]  # Drop the 0 placeholders left by fillna
    return D
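Called on the sample file, this should give back the transactions as plain lists (an illustrative check of my own):

dataset = createData('menu_orders.txt')
print(dataset)
# expected: [['a', 'c', 'e'], ['b', 'd'], ['b', 'c'], ['a', 'b', 'c', 'd'], ...]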
Then compute the confidence and generate the association rules; a rule A --> B is output only when its confidence, confidence(A --> B) = support(A ∪ B) / support(A), exceeds the threshold.
def calculate(dataset):  # Calculate support and confidence with the algorithm
    # Apriori_self is the module holding the apriori implementation above
    dataset, dic = Apriori_self.apriori(dataset, minsupport)
    Rname = []
    Rsupport = []
    Rconfidence = []
    emptylist = []
    for i in range(len(dataset)):
        for AB in dataset[i]:
            for A in emptylist:
                if A.issubset(AB):
                    conf = dic.get(AB) / dic.get(AB - A)
                    if conf >= minconfidence:
                        Rname.append(str(AB - A) + '-->' + str(A))
                        Rconfidence.append(conf)
                        Rsupport.append(dic.get(AB))
            emptylist.append(AB)
    return Rname, Rsupport, Rconfidence
Here, set subtraction is used to assemble the association rules, which avoids the trouble of enumerating free combinations of the frequent itemsets.
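A minimal illustration of the trick (the support values are taken from the report below):

dic = {frozenset({'a', 'c', 'e'}): 0.2,  # support of {a, c, e}
       frozenset({'a', 'c'}): 0.4}       # support of {a, c}
AB = frozenset({'a', 'c', 'e'})
A = frozenset({'e'})
conf = dic[AB] / dic[AB - A]  # set subtraction gives the antecedent {a, c}
print(str(AB - A) + '-->' + str(A), conf)  # confidence 0.5, matching row 9 of the report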
Finally, output the results and write them to a txt file:
def outputdata(Rname, Rsupport, Rconfidence):
    data = {
        "Association rules": Rname,
        "Support": Rsupport,
        "Confidence": Rconfidence
    }
    df = pd.DataFrame(data, columns=['Association rules', 'Support', 'Confidence'])
    return df

dataset = createData(name)
R1, R2, R3 = calculate(dataset)
df = outputdata(R1, R2, R3)
df.to_csv('Report.txt')
Let's look at the final result
,Association rules,Support,Confidence
0,frozenset({'e'})-->frozenset({'a'}),0.2,1.0
1,frozenset({'e'})-->frozenset({'c'}),0.2,1.0
2,frozenset({'d'})-->frozenset({'b'}),0.2,1.0
3,frozenset({'a'})-->frozenset({'b'}),0.4,0.6666666666666667
4,frozenset({'b'})-->frozenset({'a'}),0.4,0.5
5,frozenset({'a'})-->frozenset({'c'}),0.4,0.6666666666666667
6,frozenset({'c'})-->frozenset({'a'}),0.4,0.6666666666666667
7,frozenset({'b'})-->frozenset({'c'}),0.4,0.5
8,frozenset({'c'})-->frozenset({'b'}),0.4,0.6666666666666667
9,"frozenset({'a', 'c'})-->frozenset({'e'})",0.2,0.5
10,"frozenset({'a', 'e'})-->frozenset({'c'})",0.2,1.0
11,"frozenset({'c', 'e'})-->frozenset({'a'})",0.2,1.0
12,"frozenset({'e'})-->frozenset({'a', 'c'})",0.2,1.0
13,"frozenset({'a', 'b'})-->frozenset({'c'})",0.2,0.5
14,"frozenset({'a', 'c'})-->frozenset({'b'})",0.2,0.5
15,"frozenset({'b', 'c'})-->frozenset({'a'})",0.2,0.5
Optimization of the Apriori algorithm
1: Itemsets already determined to be infrequent can be deleted directly from the transaction database, which avoids scanning them repeatedly and reduces I/O overhead.
2: For the frequent k-itemsets, if a single item i appears fewer than k times, no itemset containing i can appear in the frequent (k+1)-itemsets (every frequent (k+1)-itemset containing i would need k frequent k-subsets that contain i). Itemsets containing such an item i should therefore be deleted from the frequent k-itemsets before the join step.
The complete optimized code:
def createCDDSet(dataSet):
    C = []
    for tid in dataSet:
        for item in tid:
            if not [item] in C:
                C.append([item])
    C.sort()
    return list(map(frozenset, C))

def scanD(dataSet, CK, minsupport, numItem, k=0):  # Accept candidate k-itemsets and output frequent k-itemsets
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                if not can in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    # numItem = float(len(dataSet))
    # Moved into the apriori method: scanD now shrinks dataSet, so recomputing
    # numItem here would change it between passes and add needless work
    retList = []
    supportData = {}
    for key in ssCnt:
        support = float(ssCnt[key] / numItem)
        if support >= minsupport:
            retList.insert(0, key)
        # Scan D again and delete the infrequent k-itemsets from it. This improvement
        # compresses the transaction database and reduces the data to be scanned
        else:
            for tid in list(dataSet):  # Iterate over a copy: a set must not change size during iteration
                if key == tid:
                    dataSet.remove(tid)
        supportData[key] = support
    R_List = []
    # For the frequent k-itemsets, if a single item i appears fewer than k times,
    # i cannot appear in any frequent (k+1)-itemset, so itemsets containing such
    # an i are deleted from the frequent k-itemsets before the join step.
    # Improvement direction: compress the candidate set CK
    if k > 1:
        ssCnt = {}
        for tid in retList:
            for key in tid:
                if not key in ssCnt:
                    ssCnt[key] = 1
                else:
                    ssCnt[key] += 1
        tids = []
        for tid in retList:
            for item in tid:
                if item in ssCnt.keys():
                    if ssCnt[item] < k:
                        tids.append(tid)
        R_List = list(set(retList) - set(tids))
        print('Frequent itemsets before optimization' + str(retList) + ' '
              + 'Optimized frequent itemsets' + str(R_List))
    return retList, supportData, R_List

def aprioriGen(LK, k, RK):  # Create the candidate itemset CK, where k is the size of the output itemsets
    if RK:
        LK = RK  # Join from the pruned frequent itemsets when they are available
    retList = []
    lenLK = len(LK)
    for i in range(lenLK):
        for j in range(i + 1, lenLK):
            L1 = list(LK[i])[:k - 2]
            L2 = list(LK[j])[:k - 2]
            L1.sort()
            L2.sort()
            if L1 == L2:
                retList.append(LK[i] | LK[j])
    return retList

def apriori(dataSet, minsupport):
    C1 = createCDDSet(dataSet)
    D = set()
    for tid in dataSet:
        tid = frozenset(tid)
        D.add(tid)
    numItem = float(len(D))  # The only place numItem is calculated; scanD deletes elements of D, which would otherwise change it
    L1, supportData, R1 = scanD(D, C1, minsupport, numItem)
    L = [L1]
    R = [R1]
    k = 2
    while (len(L[k - 2]) > 0):
        CK = aprioriGen(L[k - 2], k, R[k - 2])
        LK, supK, RK = scanD(D, CK, minsupport, numItem, k)
        supportData.update(supK)
        L.append(LK)
        R.append(RK)
        k += 1
    L = [i for i in L if i]  # Delete empty lists
    return L, supportData
Time comparison before and after optimization
[Screenshots of the running time before and after optimization are omitted here.]
The frequent 2-itemsets fed into aprioriGen before optimization:
[frozenset({'e', 'a'}), frozenset({'c', 'e'}), frozenset({'c', 'a'}), frozenset({'c', 'b'}), frozenset({'b', 'd'}), frozenset({'b', 'a'})]
After optimization
[frozenset({'b', 'a'}), frozenset({'c', 'a'}), frozenset({'c', 'e'}), frozenset({'e', 'a'}), frozenset({'c', 'b'})]
You can see that one itemset, frozenset({'b', 'd'}), is pruned after optimization: the item 'd' appears in only one frequent 2-itemset, which is fewer than k = 2 times, so no frequent 3-itemset can contain it.
Summary
There are many ways to optimize the Apriori algorithm:
- Save itemsets using hash tables
- Reduce database scanning
- Partitioning: first find the local frequent itemsets within each partition, then find the global frequent itemsets from among them; only two scans of the transaction database are needed (see the sketch after this list)
- Sampling (at the cost of some accuracy)
- For the frequent k-itemsets, if a single item i appears fewer than k times, no itemset containing i can appear in the frequent (k+1)-itemsets, since every frequent (k+1)-itemset containing i has k frequent k-subsets that contain i; itemsets containing such an i should be deleted from the frequent k-itemsets before the join step
- Dynamic itemset counting
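To make the partition method concrete, here is a minimal sketch that reuses the apriori() defined above; the name partition_apriori and the two-pass structure are my own illustration, not code from the original post:

def partition_apriori(dataSet, minsupport, n_parts=2):
    # Pass 1: any globally frequent itemset must be locally frequent in at
    # least one partition, so the local results form the global candidate set
    size = len(dataSet) // n_parts + 1
    candidates = set()
    for start in range(0, len(dataSet), size):
        part = dataSet[start:start + size]
        L, _ = apriori(part, minsupport)
        for level in L:
            candidates.update(level)
    # Pass 2: one full scan to count the global support of every candidate
    numItem = float(len(dataSet))
    supportData = {}
    for can in candidates:
        count = sum(1 for tid in dataSet if can.issubset(tid))
        if count / numItem >= minsupport:
            supportData[can] = count / numItem
    return supportData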