1. Search engines and recommendation systems
From the perspective of information acquisition, search and recommendation are the two main means by which users obtain information, and both exist side by side on the Internet and offline. Search is an active behavior: the user's need is explicit, and by browsing and clicking the results returned by a search engine the user can immediately judge whether they meet that need. With a recommendation system, in contrast, the user receives information passively, and the underlying need is vague and implicit. As shown in Figure 1, search engines and recommendation systems are two different ways of obtaining information.
Although search and recommendation differ in many ways, both are application branches of big data technology and overlap heavily, and recommendation systems make extensive use of search-engine techniques. A key data structure that search engines rely on for query performance is the inverted index; in recommendation systems, an important family of algorithms is content-based recommendation, which likewise makes heavy use of inverted indexes, query processing, and result merging. Both fields draw on long-standing disciplines such as data mining, information retrieval, and computational statistics. Figure 2 shows an example of the Baidu search engine combined with a recommendation system; the circled part is the content produced by the recommendation system.
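To make the inverted index concrete, here is a minimal sketch in Python (the documents and the query are made-up examples): each term is mapped to the set of documents that contain it, and a query is answered by intersecting those posting sets, which is the core operation shared by search engines and content-based recommenders.

# Minimal inverted-index sketch: term -> set of document ids.
# The documents and the query are hypothetical examples.
from collections import defaultdict

documents = {
    1: "science fiction space adventure",
    2: "romantic comedy in space",
    3: "science documentary about space",
}

# Build the inverted index
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    # Return the ids of documents containing every query term
    # (intersection of the posting sets).
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("science space"))   # {1, 3}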
2. Principles and algorithms of recommendation systems
A recommendation system has three important models: the user model, the recommendation object model, and the recommendation algorithm model. The system matches the interest and demand information in the user model against the feature information in the recommendation object model, uses the corresponding recommendation algorithm to compute and filter candidates, finds the objects the user is likely to be interested in, and then recommends them to the user. Commonly used similarity measures include the Jaccard coefficient, cosine similarity, and the Pearson coefficient.
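The Jaccard coefficient and cosine similarity are worked through in the rest of this section; as a quick illustration of the Pearson coefficient, the following minimal sketch compares two users' rating vectors (the ratings are made-up example data):

# Pearson correlation between two users' rating vectors.
# The rating data below is hypothetical.
import math

def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Two users' ratings of the same five movies (hypothetical data)
user_a = [5, 4, 4, 2, 1]
user_b = [4, 5, 3, 2, 2]
print(pearson(user_a, user_b))  # a value near 1 indicates similar taste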
The Jaccard coefficient of two sample sets A and B, written J(A,B), is the size of their intersection divided by the size of their union: J(A,B) = |A ∩ B| / |A ∪ B|. When both A and B are empty, J(A,B) is defined as 1. The main application scenarios of the Jaccard coefficient are:
- Filter news with high similarity, or remove duplicate pages;
- Exam anti-cheating systems;
- Paper duplicate-checking (plagiarism detection) systems.
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 2 22:52:58 2020
@author: zcq
"""
import jieba

def jaccard(model, reference):
    # model is the candidate sentence, reference is the source sentence
    terms_reference = jieba.cut(reference)  # default (precise) segmentation mode
    terms_model = jieba.cut(model)
    grams_reference = set(terms_reference)  # de-duplicate the terms; use list() to keep duplicates
    grams_model = set(terms_model)
    intersection = 0
    for term in grams_reference:
        if term in grams_model:
            intersection += 1
    union = len(grams_model) + len(grams_reference) - intersection
    return float(intersection) / union

a = "Shannon's information entropy is defined as the expectation of self information"
b = "Information entropy is the expectation of self information"
jaccard_coefficient = jaccard(a, b)
print(jaccard_coefficient)
The Python code above computes the Jaccard coefficient of the two sentences a and b; the resulting similarity is about 0.3846.
Python run result:
0.38461538461538464
In this paper, cosine similarity is used to measure the distance between a user and each movie: the higher the cosine similarity, the stronger the user's preference for that movie. In the user vector, Ua denotes the user's preference for genre a; in the movie vector, Ia denotes whether the movie belongs to genre a. Next, we introduce two ways to obtain data sets: directed web crawling and public data sets.
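As a minimal sketch of this calculation (the genre names, preference values, and genre indicators below are made-up examples), the user vector U and a movie's genre vector I are compared with the standard cosine formula:

# Cosine similarity between a user's genre-preference vector U and a
# movie's genre-indicator vector I. The genres and values are hypothetical.
import math

genres = ["action", "comedy", "romance", "sci-fi"]
user_pref = [0.9, 0.1, 0.0, 0.8]   # Ua: the user's preference for each genre
movie = [1, 0, 0, 1]               # Ia: whether the movie belongs to each genre

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine_similarity(user_pref, movie))  # the higher, the stronger the preference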
3. Directed data crawling and movie data sets
3.1 Crawl the seven-day weather forecast data into a SQLite database, in five steps.
Step 1: get the Xi'an weather forecast page http://www.weather.com.cn/weather/101110101.shtml;
Step 2: import the sqlite3, BeautifulSoup, and urllib.request components to implement the WeatherDB and WeatherForecast classes;
Step 3: implement the database methods of the WeatherDB class: openDB, insert, show (query), and closeDB;
Step 4: implement the __init__ initializer, the forecastCity method, and the batch process method of the WeatherForecast class;
Step 5: Python programming (the program code follows).
# -*- coding: utf-8 -*-
"""
Created on Sat Jun 27 09:18:48 2020
@author: zcq
"""
from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import sqlite3

class WeatherDB:
    def openDB(self):
        self.con = sqlite3.connect("weathers.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table weathers (wCity varchar(16),wDate varchar(16),"
                                "wWeather varchar(64),wTemp varchar(32),"
                                "constraint pk_weather primary key(wCity,wDate))")
        except Exception:
            # The table already exists: clear the old rows instead
            self.cursor.execute("delete from weathers")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, city, date, weather, temp):
        try:
            self.cursor.execute("insert into weathers (wCity,wDate,wWeather,wTemp) values (?,?,?,?)",
                                (city, date, weather, temp))
        except Exception as err:
            print(err)

    def show(self):
        self.cursor.execute("select * from weathers")
        rows = self.cursor.fetchall()
        print("%-16s%-16s%-32s%-16s" % ("city", "date", "weather", "temp"))
        for row in rows:
            print("%-16s%-16s%-32s%-16s" % (row[0], row[1], row[2], row[3]))

class WeatherForecast:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        self.cityCode = {"Beijing": "101010100", "Shanghai": "101020100",
                         "Guangzhou": "101280101", "Shenzhen": "101280601",
                         "Xi'an": "101110101"}

    def forecastCity(self, city):
        if city not in self.cityCode.keys():
            print(city + " code cannot be found")
            return
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])  # guess the page encoding
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            lis = soup.select("ul[class='t clearfix'] li")  # one li per forecast day
            for li in lis:
                try:
                    date = li.select('h1')[0].text
                    weather = li.select('p[class="wea"]')[0].text
                    temp = li.select('p[class="tem"] span')[0].text + "/" + li.select('p[class="tem"] i')[0].text
                    print(city, date, weather, temp)
                    self.db.insert(city, date, weather, temp)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)

    def process(self, cities):
        self.db = WeatherDB()
        self.db.openDB()
        for city in cities:
            self.forecastCity(city)
        # self.db.show()
        self.db.closeDB()

ws = WeatherForecast()
ws.process(["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Xi'an"])
print("completed")
Run result:
Beijing 3 (tomorrow) thunderstorm turning cloudy 25℃/20℃
Beijing 4 (the day after tomorrow) thunderstorm 28℃/20℃
Beijing 5 (Sunday) thunderstorm 29℃/21℃
Beijing 6 (Monday) cloudy 31℃/23℃
Beijing 7 (Tuesday) thunderstorm 32℃/22℃
Beijing 8 (Wednesday) cloudy 30℃/22℃
Shanghai 3 (tomorrow) moderate rain to overcast 26℃/23℃
Shanghai 4 (the day after tomorrow) light rain to heavy rain 28℃/24℃
Shanghai 5 (Sunday) moderate rain 29℃/24℃
Shanghai 6 (Monday) light rain 29℃/25℃
Shanghai 7 (Tuesday) drizzle 30℃/26℃
Shanghai 8 (Wednesday) overcast to light rain 29℃/26℃
Guangzhou 3 (tomorrow) thunderstorm 33℃/27℃
Guangzhou 4 (the day after tomorrow) thunderstorm 33℃/28℃
Guangzhou 5 (Sunday) sunny 34℃/28℃
Guangzhou 6 (Monday) sunny 35℃/28℃
Guangzhou 7 (Tuesday) sunny 35℃/28℃
Guangzhou 8 (Wednesday) sunny 35℃/28℃
Shenzhen 3 (tomorrow) thunderstorm 32℃/27℃
Shenzhen 4 (the day after tomorrow) thunderstorm 32℃/27℃
Shenzhen 5 (Sunday) thunderstorm to shower 32℃/27℃
Shenzhen 6 (Monday) shower 33℃/28℃
Shenzhen 7 (Tuesday) shower 33℃/28℃
Shenzhen 8 (Wednesday) shower 33℃/28℃
Xi'an 3 (tomorrow) cloudy to overcast 35℃/23℃
Xi'an 4 (the day after tomorrow) cloudy 33℃/22℃
Xi'an 5 (Sunday) sunny 36℃/23℃
Xi'an 6 (Monday) sunny to cloudy 37℃/22℃
Xi'an 7 (Tuesday) sunny 37℃/22℃
Xi'an 8 (Wednesday) cloudy to overcast 33℃/21℃
completed
3.2 Crawl the Douban movie data set and store it in a CSV file, in four steps.
Step 1: get the Douban movie Top 250 list page;
Step 2: import the requests_html and csv modules;
Step 3: save the crawl results to the file Douban top251.csv;
Step 4: Python programming (see the code sketch below).
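The code for this subsection is given below as a minimal sketch based on steps 1-3, assuming the requests_html and csv modules; the URL pattern and the CSS selectors (div.item, span.title, span.rating_num) are assumptions about the layout of the Douban Top 250 pages and may need adjustment.

# -*- coding: utf-8 -*-
# Sketch of the Douban Top 250 crawler described in steps 1-3.
# The URL pattern and CSS selectors are assumptions about the page layout.
from requests_html import HTMLSession
import csv

def crawl_douban_top250(outfile="Douban top251.csv"):
    session = HTMLSession()
    headers = {"User-Agent": "Mozilla/5.0"}
    rows = []
    # The list is paginated 25 movies per page: start = 0, 25, ..., 225
    for start in range(0, 250, 25):
        url = "https://movie.douban.com/top250?start=%d" % start
        r = session.get(url, headers=headers)
        for item in r.html.find("div.item"):
            title = item.find("span.title", first=True).text
            rating = item.find("span.rating_num", first=True).text
            rows.append((title, rating))
    # Save the crawl results to a CSV file (step 3)
    with open(outfile, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "rating"])
        writer.writerows(rows)
    return rows

movies = crawl_douban_top250()
print("completed, %d movies saved" % len(movies))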