The basic steps of a crawler: get the URL, parse the web page and extract the required content, process the data, and save it.
Knowledge points:
(1) Generally, URLs for the same type of web page follow certain patterns, so observe them carefully when crawling. For example, the Shanghai scenic spot list pages on Qunar:
Page 2: https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-2
Page 3: https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-3
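A minimal sketch of turning that pattern into a list of page URLs (crawling 5 pages here is just an illustrative choice):
# The page number is the last part of the URL, so the list pages can be generated directly
base = 'https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-'
page_urls = [base + str(i) for i in range(1, 6)]
print(page_urls[0])   # https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-1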
(2) The page can be fetched with requests.get(url) or urllib2.urlopen(url). Note that requests is a separate third-party library, not part of urllib2; urllib2 belongs to the Python 2 standard library (in Python 3 it became urllib.request).
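A minimal sketch of both approaches (urllib2 exists only in Python 2; the equivalent call in Python 3 is urllib.request.urlopen):
import requests
from urllib.request import urlopen   # Python 3 replacement for urllib2.urlopen

url = 'https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-2'
html1 = requests.get(url).text   # requests: third-party library, returns decoded text
html2 = urlopen(url).read()      # urllib: standard library, returns raw bytes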
(3) The focus of a crawler is how to parse the content returned by the request. Take HTML (XML) and JSON as examples:
a. HTML/XML format: Beautiful Soup 4 (bs4) is a Python library that extracts data from HTML or XML files. It parses the content into a tree structure in which each node becomes a Python object, so content can be located and extracted with find or find_all by HTML tag (such as head, title), CSS selector (class tags), attribute, attribute value, etc., as shown in the sketch after the parser list below.
bs4 supports four parsers:
'html.parser' has moderate parsing speed and strong fault tolerance
'lxml' has fast parsing speed and strong fault tolerance
'xml' (also written 'lxml-xml') has fast parsing speed and is the only parser supporting XML
'html5lib' has slow parsing speed but the best fault tolerance
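A minimal sketch of locating and extracting content with bs4; the tag and class names mirror the Qunar list page used in the full example at the end of this article, so the exact selectors depend on the live page structure:
import requests
from bs4 import BeautifulSoup

url = 'https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')                 # any of the four parsers can be passed here
items = soup.find('ul', class_="list_item clrfix").find_all('li')    # locate by tag and css class
for li in items[:3]:
    print(li.find('span', class_="cn_tit").text)                     # extract the scenic spot name from each list item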
b. JSON format: you can use Python's built-in json package to turn the content into a dictionary for lookup, for example:
url = 'https://restapi.amap.com/v3/geocode/geo?address=Wuhan&key=a0325cbd9f7ab7eeb1bdb16ca78922b2'
temp1 = urllib2.urlopen(url).read()              # fetch with urllib2 (Python 2; urllib.request.urlopen in Python 3)
temp2 = requests.get(url).text                   # fetch with requests
temptext1 = json.loads(temp1)                    # parse the JSON string into a dictionary
temptext2 = json.loads(temp2)
address = temptext1['geocodes'][0]['location']   # read nested values by key and index
(4) Content positioning: open the page, open the browser developer tools and use the element (inspect) view; the highlighted code corresponds to the matching section of the page, which tells you which tags, classes and attributes to target.
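For example, inspecting one scenic spot entry shows markup roughly like the simplified snippet below (the attribute values here are made up for illustration); the classes and attributes seen in the inspector map directly to the find() arguments used in the code:
from bs4 import BeautifulSoup

# Simplified markup of one list item as it appears in the element view; values are illustrative
snippet = ('<ul class="list_item clrfix">'
           '<li data-lat="31.24" data-lng="121.49"><span class="cn_tit">The Bund</span></li>'
           '</ul>')
li = BeautifulSoup(snippet, 'lxml').find('ul', class_="list_item clrfix").find('li')
print(li['data-lat'], li.find('span', class_="cn_tit").text)   # inspected attributes and classes become find() arguments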
Take Qunar (travel.qunar.com) as an example and crawl information about Shanghai scenic spots. The code is attached below.
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

''' 1. Data acquisition '''

def get_urls(n):
    # Build the paginated list URLs
    return ['https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-' + str(i+1) for i in range(n)]

def get_informations(u):
    ri = requests.get(u)                    # requests visits the website
    soupi = BeautifulSoup(ri.text, 'lxml')  # bs4 parses the page with lxml
    infori = soupi.find('ul', class_="list_item clrfix").find_all('li')  # locate the list items by css class
    datai = []
    n = 0
    for i in infori:
        n += 1
        dic = {}
        dic['lat'] = i['data-lat']
        dic['lng'] = i['data-lng']
        dic['Name of scenic spot'] = i.find('span', class_="cn_tit").text
        dic['Quantity mentioned in strategy'] = i.find('div', class_="strategy_sum").text
        dic['Quantity of comments'] = i.find('div', class_="comment_sum").text
        dic['Scenic spots ranking'] = i.find('span', class_="ranking_sum").text
        dic['Star class'] = i.find('span', class_="total_star").find('span')['style'].split(':')[1]
        datai.append(dic)                   # collect the fields of one scenic spot
        # print('collected %s data' % (n * 10))
    return datai                            # per-page crawler

url_lst = get_urls(5)                       # get 5 page URLs
df = pd.DataFrame()
for u in url_lst:
    dfi = pd.DataFrame(get_informations(u))
    print(dfi)
    df = pd.concat([df, dfi])
df.reset_index(inplace=True, drop=True)     # data collection

''' 2. Field filtering and data cleaning '''

df['lng'] = df['lng'].astype(float)
df['lat'] = df['lat'].astype(float)
df['Quantity of comments'] = df['Quantity of comments'].astype(int)
df['Quantity mentioned in strategy'] = df['Quantity mentioned in strategy'].astype(int)  # field type conversion
df['Star class'] = df['Star class'].str.replace('%', '').astype(float)                  # star rating: strip the '%' sign
df['Scenic spots ranking'] = df['Scenic spots ranking'].str.split('第').str[1]          # ranking text looks like '第1名'; keep the part after '第'
df['Scenic spots ranking'].fillna(value=0, inplace=True)

''' 3. Export '''

df.to_excel('JD.xlsx')