Getting started with a Python crawler

Keywords: JSON, XML, Python, attribute

The basic steps of a crawler: get the URL; fetch and parse the web page to find the required content; process the data; save it.
Knowledge points:
(1) Generally, URLs of the same type of page follow a pattern, so observe them before crawling. For example, for the scenic-spot listing:
Second page: https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-2
Third page: https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-3
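A minimal sketch of generating such paged URLs from the pattern (the page count of 5 is just an illustrative number):

base = 'https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-{}'
page_urls = [base.format(i) for i in range(1, 6)]   # pages 1 to 5; count chosen arbitrarily
print(page_urls[1])                                 # the second page shown above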
(2) The page can be fetched with requests.get(url) or with urllib.request.urlopen(url) (urllib2 in Python 2). Note that requests is a separate third-party library, not part of urllib2.
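A quick sketch of both ways to fetch a page, assuming Python 3 (where urllib2 became urllib.request); the URL is the first listing page from above:

import requests
import urllib.request

url = 'https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-1'
html_via_requests = requests.get(url).text             # requests returns decoded text
html_via_urllib = urllib.request.urlopen(url).read()   # urllib returns raw bytes
print(len(html_via_requests), len(html_via_urllib))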
(3) The core of a crawler is parsing the content returned by the request. Take HTML/XML and JSON as examples:
a. HTML/XML format: Beautiful Soup 4 (bs4) is a Python library that extracts data from HTML or XML documents. It parses the content into a tree in which every node becomes a Python object, so you can locate and extract content with find or find_all by HTML tag (such as head, title), CSS selector (class), attribute, attribute value, etc.; see the short sketch after the parser list below.

bs4 supports four parsers:
'html.parser' has moderate parsing speed and good fault tolerance (built into Python)
'lxml' has fast parsing speed and good fault tolerance
'lxml-xml' ('xml') has fast parsing speed and is the only parser that supports XML
'html5lib' has slow parsing speed but the best fault tolerance
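As a small illustration of locating content by tag, class and attribute, here is a sketch on a hand-written HTML snippet (not the live page), using the built-in 'html.parser':

from bs4 import BeautifulSoup

html = '''
<ul class="list_item">
  <li data-lat="31.23"><span class="cn_tit">The Bund</span></li>
  <li data-lat="31.22"><span class="cn_tit">Yu Garden</span></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')   # no extra parser install needed
for li in soup.find('ul', class_="list_item").find_all('li'):
    # locate by tag + class, then read an attribute and a nested span's text
    print(li['data-lat'], li.find('span', class_="cn_tit").text)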

b. JSON format: you can use Python's built-in json package to turn the content into a dictionary for lookup, e.g.:

import json
import requests
import urllib.request   # urllib2 in Python 2 became urllib.request in Python 3

url = 'https://restapi.amap.com/v3/geocode/geo?address=Wuhan&key=a0325cbd9f7ab7eeb1bdb16ca78922b2'
temp1 = urllib.request.urlopen(url).read()   # bytes
temp2 = requests.get(url).text               # str
temptext1 = json.loads(temp1)                # json.loads accepts bytes or str
temptext2 = json.loads(temp2)
address = temptext1['geocodes'][0]['location']

(4) Content positioning: open the page in the browser's developer tools (right-click, then Inspect), select an element, and the highlighted code corresponds to the matching section of the page.
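For instance, once DevTools shows that each scenic spot sits in an <li> under <ul class="list_item clrfix"> (the same structure the code below relies on, which may change if the site is updated), the selector can be tried directly:

import requests
from bs4 import BeautifulSoup

url = 'https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
items = soup.select('ul.list_item.clrfix > li')   # CSS selector taken from DevTools
print(len(items))                                 # number of spots on the first page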

Take Qunar (travel.qunar.com) as an example and crawl information about Shanghai scenic spots. Code attached.

import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
'''
1. Data acquisition
'''
def get_urls(n):
    return ['https://travel.qunar.com/p-cs299878-shanghai-jingdian-1-' + str(i+1) for i in range(n)]
# Create function to get paging URL

def get_informations(u):
    ri = requests.get(u)
    # fetch the page with requests
    soupi = BeautifulSoup(ri.text,'lxml')
    # bs uses lxml to parse the page
    infori = soupi.find('ul',class_="list_item clrfix").find_all('li')
    # Get the list content according to the css tag of the web page
    datai = []
    n=0
    for i in infori:
        n+=1
        #print(i.text)
        dic = {}
        dic['lat'] = i['data-lat']
        dic['lng'] = i['data-lng']
        dic['Name of scenic spot'] = i.find('span',class_="cn_tit").text
        dic['Quantity mentioned in strategy'] = i.find('div',class_="strategy_sum").text
        dic['Quantity of comments'] = i.find('div',class_="comment_sum").text
        dic['Scenic spots ranking'] = i.find('span',class_="ranking_sum").text
        dic['Star class'] = i.find('span',class_="total_star").find('span')['style'].split(':')[1]
        datai.append(dic)
    # Get field contents respectively
    # print('Collected %s records' % (n*10))
    return datai

# Build page crawler

url_lst = get_urls(5)
# Get 5 Web addresses

df = pd.DataFrame()
for u in url_lst:
    dfi = pd.DataFrame(get_informations(u))
    print(dfi)
    df = pd.concat([df, dfi])
    df.reset_index(inplace=True, drop=True)
# Data collection

'''
2. Field filtering and data cleaning
'''
df['lng'] = df['lng'].astype(float)
df['lat'] = df['lat'].astype(float)
df['Quantity of comments'] = df['Quantity of comments'].astype(int)
df['Quantity mentioned in strategy'] = df['Quantity mentioned in strategy'].astype(int)
    # Field type processing

df['Star class'] = df['Star class'].str.replace('%','').astype(float)
   # Star field processing

df['Scenic spots ranking'] = df['Scenic spots ranking'].str.split('第').str[1]   # the ranking text on the page looks like '第1名', so split on '第'
df['Scenic spots ranking'].fillna(value = 0,inplace = True) 

'''
3. Export
'''
df.to_excel('JD.xlsx')
