Crawling web page data using Python

Keywords: Python crawler

When you need to extract information from some web pages, it is very convenient to write a crawler in Python to do the crawling.

1. Use urllib.request to get the web page

urllib is a built-in HTTP library in Python. With urllib, data can be collected efficiently in very simple steps; combined with Beautiful Soup and other HTML parsing libraries, large crawlers for collecting network data can be written;

Note: the sample code is written in Python 3; in Python 3, urllib merges Python 2's urllib and urllib2, and Python 2's urllib2 corresponds to urllib.request in Python 3.
Simple example:

import urllib.request                                      # import urllib.request

response = urllib.request.urlopen('http://www.zhihu.com')  # open the URL
html = response.read()                                     # read the response body (bytes)
html = html.decode('utf-8')                                # decode the bytes to str
print(html)
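
For pages that need to be parsed rather than just printed, the fetched HTML can be handed to a parser such as Beautiful Soup. A minimal sketch, assuming Beautiful Soup 4 is installed (pip install beautifulsoup4); the tags extracted here are only for illustration:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://www.zhihu.com').read()
soup = BeautifulSoup(html, 'html.parser')       # parse the HTML document

print(soup.title.string)                        # the page title
for a in soup.find_all('a'):                    # every hyperlink on the page
    print(a.get('href'))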

2. Forge request header information

Sometimes a request sent by the crawler is rejected by the server. In that case the crawler needs to be disguised as an ordinary user's browser, which is usually done by forging the request headers, for example:

import urllib.request

# define a function for saving the fetched data
def saveFile(data):
    path = "E:\\projects\\Spider\\02_douban.out"
    f = open(path, 'wb')
    f.write(data)
    f.close()

# target website
url = "https://www.douban.com/"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/51.0.2704.63 Safari/537.36'}

req = urllib.request.Request(url=url, headers=headers)
res = urllib.request.urlopen(req)

data = res.read()

# the crawled content can also be saved to a file
saveFile(data)

data = data.decode('utf-8')

# print the captured content
print(data)

# print various information about the crawled page
print(type(res))
print(res.geturl())
print(res.info())
print(res.getcode())

The forged User-Agent can be taken from the Chrome plug-in UA Spoofer: right-click the extension's icon, and the User-Agent strings of many ordinary users' browsers can be copied directly.
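
Once a few User-Agent strings have been copied, they can be rotated between requests so that not every request looks identical. A minimal sketch; the two strings below are simply the ones already used elsewhere in this article:

import random
import urllib.request

# candidate User-Agent strings copied from real browsers
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0',
]

url = "https://www.douban.com/"
headers = {'User-Agent': random.choice(user_agents)}    # pick one at random for each request
req = urllib.request.Request(url=url, headers=headers)
data = urllib.request.urlopen(req).read()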

3. Forge the request body

When crawling some websites you need to POST data to the server, and in that case the request body also has to be forged;

To implement an online translation script for the Youdao dictionary, open the developer tools in Chrome and find the request whose method is POST under the Network tab. By inspecting its data, you can see that the 'i' field in the request body is the URL-encoded content to be translated, so the request body can be forged, for example:

import urllib.request
import urllib.parse
import json

while True:
    content = input('Please enter the content to be translated:')
    if content == 'exit!':
        break

    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'

    # Request body
    data = {}
    data['type'] = "AUTO"
    data['i'] = content
    data['doctype'] = "json"
    data['xmlVersion'] = "1.8"
    data['keyfrom'] = "fanyi.web"
    data['ue'] = "UTF-8"
    data['action'] = "FY_BY_CLICKBUTTON"
    data['typoResult'] = "true"
    data = urllib.parse.urlencode(data).encode('utf-8')

    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0'

    req = urllib.request.Request(url,data,head)            # Forge request header and request body
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')

    target = json.loads(html)
    print('Translation results: ',(target['translateResult'][0][0]['tgt']))

You can also use the add_header() method to forge the request headers, for example:

import urllib.request
import urllib.parse
import json

while True:
    content = input('Please enter the content to be translated (exit!):')
    if content == 'exit!':
        break

    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'

    # Request body
    data = {}
    data['type'] = "AUTO"
    data['i'] = content
    data['doctype'] = "json"
    data['xmlVersion'] = "1.8"
    data['keyfrom'] = "fanyi.web"
    data['ue'] = "UTF-8"
    data['action'] = "FY_BY_CLICKBUTTON"
    data['typoResult'] = "true"
    data = urllib.parse.urlencode(data).encode('utf-8')

    req = urllib.request.Request(url, data)
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0')

    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')

    target = json.loads(html)
    print('Translation results: ', (target['translateResult'][0][0]['tgt']))

4. Use proxy IP

To avoid having the IP blocked because the crawler collects data too frequently, a proxy IP can be used, for example:

import urllib.request

# the parameter is a dictionary {'type': 'proxy ip:port'}, where 'type' is the protocol, e.g. 'http'
proxy_support = urllib.request.ProxyHandler({'http': 'ip:port'})

# build a customised opener
opener = urllib.request.build_opener(proxy_support)

# install the opener globally, so that urllib.request.urlopen() also goes through the proxy
urllib.request.install_opener(opener)

# or call the opener directly (url is the target URL to crawl)
opener.open(url)

Note: visiting the target site too frequently with a crawler takes up a lot of server resources, and crawling one site intensively with a large-scale distributed crawler is practically equivalent to launching a DDoS attack on it. Therefore the crawling frequency and schedule should be arranged reasonably: for example, crawl when the server is relatively idle (e.g. in the early morning) and pause for a while after each crawling task is completed, as sketched below.
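
A minimal sketch of such throttling, assuming a hypothetical list of URLs; the example.com addresses and the 1-3 second delay are placeholders, not values from the original script:

import random
import time
import urllib.request

urls = ['https://www.example.com/page/%d' % i for i in range(1, 6)]   # hypothetical URL list

for url in urls:
    html = urllib.request.urlopen(url).read()
    # ... process html here ...
    time.sleep(random.uniform(1, 3))    # pause 1-3 seconds between requests to spare the server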

5. Detect the encoding of a web page

Although most web pages are encoded in UTF-8, you will sometimes encounter pages that use other encodings, so the page's encoding must be known in order to decode the crawled content correctly;

chardet is a third-party Python module that can automatically detect the encoding of a web page;

Install chardet: pip install chardet
Usage:

import chardet
import urllib.request

url = 'http://www.baidu.com'
html = urllib.request.urlopen(url).read()

print(chardet.detect(html))
# {'confidence': 0.99, 'encoding': 'utf-8'}
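
Once detected, the encoding can be used to decode the crawled page. A minimal sketch building on the snippet above; falling back to utf-8 when detection fails is an added assumption:

import chardet
import urllib.request

html = urllib.request.urlopen('http://www.baidu.com').read()

encoding = chardet.detect(html)['encoding'] or 'utf-8'   # fall back to utf-8 if detection fails
text = html.decode(encoding)
print(text[:200])                                        # first 200 characters of the decoded page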

6. Get the jump (redirect) link

Sometimes a page has to jump once, or even several times, from the original URL before the final destination page is reached, so the redirects need to be handled correctly;

The URL that the page jumps to can be obtained with the head() function of the requests module, for example:

import requests

url = 'https://unsplash.com/photos/B1amIgaNkwA/download/'
res = requests.head(url)                   # HEAD request; requests.head() does not follow redirects by default
redirect_url = res.headers['Location']     # the URL that the page jumps to
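
If a page jumps more than once, the Location header can be followed in a loop until a non-redirect response comes back. A minimal sketch, assuming every hop returns a 3xx status with an absolute Location header (relative Location values are not handled here):

import requests

url = 'https://unsplash.com/photos/B1amIgaNkwA/download/'
while True:
    res = requests.head(url)
    if 300 <= res.status_code < 400 and 'Location' in res.headers:
        url = res.headers['Location']      # follow the next hop
    else:
        break                              # reached the final destination
print(url)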
