Web crawler explanation: urllib crawlers, status codes, exception handling, and browser camouflage via User-Agent settings


Without exception handling, a crawler crashes and stops working the moment a request fails; with exception handling, it can report the error and keep crawling.

1. Common status codes

301: Moved permanently; the resource has a new URL
302: Found; temporary redirect to another URL
304: Not modified; the requested resource has not changed
400: Bad request; the request is malformed
401: Unauthorized; authentication is required
403: Forbidden; access is denied
404: Not found; no page matches the request
500: Internal server error
501: Not implemented; the server does not support the functionality required by the request
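These reason phrases are part of the HTTP standard, and Python ships them in `http.client.responses`, so they can be checked offline; a minimal sketch:

```python
from http.client import responses

# Print the standard reason phrase for each status code listed above
for code in (301, 302, 304, 400, 401, 403, 404, 500, 501):
    print(code, responses[code])
```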

2. Exception handling

URLError can be used to catch the exception information when a request fails:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.error

try:                                    # Try to fetch the page
    html = urllib.request.urlopen('http://www.xiaohuar.com/').read().decode("utf-8")
    print(html)

except urllib.error.URLError as e:      # If an error occurs
    if hasattr(e,"code"):               # If there is a status code
        print(e.code)                   # Print the status code
    if hasattr(e,"reason"):             # If there is an error reason
        print(e.reason)                 # Print the error reason

# The returned result shows that the site forbids crawler access:
# 403
# Forbidden
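Since HTTPError is a subclass of URLError that carries the status code, the two cases can also be caught separately; a minimal sketch (the /no-such-page path on example.com is a placeholder URL):

```python
import urllib.request
import urllib.error

# HTTPError means the server answered with an error status (403, 404, ...)
# and carries e.code; a plain URLError means the request never got a
# response, e.g. a DNS failure or a refused connection.
try:
    html = urllib.request.urlopen('http://www.example.com/no-such-page').read()
except urllib.error.HTTPError as e:
    print(e.code, e.reason)       # server-side error status
except urllib.error.URLError as e:
    print(e.reason)               # network-level failure
```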

3. Browser camouflage technology

Many websites deploy anti-crawling measures. A common one is to check in the background whether the request headers contain browser User-Agent information; if not, the server concludes the visitor is not a browser and blocks the request.


So we need to disguise the request with a browser-style header.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
url = 'https://www.qiushibaike.com/'                # URL of the page to crawl
tou = ('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')  # Set a simulated browser header
b_tou = urllib.request.build_opener()               # Create an opener object
b_tou.addheaders=[tou]                              # Add the header to the opener
html = b_tou.open(url).read().decode("utf-8")       # Fetch the page
print(html)

Note: Here the page is fetched with the opener's open() method, not with urlopen(); a header added via addheaders does not apply to urlopen(). Creating a build_opener() for every request is tedious, so we can instead configure urlopen() to add the header automatically.

Setting an automatic header for the urlopen() method, that is, setting a global user agent.

install_opener() installs the opener globally, so its header information is added automatically whenever urlopen() makes a request.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
# Set the header information
tou = ('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')  # Set a simulated browser header
b_tou = urllib.request.build_opener()               # Create an opener object
b_tou.addheaders=[tou]                              # Add the header to the opener
# Install the opener globally; urlopen() will now add the header automatically
urllib.request.install_opener(b_tou)

#request
url = 'https://www.qiushibaike.com/'
html = urllib.request.urlopen(url).read().decode("utf-8")
print(html)
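For completeness, urllib can also attach a header to a single request through urllib.request.Request, without touching the global opener; a minimal sketch (the URL and agent string are taken from the example above):

```python
import urllib.request

# Attach the header to one request instead of installing a global opener
url = 'https://www.qiushibaike.com/'
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'
})
# urllib normalizes header names, so the stored key is 'User-agent'
print(req.get_header('User-agent'))
# html = urllib.request.urlopen(req).read().decode("utf-8")
```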

4. Creating a user-agent pool

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import random   # Import the random module

def yh_dl():    # Create the user-agent pool
   yhdl = [
       'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
       'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
       'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
       'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
       'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
       'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
       'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
       'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
       'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
       'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
       'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
       'Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+',
       'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)',
       'UCWEB7.0.2.37/28/999',
       'NOKIA5700/ UCWEB7.0.2.37/28/999',
       'Openwave/ UCWEB7.0.2.37/28/999',
       'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999'
       ]
   thisua = random.choice(yhdl)                    # Randomly pick a user agent from the pool
   headers = ("User-Agent",thisua)                 # Build the header tuple
   opener = urllib.request.build_opener()          # Create an opener object
   opener.addheaders=[headers]                     # Add the header to the opener
   urllib.request.install_opener(opener)           # Install globally; urlopen() will add the header automatically

#request
yh_dl()     # Install a randomly chosen user agent from the pool
url = 'https://www.qiushibaike.com/'
html = urllib.request.urlopen(url).read().decode("utf-8")
print(html)


In this way, the crawler picks a user agent at random before each request, so the header varies from request to request.
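The rotation can be sketched end to end with a shortened pool; set_random_agent() and the three-entry list below are illustrative simplifications, not part of the original code:

```python
import random
import urllib.request

# A shortened pool for illustration; the full list above works the same way
pool = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
]

def set_random_agent():
    # Install a fresh global opener with a randomly chosen User-Agent
    ua = random.choice(pool)
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', ua)]
    urllib.request.install_opener(opener)
    return ua

for _ in range(3):
    print(set_random_agent())   # a new random agent before each request
    # html = urllib.request.urlopen(url).read().decode("utf-8")
```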

Posted by RandomZero on Mon, 12 Aug 2019 06:09:59 -0700