Python crawler learning notes

Keywords: Python, encoding, IE, Windows

4.2 urllib.parse

4.2.1 URL encoding and decoding

A URL may contain only certain characters (letters, digits, and a few safe symbols such as underscores). If other characters appear, such as ¥, spaces, or Chinese, they must be percent-encoded, otherwise the request cannot be sent.

urllib.parse.unquote is the URL-decoding function: it converts the %XX escape sequences (UTF-8 bytes) of characters other than letters, digits, and underscores back into the corresponding characters.
urllib.parse.quote is the URL-encoding function: it converts characters other than letters, digits, and underscores into their UTF-8 bytes in %XX form.

Sample code
The links we use often contain Chinese, for example: https://segmentfault.com/sear... A link like this must be encoded before the request can be sent.

import urllib.parse

url = 'https://segmentfault.com/search?q=markdown%E8%AF%AD%E6%B3%95'

unencd = urllib.parse.unquote(url)  # decode the %XX sequences back into characters
print(unencd)
encd = urllib.parse.quote(unencd)   # re-encode; note that ':' and '?' get escaped too
print(encd)

Output is:

https://segmentfault.com/search?q=markdown语法
https%3A//segmentfault.com/search%3Fq%3Dmarkdown%E8%AF%AD%E6%B3%95
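
Note that the second output line does not match the original URL: by default quote() leaves only '/' unescaped (its safe parameter defaults to '/'), so the ':', '?' and '=' of a full URL are escaped as well. A minimal sketch of working around this with the safe parameter (the exact character set passed here is just an illustrative choice):

import urllib.parse

url = 'https://segmentfault.com/search?q=markdown语法'

# Leave the URL's structural characters alone and encode only the rest.
encd = urllib.parse.quote(url, safe=':/?=&')
print(encd)  # https://segmentfault.com/search?q=markdown%E8%AF%AD%E6%B3%95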

4.2.2 parameter splicing

A URL often needs to carry parameters. When crawling a web page, these parameters must be spliced together in a specific format, and characters other than letters, digits, and underscores must be converted into their percent-encoded form.

urllib.parse.urlencode splices the parameters into a query string and converts characters other than letters, digits, and underscores into their percent-encoded form.
Sample code

import urllib.parse

data = {'name': '小A', 'age': '15', 'sex': '男'}
query_string = urllib.parse.urlencode(data)
print(query_string)

Output is:

name=%E5%B0%8FA&age=15&sex=%E7%94%B7
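
The reverse direction is also available: urllib.parse.parse_qs parses a query string back into a dict, with each value wrapped in a list because a key may appear more than once. A minimal sketch:

import urllib.parse

params = urllib.parse.parse_qs('name=%E5%B0%8FA&age=15&sex=%E7%94%B7')
print(params)  # {'name': ['小A'], 'age': ['15'], 'sex': ['男']}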

5. GET requests

Sample code

import urllib.request
import urllib.parse


word = input('Please enter what you want to search for: ')
url = 'http://www.baidu.com/s?'  # if written as https, the page cannot be fetched; I don't know why


data = {'ie': 'utf-8',
        'wd': word}
query_string = urllib.parse.urlencode(data)
url += query_string

response = urllib.request.urlopen(url)

filename = word + '.html'
with open(filename, 'wb') as fp:
    fp.write(response.read())

The result of the crawl is an HTML file.

Please enter what you want to search for: China
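
When printing the page instead of saving the raw bytes, the charset declared in the response headers can be used for decoding. A minimal sketch (the fallback to utf-8 when no charset is declared is my own assumption):

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com/s?wd=China')
charset = response.headers.get_content_charset() or 'utf-8'  # assumed fallback
html = response.read().decode(charset)
print(html[:200])  # first 200 characters of the page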


6. Disguising the UA

If the crawler does not disguise itself, it may be blocked by the website's anti-crawling measures. In the following code, for example, the crawler does not disguise itself, so the website can clearly see that the request comes from a crawler.
Sample code

import urllib.request
import urllib.parse

url = 'http://www.baidu.com/'  # the trailing / cannot be omitted, otherwise the URL is incomplete and HTTP Error 400: Bad Request may occur
response = urllib.request.urlopen(url)
print(response.read().decode())

Open Fiddler and capture the request before running the program. The capture shows that the UA (User-Agent) in the request header is Python-urllib/3.6, which makes the crawler easy to detect and block.
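
The default UA can also be checked locally without Fiddler: the headers that urllib attaches by default are visible on an opener's addheaders attribute. A minimal sketch:

import urllib.request

opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.6')]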

A normal browser's UA can easily be read from the packets captured by Fiddler, so the crawler can replace its own UA with a normal one as a disguise.
Sample code of the disguised crawler

import urllib.request
import urllib.parse
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/76.0.3809.132 Safari/537.36'}
url = 'http://www.baidu.com/'  # the trailing / cannot be omitted, otherwise the URL is incomplete and HTTP Error 400: Bad Request may occur
request = urllib.request.Request(url, headers=headers)

response = urllib.request.urlopen(request)
print(response.read().decode())

Fiddler now shows that the request header has been disguised as a normal browser's.
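
An alternative (not from the original notes) is to install the disguised UA globally, so that every subsequent urlopen call carries it. A minimal sketch using build_opener/install_opener:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/76.0.3809.132 Safari/537.36')]
urllib.request.install_opener(opener)  # every urlopen call now uses this opener

response = urllib.request.urlopen('http://www.baidu.com/')
print(response.getcode())  # 200 if the request succeeded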

                                   ---- This concludes on November 14, 2019 ----
