4.2 urllib.parse
4.2.1 URL encoding and decoding
A URL may contain only a restricted set of characters (letters, digits, underscores, and a few punctuation marks). Any other character, such as ¥, a space, or a Chinese character, must be encoded first, otherwise the request cannot be sent.
urllib.parse.unquote is the URL decoding function: it decodes the %XX escape sequences in a URL (the UTF-8 encoding of characters other than letters, digits, and underscores) back into the corresponding characters.
urllib.parse.quote is the URL encoding function: it converts characters other than letters, digits, and underscores into their UTF-8 encoding in %XX form.
Sample code
The links we use often contain Chinese, for example: https://segmentfault.com/search?q=markdown%E8%AF%AD%E6%B3%95. Links like this must be encoded before a request can be sent.
import urllib.parse

url = 'https://segmentfault.com/search?q=markdown%E8%AF%AD%E6%B3%95'
unencd = urllib.parse.unquote(url)   # decode the %XX escapes back into characters
print(unencd)
encd = urllib.parse.quote(unencd)    # re-encode the non-safe characters
print(encd)
Output is:
https://segmentfault.com/search?q=markdown语法
https%3A//segmentfault.com/search%3Fq%3Dmarkdown%E8%AF%AD%E6%B3%95
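Note that quote also encoded the ':' and '?' that belong to the URL structure, because by default only '/' is treated as safe. The safe parameter of quote controls this; a minimal sketch (the standard-library parameter is real, the particular character set chosen here is just an illustration):

import urllib.parse

url = 'https://segmentfault.com/search?q=markdown语法'
# keep the characters that delimit the URL structure untouched,
# so that only the Chinese part is percent-encoded
encd = urllib.parse.quote(url, safe=':/?=&')
print(encd)
# https://segmentfault.com/search?q=markdown%E8%AF%AD%E6%B3%95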
4.2.2 parameter splicing
A URL often needs to carry parameters. When crawling a web page, these parameters must be spliced into a query string of a specific format, and characters other than letters, digits, and underscores must be converted into their percent-encoded form.
urllib.parse.urlencode splices the parameters into a query string and converts characters other than letters, digits, and underscores into their percent-encoded form.
Sample code
import urllib.parse

# '小A' means 'Small A' and '男' means 'male'; the Chinese values are kept
# so that the output below matches what the code actually produces
data = {'name': '小A', 'age': '15', 'sex': '男'}
query_string = urllib.parse.urlencode(data)
print(query_string)
Output is:
name=%E5%B0%8FA&age=15&sex=%E7%94%B7
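The standard library can also go in the other direction: urllib.parse.parse_qs parses a query string back into a dict (each value wrapped in a list). A minimal sketch:

import urllib.parse

query_string = 'name=%E5%B0%8FA&age=15&sex=%E7%94%B7'
params = urllib.parse.parse_qs(query_string)
print(params)
# {'name': ['小A'], 'age': ['15'], 'sex': ['男']}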
5. GET requests
Sample code
import urllib.request
import urllib.parse

word = input('Please enter what you want to search for: ')
url = 'http://www.baidu.com/s?'  # with https the page cannot be fetched; I don't know why
data = {'ie': 'utf-8', 'wd': word}
query_string = urllib.parse.urlencode(data)
url += query_string
response = urllib.request.urlopen(url)
filename = word + '.html'
with open(filename, 'wb') as fp:
    fp.write(response.read())
The result of crawling is an html file.
Please enter what you want to search for: China
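If you want to work with the page as text instead of saving the raw bytes, the response can be decoded first. A minimal sketch, assuming the page is UTF-8 encoded:

import urllib.request
import urllib.parse

data = {'ie': 'utf-8', 'wd': 'China'}
url = 'http://www.baidu.com/s?' + urllib.parse.urlencode(data)
response = urllib.request.urlopen(url)
print(response.status)                   # 200 if the request succeeded
html = response.read().decode('utf-8')   # assumes the page is UTF-8
print(html[:200])                        # first 200 characters of the page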
6. Camouflaging the UA
If you don't camouflage the crawler, it may be blocked by the website's anti-crawler measures. In the following code, for example, the crawler does not disguise itself at all, so the server can clearly see that the request comes from a crawler.
Sample code
import urllib.request

# the trailing / cannot be omitted, otherwise the url is incomplete
# and HTTP Error 400: Bad Request may occur
url = 'http://www.baidu.com/'
response = urllib.request.urlopen(url)
print(response.read().decode())
Open Fiddler and capture the request before running the program. The captured packets show that the UA (User-Agent) in the request header is Python-urllib/3.6, which makes the crawler easy for the other side to detect and block.
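You can also see the default UA without a packet-capture tool: the default headers that urllib attaches to every request live on the opener object. A minimal sketch:

import urllib.request

# the default headers urllib sends, including ('User-agent', 'Python-urllib/3.x')
opener = urllib.request.build_opener()
print(opener.addheaders)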
The UA of a normal browser can be copied straight from the packets captured by Fiddler, so the crawler can replace its own UA with a normal one as camouflage.
Sample code of the crawler after camouflage
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/76.0.3809.132 Safari/537.36'}
# the trailing / cannot be omitted, otherwise HTTP Error 400: Bad Request may occur
url = 'http://www.baidu.com/'
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode())
Fiddler now shows that the request header has been disguised as that of a normal browser.
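If every request should carry the same disguised UA, it can also be set once on a global opener instead of on each Request object. A minimal sketch using the standard install_opener mechanism:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/76.0.3809.132 Safari/537.36')]
urllib.request.install_opener(opener)  # urlopen now uses this opener globally
response = urllib.request.urlopen('http://www.baidu.com/')
print(response.read().decode())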
----This concludes on November 14, 2019----