import urllib.request
1, Introduction
urllib is the HTTP request package in Python's standard library. It includes the following modules:
- urllib.request: request module
- urllib.error: exception handling module
- urllib.parse: url parsing module
- urllib.robotparser: robots.txt parsing module
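As a rough illustration of two of the modules listed above, the sketch below splits a URL with urllib.parse and checks paths against robots.txt rules with urllib.robotparser. It runs without network access. The query string `wd=python` and the two robots.txt lines are made-up inputs for the example, not Baidu's real rules.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components (the query string is a made-up example).
parts = urlparse("http://www.baidu.com/s?wd=python")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Feed hypothetical robots.txt rules directly to the parser.
# Against a real site you would call set_url(...) and read() instead.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])

allowed = rules.can_fetch("*", "http://www.baidu.com/index.html")
blocked = rules.can_fetch("*", "http://www.baidu.com/private/page.html")
print(allowed, blocked)
```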
2, Crawl a specified URL
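with urllib.request.urlopen("http://www.baidu.com") as file:
    data = file.read()  # Read the entire response body as bytes
    # read() consumes the body, so the two calls below return b'' and [] here.
    # To read line by line, call them on a fresh response instead of read().
    line = file.readline()  # Read a single line
    lines = file.readlines()  # Read all remaining lines into a list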
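Network requests can fail, so a crawler usually wraps urlopen in error handling from urllib.error. The following is a minimal sketch, not the only way to do it. The helper name `fetch` and the 10-second default timeout are assumptions made for this example.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper; the timeout default is arbitrary.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # The request never got a usable answer (DNS failure, refused connection, ...).
        print("URL error:", err.reason)
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first. A call such as `fetch("http://www.baidu.com")` then yields either the page bytes or None.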
3, Download a page to a local file
1. Save the read data to a file
with open("./1.html", "wb") as f:
    f.write(data)
2. Use urlretrieve to download directly to a local file
# urlretrieve returns a (filename, headers) tuple
filename, headers = urllib.request.urlretrieve("http://www.baidu.com", "./2.html")
file.info()
<http.client.HTTPMessage at 0x1170c95be0>
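Reading the whole body with read() keeps the entire page in memory before it is written. For larger downloads, one alternative is to stream the response to disk in chunks. The sketch below uses shutil.copyfileobj for that. The helper name `download` and its timeout default are assumptions made for this example.

```python
import shutil
import urllib.request


def download(url, path, timeout=10):
    """Stream the response body for `url` into the file at `path`.

    `download` is a hypothetical helper, not part of urllib.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(path, "wb") as out:
        # copyfileobj copies in fixed-size chunks instead of loading everything at once.
        shutil.copyfileobj(response, out)
    return path
```

A call such as `download("http://www.baidu.com", "./3.html")` would write the page to `./3.html`, where the file name is only an illustration.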
4, Get response information
1. Get status code
file.getcode()
200
2. Get the URL
file.geturl()
'http://www.baidu.com'
3. Get the header information
file.getheaders()
[('Date', 'Mon, 09 Apr 2018 17:11:24 GMT'), ('Content-Type', 'text/html; charset=utf-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'Close'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'BAIDUID=4B4DEF37A228ED2722DF818D3F4A6C29:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=4B4DEF37A228ED2722DF818D3F4A6C29; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1523293884; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=1430_21090_22160; path=/; domain=.baidu.com'), ('P3P', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Cache-Control', 'private'), ('Cxy_all', 'baidu+230416a5fbb4a587682dea3e4efe4e59'), ('Expires', 'Mon, 09 Apr 2018 17:11:05 GMT'), ('X-Powered-By', 'HPHP'), ('Server', 'BWS/1.1'), ('X-UA-Compatible', 'IE=Edge,chrome=1'), ('BDPAGETYPE', '1'), ('BDQID', '0xab6114e500016321'), ('BDUSERID', '0')]
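The Content-Type header in the output above carries the charset needed to turn the response bytes into text. A minimal sketch of using it is shown below. The helper name `decode_body` and the UTF-8 fallback are assumptions made for this example.

```python
import urllib.request


def decode_body(response, default="utf-8"):
    """Decode a response body using the charset from its Content-Type header.

    `decode_body` is a hypothetical helper; it falls back to `default`
    when the header does not declare a charset.
    """
    charset = response.headers.get_content_charset() or default
    # errors="replace" avoids an exception if a few bytes do not match the charset.
    return response.read().decode(charset, errors="replace")
```

It would be used on an open response, for example `with urllib.request.urlopen("http://www.baidu.com") as file: text = decode_body(file)`. Individual headers can also be looked up by name, case-insensitively, with `file.headers["Content-Type"]`.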
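5, Handling special characters in URLs
Use quote for encoding and unquote for decoding. Both functions are defined in urllib.parse. urllib.request imports them, so the urllib.request.quote spelling used here also works, but urllib.parse.quote is the documented location.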
s = urllib.request.quote("http://www.baidu.com")
s
'http%3A//www.baidu.com'
urllib.request.unquote(s)
'http://www.baidu.com'
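By default quote leaves "/" unescaped, which is why only the colon changed in the output above. The sketch below shows the `safe` parameter and uses urlencode to build a query string, calling the functions from urllib.parse directly. The `/s` path and the `wd` parameter are illustrative assumptions, not a documented Baidu API.

```python
from urllib.parse import quote, unquote, urlencode

original = "http://www.baidu.com/s?wd=a b"

# Default: "/" is kept, while ":", "?", "=" and the space are percent-encoded.
encoded = quote(original)
print(encoded)

# unquote reverses the encoding.
print(unquote(encoded) == original)

# safe="" also encodes "/", which is useful for a single path segment.
segment = quote("a b/c", safe="")
print(segment)

# urlencode builds a query string from a dict; spaces become "+".
# The "/s" path and "wd" key are hypothetical values for this example.
search_url = "http://www.baidu.com/s?" + urlencode({"wd": "hello world"})
print(search_url)
```

In practice it is usually safer to encode only the parts of a URL that need it, such as a path segment or query values, rather than the whole URL.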