Using the urllib library in a Python crawler (1): crawling, saving pages, and obtaining request information


import urllib.request

I. Introduction

urllib is the HTTP request library in Python's standard library. It includes the following modules:

urllib.request: request module
urllib.error: exception handling module
urllib.parse: url parsing module
urllib.robotparser: robots.txt parsing module
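
As a rough illustration of how two of these modules are used, the sketch below splits a URL with urllib.parse and checks paths against robots.txt rules with urllib.robotparser. The rules and the /private/ path are made up for the example, and they are passed in as a list so that no network access is needed.

```python
import urllib.parse
import urllib.robotparser

# Split a URL into its components.
parts = urllib.parse.urlparse("http://www.baidu.com/s?wd=python")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Hypothetical robots.txt rules, supplied directly instead of downloaded.
rules = ["User-agent: *", "Disallow: /private/"]
parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Ask whether any user agent ("*") may fetch each URL under these rules.
allowed = parser.can_fetch("*", "http://www.baidu.com/index.html")
blocked = parser.can_fetch("*", "http://www.baidu.com/private/a.html")
print(allowed, blocked)
```

In a real crawler, set_url() followed by read() would load the site's actual robots.txt instead of the hard-coded list.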

II. Crawling a specified URL

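with urllib.request.urlopen("http://www.baidu.com") as file:
    data = file.read()  # Read the whole response body as bytes
    # The response is a one-pass stream: after read() has consumed it,
    # readline() returns b"" and readlines() returns [].
    # To read line by line, call one of these instead of read():
    # line = file.readline()    # Read a single line
    # lines = file.readlines()  # Read all lines and return them as a list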
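
In practice a request can fail or hang, and the body arrives as bytes rather than text. The following is a minimal sketch of one way to handle this, not the only one: the helper name fetch_text, the 10-second timeout, and the fallback to UTF-8 are all assumptions made for the example.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Use the charset from the Content-Type header if present,
            # otherwise assume UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return raw.decode(charset, errors="replace")
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print("request failed:", exc)
        return None
```

Calling fetch_text("http://www.baidu.com") would then return the page as a str, or None if the site cannot be reached.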

III. Downloading a page to a local file

1. Save the read data to a file

with open("./1.html","wb") as f:
    f.write(data)

2. Use urlretrieve to download directly to a local file

# urlretrieve returns a tuple: the local file name and the response headers
filename, headers = urllib.request.urlretrieve("http://www.baidu.com","./2.html")

# info() on the response opened in the crawling section returns its headers
file.info()

<http.client.HTTPMessage at 0x1170c95be0>

IV. Getting request information

1. Get status code

file.getcode()

2. Get the URL

file.geturl()

'http://www.baidu.com'

3. Get the header information

file.getheaders()

[('Date', 'Mon, 09 Apr 2018 17:11:24 GMT'),
 ('Content-Type', 'text/html; charset=utf-8'),
 ('Transfer-Encoding', 'chunked'),
 ('Connection', 'Close'),
 ('Vary', 'Accept-Encoding'),
 ('Set-Cookie',
  'BAIDUID=4B4DEF37A228ED2722DF818D3F4A6C29:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'BIDUPSID=4B4DEF37A228ED2722DF818D3F4A6C29; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'PSTM=1523293884; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie', 'BDSVRTM=0; path=/'),
 ('Set-Cookie', 'BD_HOME=0; path=/'),
 ('Set-Cookie', 'H_PS_PSSID=1430_21090_22160; path=/; domain=.baidu.com'),
 ('P3P', 'CP=" OTI DSP COR IVA OUR IND COM "'),
 ('Cache-Control', 'private'),
 ('Cxy_all', 'baidu+230416a5fbb4a587682dea3e4efe4e59'),
 ('Expires', 'Mon, 09 Apr 2018 17:11:05 GMT'),
 ('X-Powered-By', 'HPHP'),
 ('Server', 'BWS/1.1'),
 ('X-UA-Compatible', 'IE=Edge,chrome=1'),
 ('BDPAGETYPE', '1'),
 ('BDQID', '0xab6114e500016321'),
 ('BDUSERID', '0')]
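
The three calls above can be gathered into one summary. The sketch below assumes a hypothetical helper named describe; it reads headers through info() and uses get_all() because a header such as Set-Cookie can appear several times, as in the output above.

```python
import urllib.request


def describe(response):
    """Collect status code, final URL, and selected headers from a response."""
    headers = response.info()
    return {
        "status": response.getcode(),
        "url": response.geturl(),
        "content_type": headers.get("Content-Type"),
        # get_all returns every value of a repeated header, or None if absent.
        "cookies": headers.get_all("Set-Cookie") or [],
    }
```

Typical use would be to open a URL with urllib.request.urlopen in a with block and pass the response object to describe().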

V. Handling special characters in URLs

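Use quote for encoding and unquote for decoding. Both functions are defined in urllib.parse; they are reachable as urllib.request.quote and urllib.request.unquote only because urllib.request imports them.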

s = urllib.request.quote("http://www.baidu.com")
s

'http%3A//www.baidu.com'

urllib.request.unquote(s)

'http://www.baidu.com'
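
By default quote leaves only "/" unescaped, which is why the colon above became %3A. The sketch below shows the safe parameter and urlencode for building a query string; the keyword and the /s?wd= search path are only example values.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "python crawler&urllib"

# Encode a single value: the space and "&" are percent-encoded.
encoded = quote(keyword)

# Keep URL delimiters intact by listing them in safe.
full = quote("http://www.baidu.com/s?wd=a b", safe=":/?=")

# Build a query string from a dict; urlencode encodes spaces as "+".
query = urlencode({"wd": keyword})
url = "http://www.baidu.com/s?" + query

print(encoded)
print(full)
print(url)
```

unquote(encoded) returns the original keyword, so the encoding is reversible.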

Posted by bschwarz on Thu, 02 Apr 2020 22:36:14 -0700
