Python 3 Crawler Learning Notes (1): A First Crawler with urllib

Keywords: encoding Python JSON Fragment

I. What is a web crawler?

A web crawler (also known as a web spider) is a program or script that automatically fetches web information according to certain rules. If the Internet is compared to a large spider web, each node in the web holds a lot of data. A crawler is like a small spider that finds a website through its URL and retrieves information: HTML code, JSON data, or binary data (pictures, videos).

II. URLs - the address that tells the crawler where to crawl

The address shown in the browser is a URL, for example https://www.baidu.com/?tn=02049043_62_pg

The format of a URL (optional parts in square brackets): protocol://hostname[:port]/path[;parameters][?query][#fragment]

(1) protocol: the transfer protocol to use (for example http or https).

(2) hostname: the Domain Name System (DNS) host name or IP address of the server that stores the resource. Sometimes the username and password needed to connect to the server can be included in front of the host name (format: username:password@hostname).

(3) port (port number): an integer; optional. When omitted, the default port of the protocol is used - each transport protocol has a default port number, e.g. HTTP's default port is 80.

(4) path: a string of zero or more segments separated by "/", usually representing a directory or file address on the host.
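The components above can be inspected with urllib.parse.urlparse; a minimal sketch, where the example URL is illustrative rather than a real site:

```python
from urllib.parse import urlparse

# Break an example URL into the components described above.
url = "https://user:pass@www.example.com:8080/dir/page;type=a?tn=02049043_62_pg#frag"
parts = urlparse(url)
print(parts.scheme)    # protocol  -> https
print(parts.username)  # user
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /dir/page
print(parts.params)    # type=a
print(parts.query)     # tn=02049043_62_pg
print(parts.fragment)  # frag
```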

III. urllib

In Python, the urllib module is commonly used when learning to write crawlers. Here we introduce urllib with some examples.

urllib includes four modules: urllib.request, urllib.error, urllib.parse, urllib.robotparser

urllib.request: used to open and read URLs

urllib.error: contains the exceptions raised by urllib.request

urllib.parse: used to parse and process URLs

urllib.robotparser: used to parse robots.txt files
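For instance, urllib.request and urllib.error are typically used together. A minimal sketch of a fetch helper, where the function name and the test URL are illustrative:

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read()
    except urllib.error.HTTPError as e:   # the server answered with an error status
        print("HTTP error:", e.code)
    except urllib.error.URLError as e:    # could not reach the server at all
        print("URL error:", e.reason)
    return None

# .invalid is a reserved TLD that never resolves, so this prints the
# error reason and then None.
print(fetch("http://nonexistent.invalid/"))
```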

(1) urllib.request module - opening and reading websites

Simple Opening and Reading Websites

urllib.request.urlopen(url): used to open a website (it actually accepts other parameters too; see help(urllib.request.urlopen) in Python).

It returns an http.client.HTTPResponse object, whose body can be read with read().

import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
html = response.read()
print(html)

This time the result is raw, unreadable bytes:

b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

At this point decode() can be used to decode the page source, but you first have to find the encoding format in the page source - here it is utf-8 (inspect the element, open the head section, and look at the charset).

import urllib.request

response = urllib.request.urlopen("https://www.baidu.com")
html = response.read()
html = html.decode("utf-8")
print(html)
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

The result is now much neater.

Additional problem: Getting the encoding method

There is a problem here: finding the encoding manually is a bit troublesome. Let's look at another way to get the encoding:

Using chardet.detect() from the third-party chardet library, you can get a dictionary containing the detected encoding.

import chardet
import urllib.request

url = "https://www.baidu.com"

content = urllib.request.urlopen(url).read()
result = chardet.detect(content)  # returns a dictionary
encoding = result['encoding']     # get the encoding from the dictionary
print(encoding)
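Besides chardet, the standard library can often tell you the encoding directly: response.headers on an HTTPResponse is an email.message.Message object, and its get_content_charset() method reads the charset from the Content-Type header when the server sends one. A minimal offline sketch of the same API, using a hand-built Message in place of a real response:

```python
from email.message import Message

# Simulate response headers; on a real response you would call
# response.headers.get_content_charset() in exactly the same way.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
print(headers.get_content_charset())  # utf-8
```

Note that get_content_charset() returns None when the server omits the charset, in which case a detector like chardet is still the fallback.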

Other methods of the response object besides read()

geturl(): returns the URL as a string

info(): returns the meta-information of the page, i.e. the HTTP response headers (content type, server, date and time, cookies, etc.)

getcode(): returns the HTTP status code; 200 means the request succeeded. (The three-digit status code of an HTTP response is defined by the RFC 2616 specification.)

import urllib.request

url = "https://www.baidu.com"

html = urllib.request.urlopen(url)
print("geturl:%s\n" % (html.geturl()))
print("info:%s\n" % (html.info()))
print("getcode:%s\n" % (html.getcode()))
geturl:https://www.baidu.com

info:Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Sun, 22 Jul 2018 03:01:02 GMT
Etag: "5b3c3650-e3"
Last-Modified: Wed, 04 Jul 2018 02:52:00 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=98CE4FF720D964752DFFF7F8757770E8; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1532228462; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Strict-Transport-Security: max-age=0
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close



getcode:200

Sending data to the server with the data parameter of urlopen

data is an optional parameter of urlopen that sends data to the server (according to the HTTP specification, GET retrieves information while POST submits data to the server). If you pass the data parameter, the request is no longer a GET but a POST. However, data must be bytes (byte-stream encoded content); it can be produced from a dictionary with the urllib.parse.urlencode() function, which we will learn next, followed by encode().
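As a sketch of what this will look like (httpbin.org is a public echo service used here only for illustration, and the actual network call is left commented out):

```python
import urllib.parse
import urllib.request

# Encode a dictionary of form fields, then convert the string to bytes.
params = {"word": "hello", "page": "1"}
data = urllib.parse.urlencode(params).encode("utf-8")
print(data)  # b'word=hello&page=1'

# Passing data switches the request from GET to POST.
req = urllib.request.Request("http://httpbin.org/post", data=data)
# response = urllib.request.urlopen(req)  # uncomment to actually send the POST
```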

 

The next article continues with the use of other modules of urllib...

 

Posted by patrickm on Sun, 19 May 2019 03:36:40 -0700