Python 3 Web Crawler in Practice 21, Using urllib: Handling Exceptions

Keywords: Python socket Attribute encoding

In the previous section, we learned how to send a request. But what happens if an exception occurs, for example because of a poor network connection? If we do not handle these exceptions, the program will most likely stop with an error, so exception handling is essential.

urllib's error module defines the exceptions raised by the request module. When a problem occurs, the request module raises one of the exceptions defined in the error module. This section describes them in detail.

1. URLError

The URLError class comes from the error module of the urllib library. It inherits from the OSError class and is the base class for the exceptions in the error module, so any exception raised by the request module can be handled by catching this class.

It has one attribute, reason, which holds the cause of the error.

Here is an example:

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

Here we request a page that does not exist. Normally this would raise an error, but because we catch the URLError exception, the program prints the following instead:

Not Found

The program does not stop with an error; it prints the message above. Catching the exception prevents the program from terminating abnormally and lets us handle the failure ourselves.
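The inheritance relationship described in this section can be checked without touching the network. The following is only a small sketch: the URLError instance is constructed by hand and its message is made up for illustration, whereas in real code the request module raises it for you.

```python
from urllib import error

# URLError is part of the OSError hierarchy.
inherits_oserror = issubclass(error.URLError, OSError)

# Build a URLError by hand; the reason text is invented for this sketch.
sample = error.URLError('connection refused (illustrative)')

try:
    raise sample
except OSError as e:
    # Catching the parent class OSError also catches URLError.
    caught_reason = e.reason

print(inherits_oserror)
print(caught_reason)
```

Because URLError is an OSError, a broad `except OSError` clause will also catch it. Catching URLError itself is usually clearer, since it states which kind of failure is expected.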

2. HTTPError

HTTPError is a subclass of URLError designed for HTTP request errors, such as a failed authentication request.

It has three attributes:

  • code: the HTTP status code, such as 404 (page not found) or 500 (internal server error).
  • reason: like the parent class, the cause of the error.
  • headers: the response headers returned by the server.

Let's look at an example:

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

Output:

Not Found
404
Date: Mon, 17 Jun 2019 04:52:50 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Status: 404 Not Found
Cache-Control: no-cache
Strict-Transport-Security: max-age=31536000
X-XSS-Protection: 1; mode=block
X-Request-Id: e65fb029-a4fd-46e2-91c3-9616ccc2f879
X-Runtime: 0.006814
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-Powered-By: Phusion Passenger 6.0.2
Server: nginx + Phusion Passenger 6.0.2

This requests the same URL as before, but here we catch the HTTPError exception and print its reason, code, and headers attributes.
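The three attributes can also be inspected offline by constructing an HTTPError by hand. This is a sketch for experimentation only: the URL, the header value, and the empty body are assumptions chosen for illustration, and in normal use the request module builds the exception from the server's actual response.

```python
import io
from email.message import Message
from urllib import error

# Hypothetical response headers for this sketch.
hdrs = Message()
hdrs['Content-Type'] = 'text/html; charset=utf-8'

# Arguments: url, code, msg, hdrs, fp (fp is the response body stream).
err = error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', hdrs, io.BytesIO(b'')
)

print(err.code)                         # HTTP status code
print(err.reason)                       # cause of the error
print(err.headers['Content-Type'])      # response headers
print(isinstance(err, error.URLError))  # HTTPError is also a URLError

err.close()
```

The last check shows why an `except error.URLError` clause also catches HTTPError.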

Because URLError is the parent class of HTTPError, we can catch the subclass first and then the parent class. The code above is therefore better written as follows:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

This way, HTTPError is caught first, giving us its status code, reason, headers, and other details. If the exception is not an HTTPError, the URLError clause catches it and prints the cause. Finally, the else clause handles the normal case. The order matters: if the URLError clause came first, it would also catch HTTPError and the more specific clause would never run.
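The same subclass-first pattern can be wrapped in a small helper so the calling code receives a result instead of an exception. This is a minimal sketch, not part of urllib: the name fetch_status, the opener parameter (added here so the function can be exercised without a network), and the tuple layout are all assumptions made for illustration.

```python
from urllib import request, error


def fetch_status(url, opener=request.urlopen, timeout=10):
    """Return a (kind, code, reason) tuple describing the outcome of a request.

    `opener` is a hypothetical hook that defaults to request.urlopen; it only
    needs to accept (url, timeout=...) and return an object with `.status`.
    """
    try:
        response = opener(url, timeout=timeout)
    except error.HTTPError as e:
        # Subclass first: the server replied, but with an error status.
        return ('http_error', e.code, e.reason)
    except error.URLError as e:
        # Parent class second: the request failed before a usable HTTP reply.
        return ('url_error', None, e.reason)
    else:
        # Runs only when no exception was raised.
        return ('ok', response.status, None)
```

A caller can then branch on the first element of the tuple, for example retrying on 'url_error' and logging the status code on 'http_error'.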

Sometimes the reason attribute is not a string but an exception object. Consider the following example:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Here we set a very short timeout (0.01 seconds) to force a timeout exception.

The results are as follows:

<class 'socket.timeout'>
TIME OUT

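As you can see, the reason attribute is an instance of the socket.timeout class, so we can use isinstance() to check its type and handle the exception more precisely. (Since Python 3.10, socket.timeout is an alias of the built-in TimeoutError, so newer versions print <class 'TimeoutError'> instead.)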
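The type check can be factored into a small predicate. The sketch below is illustrative: is_timeout is a hypothetical helper, and the exceptions are constructed by hand so it runs offline. It also accepts a bare socket.timeout, on the assumption that a timeout may sometimes surface without being wrapped in URLError (for instance while the response is being read); verify this for your Python version.

```python
import socket
from urllib import error


def is_timeout(exc):
    """Return True if exc represents a timeout, wrapped in URLError or not."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, error.URLError) and isinstance(exc.reason, socket.timeout)


# Hand-built exceptions for illustration; messages are made up.
wrapped = error.URLError(socket.timeout('timed out'))
other = error.URLError('Name or service not known')

print(is_timeout(wrapped))
print(is_timeout(other))
print(is_timeout(socket.timeout('timed out')))
```

In a crawler, such a predicate lets the except clause decide whether to retry (timeout) or give up (other failures).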

3. Concluding remarks

This section described how to use the error module. Catching exceptions at the right level lets us identify failures more precisely and makes the program more robust.

Posted by marms on Sat, 03 Aug 2019 08:34:11 -0700