0. Using the requests Library
Although the urllib library is also widely used and ships with Python (no separate installation needed), most Python crawlers now use the requests library to handle complex HTTP requests. requests has simple syntax, is easy to use, and is gradually becoming the de facto standard for web scraping.
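For example, fetching the same page with urllib takes a bit more ceremony than with requests. The snippet below is only a rough comparison sketch (it assumes Python 3 and that requests is already installed; the URL is just an example):

from urllib.request import Request, urlopen
import requests

url = 'http://www.baidu.com'   # example URL; substitute any page you want to fetch

# urllib: build a Request, set the header manually, decode the bytes yourself
req = Request(url, headers={'User-Agent': 'wswp'})
html_urllib = urlopen(req).read().decode('utf-8')

# requests: one call; header handling and decoding are built in
html_requests = requests.get(url, headers={'User-Agent': 'wswp'}).text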
1. Installation of requests Library
Install it with pip by entering the following at the command line:
pip install requests
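To confirm the installation succeeded, one quick check (the version number you see will depend on what pip installed) is:

python -c "import requests; print(requests.__version__)"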
2. Example code
The example below sets a custom HTTP request header as a simple anti-crawler countermeasure, and also demonstrates proxy configuration and exception handling.
import requests

def download(url, num_retries=2, user_agent='wswp', proxies=None):
    '''Download a given URL and return the page content.

    Parameters:
        url (str): URL to download

    Keyword parameters:
        user_agent (str): user agent string (default: 'wswp')
        proxies (dict): proxies, keys 'http'/'https', values 'http(s)://IP'
        num_retries (int): number of retries on 5xx errors (default: 2)

    A 5xx status code is a server error: the server failed to fulfil an
    apparently valid request.
    https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
    '''
    print('==========================================')
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}  # Set the header; the default one is sometimes blocked by the site
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)  # Simple and direct: requests.get(url)
        html = resp.text  # Page content as a string
        if resp.status_code >= 400:  # 4xx client error: return None
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:  # 5xx server error
                # Retry (at most num_retries times), keeping the same user agent and proxies
                return download(url, num_retries - 1, user_agent, proxies)
    except requests.exceptions.RequestException as e:  # Any other error: report it and return None
        print('Download error:', e)
        html = None
    return html

print(download('http://www.baidu.com'))
Result:
Downloading: http://www.baidu.com
<!DOCTYPE html>
<!--STATUS OK-->
...
</script>
<script>
    if(navigator.cookieEnabled){
        document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
    }
</script>
</body>
</html>
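The sample call above does not exercise the user_agent and proxies parameters. Below is a usage sketch showing how they could be passed; the proxy addresses are placeholders, not real servers:

proxies = {
    'http': 'http://10.10.1.10:3128',    # hypothetical HTTP proxy
    'https': 'http://10.10.1.10:1080',   # hypothetical HTTPS proxy
}
html = download('http://www.baidu.com', user_agent='my-crawler', proxies=proxies)
if html is not None:
    print(len(html), 'characters downloaded')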