Getting started with web crawlers: your first crawler project (the requests library)

Keywords: Python, pip, network

0. Why use the requests library

Although the urllib library is also widely used and ships with Python (no separate installation needed), most Python crawlers today use the requests library to handle complex HTTP requests. requests has a simple syntax, is easy to understand and use, and is gradually becoming the de facto standard for web scraping.
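As a quick illustration of that simplicity, here is a minimal sketch of fetching a page with requests (the URL is just an example; any reachable page works):

import requests

resp = requests.get('http://www.baidu.com')  # send a GET request
print(resp.status_code)                      # e.g. 200 on success
print(resp.text[:200])                       # first 200 characters of the HTML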

1. Installation of requests Library
Install it with pip by entering the following at the command line (cmd on Windows):

pip install requests
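One quick way to confirm the installation succeeded is to import the library and print its version:

import requests
print(requests.__version__)  # prints the installed version, e.g. 2.x.x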


2. Example code
The example below sets the HTTP request headers (a simple way to get past basic anti-crawler checks), configures proxy parameters, and handles exceptions.

import requests


def download(url, num_retries=2, user_agent='wswp', proxies=None):
    """Download the given URL and return the page content.

    Parameters:
        url (str): URL to download
    Keyword parameters:
        user_agent (str): user agent string (default: wswp)
        proxies (dict): proxy dictionary with keys 'http'/'https'
            and values of the form 'http(s)://IP'
        num_retries (int): number of retries on 5xx errors (default: 2)
            A 5xx server error means the server failed to complete an
            apparently valid request.
            https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
    """
    print('==========================================')
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}  # Custom header; some sites reject the default User-Agent
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)  # Simple GET request
        html = resp.text  # Page content as a string
        if resp.status_code >= 400:  # 4xx client errors and up: treat as an error, return None
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # 5xx server error: retry up to num_retries times, keeping the same settings
                return download(url, num_retries - 1, user_agent, proxies)

    except requests.exceptions.RequestException as e:  # Network-level errors (timeouts, DNS failures, etc.)
        print('Download error:', e)
        html = None
    return html  # Return the HTML (or None on failure)


print(download('http://www.baidu.com'))
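If you need a different user agent or a proxy, pass them explicitly. A short usage sketch follows; the proxy address is only a placeholder and must be replaced with a working one:

# Hypothetical call with a custom user agent and a placeholder proxy
proxies = {
    'http': 'http://127.0.0.1:8080',   # replace with a real proxy address
    'https': 'http://127.0.0.1:8080',
}
html = download('http://www.baidu.com', user_agent='my-crawler', proxies=proxies)
print(html is not None)  # True if the download succeeded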

Result:

Downloading: http://www.baidu.com
<!DOCTYPE html>
<!--STATUS OK-->

</script>

<script>
if(navigator.cookieEnabled){
    document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
}
</script>



</body>
</html>
