Python Crawler Series 5: Proxies and Cookies

Keywords: Python Session github Database Big Data

1. ProxyHandler Processing (Proxy Server)

1. Using a proxy IP is a common technique in web crawling.

2. Free proxy server addresses can be found at sites such as:

www.xicidaili.com

www.goubanjia.com

3. A proxy hides the real origin of a request. A single proxy should not access the same site too frequently either, so a crawler needs many proxies to rotate through.

4. Basic usage steps:

(1) Set proxy address

(2) Create a ProxyHandler

(3) Create Opener

(4) Install Opener

 

"""
Use a proxy to access the Baidu homepage
"""
from urllib import request, error

if __name__ == "__main__":
    # The proxy dict is keyed by URL scheme; an "http" entry applies only to
    # http:// URLs, so the target URL uses http here
    url = "http://www.baidu.com"
    # Set the proxy address
    proxy = {"http": "39.106.114.143:80"}
    # Create the ProxyHandler
    proxy_handler = request.ProxyHandler(proxy)
    # Create the opener
    opener = request.build_opener(proxy_handler)
    # Install the opener globally
    request.install_opener(opener)

    # From now on, request.urlopen() sends http requests through the proxy
    try:
        rsp = request.urlopen(url)
        html = rsp.read().decode()
        print(html)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)
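
Point 3 above says that a crawler needs many proxies. A minimal sketch of rotating through a proxy pool might look like the following. The pool addresses are placeholders from a documentation-only IP range, and the names `PROXY_POOL`, `make_opener`, and `fetch_with_rotation` are hypothetical helpers, not part of the original source.

```python
import random
from urllib import request

# Hypothetical pool; these addresses are placeholders, not working proxies
PROXY_POOL = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
    "203.0.113.12:8080",
]


def make_opener(proxy_addr):
    """Build an opener that sends http and https requests through proxy_addr."""
    handler = request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
    return request.build_opener(handler)


def fetch_with_rotation(url, pool, attempts=3, timeout=5):
    """Try up to `attempts` randomly chosen proxies; return the page text or None."""
    for proxy_addr in random.sample(pool, min(attempts, len(pool))):
        opener = make_opener(proxy_addr)
        try:
            with opener.open(url, timeout=timeout) as rsp:
                return rsp.read().decode()
        except OSError as e:  # URLError and socket timeouts are OSError subclasses
            print(f"{proxy_addr} failed: {e}")
    return None
```

Calling `opener.open()` directly, instead of `install_opener()`, keeps each request tied to its own proxy, so no global state changes between attempts.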

2. Cookies

1. Because HTTP is a stateless protocol, cookies and sessions were introduced as a supplementary mechanism to make up for this.

2. A cookie is a piece of information sent to the user (that is, to the HTTP client, such as a browser). A session is the corresponding half, stored on the server, that records the user's information.

3. Differences between cookies and sessions

(1) They are stored in different locations; (2) cookies are not secure; (3) sessions are kept on the server for a certain time and then expire; (4) a single cookie holds no more than about 4 KB of data, and many browsers limit a site to at most 20 cookies.

4. Session storage location

(1) On the server; (2) sessions are typically kept in memory or in a database.
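
To make the cookie/session split concrete, here is a toy sketch of a server-side session store: the server keeps the user data and an expiry time, and the client's cookie carries only the session ID. The 30-minute lifetime and the names `SESSION_TTL`, `create_session`, and `load_session` are assumptions for illustration; a real site would typically rely on its web framework for this.

```python
import time
import uuid

SESSION_TTL = 1800  # assumed lifetime in seconds
_sessions = {}  # session_id -> (expires_at, data); in-memory stand-in for a database


def create_session(user_data, now=None):
    """Store user_data on the server and return the ID the cookie would carry."""
    now = time.time() if now is None else now
    session_id = uuid.uuid4().hex
    _sessions[session_id] = (now + SESSION_TTL, user_data)
    return session_id


def load_session(session_id, now=None):
    """Return the stored data, or None if the session is unknown or expired."""
    now = time.time() if now is None else now
    entry = _sessions.get(session_id)
    if entry is None:
        return None
    expires_at, data = entry
    if now >= expires_at:
        del _sessions[session_id]  # expired sessions are dropped
        return None
    return data
```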

5. Case study:

Without a cookie, the site returns the page for a user who is not logged in.

Logging in with a cookie:

 

from urllib import request

if __name__ == "__main__":
    url = "https://leetcode-cn.com/"
    # Cookie string copied from a logged-in browser session
    headers = {

        "cookie":"_ga=GA1.2.606835635.1580743041; gr_user_id=d15dfef5-20a7-44a4-8181-f088825ee052; grwng_uid=1d99b83c-8186-4ffa-905e-c912960d9049; __auc=952db4f31700ba0a3811855dc67; csrftoken=zW1tIWrqqDGQ2gDeEAiRM3Pu41f3qetXjvNP5jxuDpekTTyHj262rmfnO2PtXiCI; LEETCODE_SESSION=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJfYXV0aF91c2VyX2lkIjoiOTUxOTE1IiwiX2F1dGhfdXNlcl9iYWNrZW5kIjoiYXV0aGVudGljYXRpb24uYXV0aF9iYWNrZW5kcy5QaG9uZUF1dGhlbnRpY2F0aW9uQmFja2VuZCIsIl9hdXRoX3VzZXJfaGFzaCI6ImQ0ODczNmFiODAwZjk0ZTU3ZjAwMmQ4YjU1YjRmNWZmMDViMDllOTIiLCJpZCI6OTUxOTE1LCJlbWFpbCI6IiIsInVzZXJuYW1lIjoicnVpZ2VnZTY2IiwidXNlcl9zbHVnIjoicnVpZ2VnZTY2IiwiYXZhdGFyIjoiaHR0cHM6Ly9hc3NldHMubGVldGNvZGUtY24uY29tL2FsaXl1bi1sYy11cGxvYWQvZGVmYXVsdF9hdmF0YXIucG5nIiwicGhvbmVfdmVyaWZpZWQiOnRydWUsInRpbWVzdGFtcCI6IjIwMjAtMDItMDMgMTU6MTg6MDYuNjYw160b58f59beeae32; a2873925c34ecbd2_gr_session_id=e9ba4267-3dbc-47c1-aa02-c6e92e8eb4a8; a2873925c34ecbd2_gr_last_sent_sid_with_cs1=e9ba4267-3dbc-47c1-aa02-c6e92e8eb4a8; a2873925c34ecbd2_gr_session_id_e9ba4267-3dbc-47c1-aa02-c6e92e8eb4a8=true; _gid=GA1.2.1242221115.1580917808; Hm_lpvt_fa218a3ff7179639febdb15e372f411c=1580917870; a2873925c34ecbd2_gr_cs1=ruigege66; _gat_gtag_UA_131851415_1=1"

    }

    req = request.Request(url, headers=headers)
    rsp = request.urlopen(req)
    html = rsp.read().decode()
    # Write the page as UTF-8 so that no characters are dropped
    with open("rsp.html", "w", encoding="utf-8") as f:
        f.write(html)
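
The cookie header used above is just a list of `name=value` pairs separated by semicolons. As a small sketch, the standard library's `http.cookies.SimpleCookie` can split such a string into a dict, which is easier to inspect or edit than one long literal. The helper names `parse_cookie_header` and `build_cookie_header` and the sample values are hypothetical.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header_value):
    """Split a Cookie header string into a {name: value} dict."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}


def build_cookie_header(cookies):
    """Join a {name: value} dict back into a Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


if __name__ == "__main__":
    # Sample values are made up for illustration
    raw = "csrftoken=abc123; sessionid=xyz789"
    cookies = parse_cookie_header(raw)
    print(cookies)
    headers = {"cookie": build_cookie_header(cookies)}
    print(headers)
```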

3. Source Code

Reptitle5_Proxy.py

Reptitle6_Cookie.py

https://github.com/ruigege66/PythonReptile/blob/master/Reptitle5_Proxy.py

https://github.com/ruigege66/PythonReptile/blob/master/Reptitle6_Cookie.py

2. CSDN: https://blog.csdn.net/weixin_44630050

3. Blog Park: https://www.cnblogs.com/ruigege0000/


 

Posted by landavia on Wed, 05 Feb 2020 08:49:12 -0800