Python crawler in practice: crawling Kuaidaili to build a proxy IP pool

Keywords: Python crawler

🌹 Preface

The blogger has started a hands-on crawler tutorial series. Your attention is very welcome!!!
Part I: Python crawler in practice (1): paging through the data and crawling it into SQL Server
Part II: Python crawler in practice (2): crawling Kuaidaili to build a proxy IP pool

Likes and bookmarks give the blogger more motivation to create, and more updates will follow!!!

Purpose of building an IP pool

When crawling, most websites have some anti-crawling measures in place. Some sites limit the access speed or the number of visits from each IP; exceed the limit and your IP gets banned. Access speed is simple to handle: just leave an interval between requests to avoid hitting the site too frequently. The limit on the number of visits, however, requires proxy IPs. Rotating through multiple proxy IPs to access the target website solves the problem effectively.

There are many proxy service websites on the Internet that sell proxies, and some also offer free ones, but the free proxies tend to have poor availability. If your needs are demanding, you can buy paid proxies with better availability. Of course, we can also build our own proxy pool: fetch free proxy IPs from the various proxy sites, test their availability (by visiting Baidu), save the working ones to a file, and call on them whenever needed.
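
To make the idea of rotating proxies concrete, here is a minimal sketch. The proxy addresses are placeholders for illustration, not working proxies, and the rotation itself is just a simple round-robin with itertools:

import requests
from itertools import cycle

# Placeholder proxies purely for illustration; replace them with real ones from your pool
proxy_pool = [
    {'http': 'http://10.0.0.1:8888'},
    {'http': 'http://10.0.0.2:8888'},
]
proxy_cycle = cycle(proxy_pool)  # round-robin over the pool

for _ in range(4):
    proxies = next(proxy_cycle)  # take the next proxy in turn
    try:
        response = requests.get('http://www.baidu.com/', proxies=proxies, timeout=5)
        print(proxies, response.status_code)
    except requests.RequestException as e:
        print(proxies, 'failed:', e)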

Crawl target

The page we want to crawl is: https://www.kuaidaili.com/free/inha/

The red box in the screenshot marks what we want to crawl:

The blogger's final crawl result looks like this:

Preparation

I use Python 3.8 and the VS Code editor. The required libraries are requests, lxml (for etree), and time.

Import the required libraries at the top:

import requests         # basic Python HTTP/crawler library
from lxml import etree  # converts a web page into an Element object for XPath queries
import time             # sleep between requests to avoid crawling too fast

Ready to start code analysis!

Code analysis

First, here is my overall step-by-step approach (a skeleton of the resulting class follows the list):

  • Step 1: construct the URL of each listing page, send the request, and get the response
  • Step 2: parse the response and group the data
  • Step 3: extract the fields from each group
  • Step 4: check the availability of each proxy IP
  • Step 5: save the working proxies to a file
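
As a roadmap, the five steps map onto the methods of a small class; this is just a skeleton of the complete code shown later in this post:

class daili:
    def send_request(self, page):      # Step 1: request one listing page
        ...
    def parse_data(self, data):        # Step 2: parse the page and group the rows
        ...
    def check_ip(self, proxies_list):  # Step 4: test each proxy against Baidu
        ...
    def save(self, can_use):           # Step 5: write the working proxies to IP.txt
        ...
    def run(self):                     # main logic; Step 3 (extraction) happens here
        ...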

Step 1

Construct the URL of the listing page, send the request, and get the response

# 1. Send request and get response
def send_request(self,page):
    print("=============Grabbing page{}page===========".format(page))
    # Target page, add headers parameter
    base_url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}

    # Send request: simulate the browser to send a request and obtain the response data
    response = requests.get(base_url,headers=headers)
    data = response.content.decode()
    time.sleep(1)

    return data

Some readers may be wondering: what exactly are these headers for?

  • They keep the server from recognizing us as a crawler: we imitate a browser's header information when sending the request
  • This "disguise" is a must!!! (A quick way to check what the server actually receives is shown below.)
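
If you are curious what the server actually receives, a quick optional check (my own addition, assuming httpbin.org, a public request-echo service, is reachable) is to send the same headers there and print them back:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}

# httpbin.org/headers echoes back the headers it received
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'].get('User-Agent'))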

Step 2

Parse the data and group it

As the screenshot below shows, all the data we need is inside the tr tags:

So the grouping is done on the tr tag:

# 2. Parse the data
def parse_data(self,data):

    # Convert the HTML string into an Element object
    html_data = etree.HTML(data)
    # Group the data: one element per table row
    parse_list = html_data.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')

    return parse_list
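
As a quick sanity check, here is a sketch (assuming the methods live on the daili class defined in the complete code below): fetch page 1, parse it, and count how many rows the XPath returns:

dl = daili()
data = dl.send_request(1)          # Step 1: fetch listing page 1
parse_list = dl.parse_data(data)   # Step 2: one Element per <tr> row
print(len(parse_list))             # number of proxies listed on that page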

Step 3

Extract the data we need from each group: the IP, the type, and the port number

# (this extraction happens inside run(); see the complete code below)
parse_list = self.parse_data(data)
for tr in parse_list:
    proxies_dict = {}
    http_type = tr.xpath('./td[4]/text()')   # proxy type, e.g. HTTP
    ip_num = tr.xpath('./td[1]/text()')      # IP address
    port_num = tr.xpath('./td[2]/text()')    # port number

    # xpath() returns a list, so join it into a plain string
    http_type = ' '.join(http_type)
    ip_num = ' '.join(ip_num)
    port_num = ' '.join(port_num)

    # Build {'HTTP': 'ip:port'} and collect it
    proxies_dict[http_type] = ip_num + ":" + port_num

    proxies_list.append(proxies_dict)

After the splicing, each proxy is stored in the list in the form {'HTTP': '36.111.187.154:8888'}, which is convenient for the next steps!
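
One caveat worth knowing: requests picks a proxy by matching the dictionary key against the lowercase URL scheme ('http' or 'https'), so an uppercase key like 'HTTP' may simply be ignored and the request will go out directly. If you want to be safe, you can lowercase the type when building the dict; a small self-contained sketch with hypothetical values (my own addition, not part of the original code):

# Hypothetical example values, just to illustrate the normalization
http_type = 'HTTP'
ip_num = '36.111.187.154'
port_num = '8888'

proxies_dict = {http_type.lower(): ip_num + ':' + port_num}
print(proxies_dict)  # {'http': '36.111.187.154:8888'}, a key requests will match for http:// URLs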

Step 4

Check the availability of each IP. Since these are free proxies, some will not work and some are slow. Here we let each spliced proxy try to reach Baidu within 0.1 seconds, and the ones that succeed are saved into another list! (Note that a 0.1 second timeout is very strict, so do not be surprised if only a few proxies pass.)

def check_ip(self,proxies_list):
	headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
	
	can_use = []
	for proxies in proxies_list:
	    try:
	        response = requests.get('https://www.baidu.com/',headers=headers,proxies=proxies,timeout=0.1)
	        if response.status_code == 200:
	            can_use.append(proxies)
	
	    except Exception as e:
	        print(e)
	
	return can_use
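
If you want to confirm that a proxy is really being routed through (and not silently skipped), one option, again my own addition rather than part of the tutorial, is to ask an echo service such as httpbin.org/ip which origin IP it sees:

import requests

proxies = {'http': 'http://36.111.187.154:8888'}  # hypothetical proxy, for illustration only

try:
    # httpbin.org/ip reports the IP address the request appeared to come from
    response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
    print(response.json())  # should show the proxy's IP, not your own
except requests.RequestException as e:
    print('proxy failed:', e)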

Step 5

Save the proxies with good access speed to a file so we can call on them later

def save(self,can_use):
	file = open('IP.txt', 'w')
	for i in range(len(can_use)):
	    s = str(can_use[i])+ '\n'
	    file.write(s)
	file.close()

Complete code

import requests
from lxml import etree 
import time

class daili:

    # 1. Send request and get response
    def send_request(self,page):
        print("=============Grabbing page{}page===========".format(page))
        # Target page, add headers parameter
        base_url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}

        # Send request: simulate the browser to send a request and obtain the response data
        response = requests.get(base_url,headers=headers)
        data = response.content.decode()
        time.sleep(1)

        return data

    # 2. Parse the data
    def parse_data(self,data):

        # Convert the HTML string into an Element object
        html_data = etree.HTML(data)
        # Group the data: one element per table row
        parse_list = html_data.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')

        return parse_list

    # 4. Detect proxy IP
    def check_ip(self,proxies_list):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}

        can_use = []
        for proxies in proxies_list:
            try:
                response = requests.get('https://www.baidu.com/',headers=headers,proxies=proxies,timeout=0.1)
                if response.status_code == 200:
                    can_use.append(proxies)

            except Exception as e:
                print(e)
                
        return can_use

    # 5. Save to file
    def save(self,can_use):

        file = open('IP.txt', 'w')
        for i in range(len(can_use)):
            s = str(can_use[i])+ '\n'
            file.write(s)
        file.close()
    
    # Implement main logic
    def run(self):
        proxies_list = []
        # Page through the listings; I only crawled four pages here (change the 5 to crawl more)
        for page in range(1,5):
            data = self.send_request(page)
            parse_list = self.parse_data(data)
            # 3. Obtain data
            for tr in parse_list:
                proxies_dict  = {}
                http_type = tr.xpath('./td[4]/text()')
                ip_num = tr.xpath('./td[1]/text()')
                port_num = tr.xpath('./td[2]/text()')

                http_type = ' '.join(http_type)
                ip_num = ' '.join(ip_num)
                port_num = ' '.join(port_num)

                proxies_dict[http_type] = ip_num + ":" + port_num

                proxies_list.append(proxies_dict)
        
        print("Obtained proxy IP number:",len(proxies_list))

        can_use = self.check_ip(proxies_list)

        print("Available agents IP number:",len(can_use)) 
        print("Available agents IP:",can_use) 

        self.save(can_use)

if __name__ == "__main__": 
    dl = daili()
    dl.run()

The output after running looks like this:

And the file is generated:


Nice!!!

How to use it

The IPs are saved in the file, but some of you may not know how to use them?

Here we randomly pick an IP from the file and use it to access a website, with the help of the random library

import random
import requests

# Open the file and read it line by line
with open("IP.txt", "r") as f:
    file = f.readlines()

# Convert each line back into a dict and collect them in a list, so we can pick a proxy at random
item = []
for proxies in file:
    proxies = eval(proxies.replace('\n',''))  # strip the newline and turn the string back into a dict (we wrote this file ourselves, so eval is acceptable here)
    item.append(proxies)

proxies = random.choice(item)  # pick one proxy at random

url = 'https://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}

response = requests.get(url,headers=headers,proxies=proxies)
print(response.status_code)  # status code 200 means the request succeeded
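
Free proxies die quickly, so in practice you may also want to retry with a different random proxy when one fails. A minimal sketch of that idea (my own addition, reusing the item list and headers built above):

import random
import requests

def get_with_random_proxy(url, item, headers, retries=3):
    # Try up to `retries` randomly chosen proxies before giving up
    for _ in range(retries):
        proxies = random.choice(item)
        try:
            return requests.get(url, headers=headers, proxies=proxies, timeout=5)
        except requests.RequestException as e:
            print(proxies, 'failed, retrying:', e)
    return None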

If anything here is wrong, I hope you will point it out!!! If something is unclear, leave a message in the comment area and I will reply! Brothers, a like and a bookmark would be much appreciated, and I will update more crawler practice posts when I have time!!!
