Python crawler 01 - creating a simple crawler (with a 100 GB novel-site database)

Keywords: Database Selenium xml Python

Disclaimer: the database provided in this article is for technical verification only; any commercial use or redistribution based on it is prohibited, and any legal liability arising from such use shall be borne by the user. By continuing to read this article, you are deemed to agree to this statement!


How crawlers work

A crawler is mainly divided into three parts:
1. Download module: the workhorse of the crawl; as the name says, it provides the download service
2. Analysis module: it has two jobs: extract new URLs from a downloaded page and feed them back to the download module for further crawling, and hand the parsed page content to the write module for storage
3. Write module: write the parsed results, or the URLs still in the queue, to the database

Among them, the download module and the write module are IO-intensive. A high-performance approach is to write each of them as asynchronous-IO coroutine code, let the analysis module coordinate in the middle, and use queues for inter-process communication, as sketched below.
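A minimal sketch of this three-process, queue-based layout is given below (the worker bodies are placeholders; the actual download, parsing and writing code is covered in the sections that follow):

# Sketch of the three-process, queue-based layout (worker bodies are placeholders)
from multiprocessing import Process, Queue


def downloader(url_q, html_q):
    while True:
        url = url_q.get()
        if url is None:                     # poison pill: stop and pass it on
            html_q.put(None)
            break
        html_q.put((url, "<html>...</html>"))   # placeholder for the real download


def parser(html_q, result_q):
    while True:
        item = html_q.get()
        if item is None:
            result_q.put(None)
            break
        url, html = item
        result_q.put((url, html))           # placeholder for the real parsing


def writer(result_q):
    while True:
        item = result_q.get()
        if item is None:
            break
        print("write to db:", item[0])      # placeholder for the real database write


if __name__ == '__main__':
    url_q, html_q, result_q = Queue(), Queue(), Queue()
    workers = [Process(target=downloader, args=(url_q, html_q)),
               Process(target=parser, args=(html_q, result_q)),
               Process(target=writer, args=(result_q,))]
    for w in workers:
        w.start()
    url_q.put("http://www.xxx.com/book/1/1000/")
    url_q.put(None)                         # signal shutdown
    for w in workers:
        w.join()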

If you have a multi-core CPU, increase the number of processes as appropriate.

If you write a distributed crawler, you also need to add communication between servers to assign tasks.

1. Download module

The download module receives a URL task and returns the request result. For websites with strong anti-crawling measures, we need an IP proxy rotation pool, custom request headers, and simulated AJAX requests; a proxy-rotation sketch follows below.
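For example, a simple rotation over a proxy pool could look like the sketch below (the proxy addresses are placeholders; requests accepts them through its proxies parameter):

# Sketch of a simple IP proxy rotation (the proxy addresses are placeholders)
import itertools

import requests

proxy_pool = itertools.cycle([
    {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'},
    {'http': 'http://5.6.7.8:3128', 'https': 'http://5.6.7.8:3128'},
])


def download_with_proxy(url, headers=None):
    proxy = next(proxy_pool)                # take the next proxy in the pool
    try:
        resp = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        return resp.text if resp.status_code == 200 else None
    except requests.RequestException:
        return None                         # bad proxy or timeout: the caller can retry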

For websites that still can't be crawled this way, you can use Selenium + PhantomJS to simulate normal human operation and obtain the information (a sketch follows).
Python crawler from getting started to giving up (VIII): the use of the Selenium library (reprint)
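A minimal Selenium sketch is shown below. Note that recent Selenium releases have dropped PhantomJS support, so headless Chrome is used here instead; the URL is a placeholder.

# Render the page in a headless browser and grab the resulting HTML
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')               # run without opening a browser window

driver = webdriver.Chrome(options=options)
driver.get('http://www.xxx.com/book/1/1000/')    # placeholder URL
html = driver.page_source                        # fully rendered HTML, including JS-generated content
driver.quit()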

URL generator:

# Analyse the target site and find that its URL structure is 'http://www.xxx.com/book/a/a+b/'
# Write the corresponding URL generator
# For how generators work, look up Python iterators and generators; leave a comment if a separate post explaining them is needed


def create_url(numb):
    a, b = 1, 0
    n = 1000
    while a <= numb:
        # print(a)

        # The yield keyword turns this function into a generator:
        # each time the generator is advanced, execution pauses here and
        # the formatted URL is returned, resuming from this point next time
        yield "http://www.xxx.com/book/%d/%d%03d/" % (a, a, b)
        if b < n:
            b += 1
        else:
            b = 0
            a += 1


# Create a generator object
obj = create_url(22)
# Each call to next() produces one URL, for example:
# print(next(obj))
# for i in range(10):
#     print(next(obj))

Writing the download function:

# Write the download function
# Input: URL; output: HTML text
import requests


def download_url(url):
    # Set the headers dictionary and add cookies as needed
    headers = {'Host': 'www.ybdu.com',
               'Connection': 'keep-alive',
               'Cache-Control': 'max-age=0',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'zh-CN,zh;q=0.9',
               # 'Cookie':...
               }
    response = requests.get(url=url, headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        return "%s url:%s" % (str(response.status_code), url)

2. Analysis module

Use regular expressions and XPath to extract the page content.

To be updated
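Since this section is still marked as to be updated, here is a minimal sketch of the intended approach; the XPath expressions and the regular expression are placeholders that must be adapted to the real page structure:

# Sketch of the analysis module: extract links and content with XPath plus a regex
# (the XPath expressions and the regex are placeholders for the real site)
import re

from lxml import etree


def parse_page(url, html):
    tree = etree.HTML(html)
    # 1) extract new URLs to feed back to the download module
    links = tree.xpath('//a/@href')
    # 2) extract the content to hand to the write module
    title = tree.xpath('//h1/text()')
    body = tree.xpath('//div[@id="content"]//text()')   # placeholder container id
    # a regex can serve as a fallback for details XPath misses
    chapter_no = re.findall(r'(\d+)\.html?', url)
    item = {'url': url,
            'title': title[0].strip() if title else '',
            'chapter': chapter_no[0] if chapter_no else '',
            'text': ''.join(body).strip()}
    return links, item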

3. Write module

Write the data to the database using the pymysql module or another database driver.

To be updated
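This section is also still to be updated; the sketch below shows the general idea with pymysql (the connection parameters and the table and column names are placeholders):

# Sketch of the write module using pymysql
# (host, credentials, database, table and column names are placeholders)
import pymysql


def write_result(item):
    conn = pymysql.connect(host='127.0.0.1', port=3306,
                           user='user', password='password',
                           database='article', charset='utf8mb4')
    try:
        with conn.cursor() as cursor:
            sql = 'INSERT INTO chapters (url, title, text) VALUES (%s, %s, %s)'
            cursor.execute(sql, (item['url'], item['title'], item['text']))
        conn.commit()
    finally:
        conn.close()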

4. Main function

Combine the three modules into a complete crawler.

To be updated
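Also still to be updated; a minimal single-process sketch that chains the pieces above could look like this (parse_page and write_result are the hypothetical helpers sketched in sections 2 and 3):

# Minimal single-process main loop tying the modules together
# (parse_page and write_result are the sketches from sections 2 and 3)
def main():
    for url in create_url(22):              # URL generator from section 1
        html = download_url(url)            # download function from section 1
        # download_url returns a '<status code> url:<url>' string on failure
        if html.startswith(('4', '5')):
            print('download failed:', html)
            continue
        links, item = parse_page(url, html) # new links could be pushed back into a queue
        write_result(item)


if __name__ == '__main__':
    main()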

5. Database sharing

-- Database name: article, privileges: select
-- Hosted on a slow domestic (China) connection, use with care
mysql -h 27.50.142.39 -P 3456 -u wangler2333 -p wangler2333

-- Backup SQL file download address:
http://35.187.144.52/article.tar.gz
