Basic introduction to Scrapy crawler framework

Keywords: Python crawler

Scrapy crawler framework

We have now reached the most professional part of web crawler learning: a dedicated crawler framework. Let's meet the new challenge with fresh energy.

1. Installation

pip install scrapy

After installation, run the following command to verify that it works:

scrapy -h

If the command prints its help information, the installation was successful.

2. Brief description of the Scrapy framework

Scrapy is a crawler framework: a collection of software structures and functional components that helps users implement professional web crawlers.

How should we understand this? The framework consists of several small components that cooperate to form a data flow, and together they make up one large collection of components. The following figure shows the components of the framework:

To implement a crawler here, we mainly need to write the Spider and Item Pipeline components, because the Engine, Downloader, and Scheduler are already implemented by the framework.

Writing a crawler with Scrapy is therefore essentially a matter of configuring the framework.
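As a hint of what that configuration looks like, here is a minimal, hypothetical Item Pipeline sketch (the class name and the filtering rule are illustrative, not part of the original example):

import scrapy
from scrapy.exceptions import DropItem


class DemoPipeline:
    # Called once for every item the spider yields
    def process_item(self, item, spider):
        # Drop items with no title; pass everything else on unchanged
        if not item.get('title'):
            raise DropItem('missing title')
        return item

A pipeline like this would then be enabled in settings.py, for example with ITEM_PIPELINES = {'python123demo.pipelines.DemoPipeline': 300}.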

3. Requests and Scrapy

Similarities:

  • Both can send requests and crawl web pages
  • Both are easy to use and have clear, concise documentation
  • Neither provides built-in support for rendering JavaScript, submitting forms, or handling CAPTCHAs

Differences:

requests                                    | Scrapy
--------------------------------------------|--------------------------------------------------
Page-level crawler                          | Site-level crawler
Function library                            | Framework
Poor concurrency support, lower efficiency  | Good concurrency, higher efficiency
Focus is on page download                   | Focus is on crawler structure
Highly flexible customization               | Customization is flexible within the framework
Easy to get started                         | Slightly harder to get started
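To make the contrast concrete, fetching a single page with requests looks like this (a page-level sketch, using the demo page that appears in the example later in this article):

import requests

# requests handles one request/response pair; scheduling and concurrency are up to you
r = requests.get('http://python123.io/ws/demo.html')
r.raise_for_status()
print(r.text)

Scrapy, by contrast, organizes many such requests into a scheduled, concurrent crawl.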

4. Common Scrapy commands

The general format of a Scrapy command line is as follows:

scrapy <command> [options] [args]

Here, command is one of the commonly used commands, including the following:

command      | explanation                               | format
-------------|-------------------------------------------|---------------------------------------------
startproject | Create a new crawler project              | scrapy startproject <name> [dir]
genspider    | Create a spider inside a project          | scrapy genspider [options] <name> <domain>
settings     | Get crawler configuration information     | scrapy settings [options]
crawl        | Run a spider                              | scrapy crawl <spider>
list         | List all spiders in the project           | scrapy list
shell        | Start an interactive URL debugging shell  | scrapy shell [url]
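For example, the shell command opens an interactive session with the response already fetched. A quick sketch, using the demo page from later in this article (the output assumes the page is still reachable and a recent Scrapy version that provides .get()):

scrapy shell http://python123.io/ws/demo.html

>>> response.status
200
>>> response.css('title::text').get()
'This is a python demo page'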

A project is a larger framework: many spiders can be placed inside one project, and each spider acts as an individual crawler within it.

Scrapy is also designed for automated, scripted operation; its focus is on functionality rather than interactive use.

5. Examples of Scrapy

1. Build a crawler project:
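The project is created with the startproject command (the project name python123demo matches the example described below):

scrapy startproject python123demo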

The generated file directory is shown below.
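A sketch of the typical layout (the exact files may vary slightly with the Scrapy version; the original article showed this as a screenshot):

python123demo/
    scrapy.cfg            # deployment configuration file
    python123demo/        # user-defined Python package (same name as the project)
        __init__.py
        items.py          # Item definitions
        middlewares.py    # middleware definitions
        pipelines.py      # Item Pipeline code
        settings.py       # crawler configuration
        spiders/          # directory holding the project's spiders
            __init__.py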

The project name is python123demo. The next layer contains the deployment configuration file scrapy.cfg and a user-defined Python package, which normally has the same name as the project.

Inside that package are the Python files listed above (items.py, pipelines.py, settings.py, and so on), which correspond to several of the framework's functional components.

2. Create a crawler
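The spider itself is generated with the genspider command, run from inside the project directory (the spider name demo and the domain python123.io follow the generated code shown below):

cd python123demo
scrapy genspider demo python123.io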

The generated content is as follows:

import scrapy
class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/']

    def parse(self, response):
        pass

Of course, we can also write this spider file by hand instead of generating it with the command.

3. Configure the crawler

For the spider created above, we now modify it to fetch an HTML page and save it to a local HTML file.

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    #allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    # Parse the response and save the page body to a local file
    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s.' % fname)

Then run the spider:

scrapy crawl demo

The content of the saved demo.html file is as follows:

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

6. yield

yield turns a function into a generator: a function that produces values one at a time. Any function that contains a yield statement is a generator.

Each time the generator yields a value, the function is frozen at that point; it is awakened again when the next value is requested and continues from where it left off.

def gen(n):
    # Yield the squares of 0, 1, ..., n-1 one at a time
    for i in range(n):
        yield i ** 2

The function above generates the squares of the integers less than n. Because the yield statement sits inside the loop, the generator returns one value per iteration instead of building the whole sequence at once.
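A quick usage sketch, consuming the generator with a for loop:

for value in gen(5):
    print(value)   # prints 0, 1, 4, 9, 16, one per line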

Advantages of generators:

1. They use far less storage space (see the sketch after this list)

2. They respond more quickly, since values are produced on demand

3. They are flexible to use
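A minimal sketch of the storage-space advantage, comparing the size of a fully built list with the size of a generator object (the exact byte counts depend on the Python version):

import sys

def gen(n):
    for i in range(n):
        yield i ** 2

squares_list = [i ** 2 for i in range(1000000)]   # all one million values stored at once
squares_gen = gen(1000000)                        # values will be produced on demand

print(sys.getsizeof(squares_list))  # several megabytes for the list object
print(sys.getsizeof(squares_gen))   # only a couple of hundred bytes for the generator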

Summary

The above is an introduction to the basic concepts of Scrapy. The practical part still has to come from your own experience and accumulation. Thank you for reading.

Posted by cjl on Thu, 02 Dec 2021 18:26:02 -0800