Scrapy source code analysis 5

Keywords: Python

2021SC@SDUSC

3.3 ExecutionEngine

Let's continue with open_spider:

# scrapy/core/engine.py

class ExecutionEngine(object):
    # ...

    @defer.inlineCallbacks
    def open_spider(self, spider, start_requests=(), close_if_idle=True):
        assert self.has_capacity(), "No free spider slot when opening %r" % \
            spider.name
        logger.info("Spider opened", extra={'spider': spider})
        # Wrap _next_request so that repeated schedule() calls collapse into one
        nextcall = CallLaterOnce(self._next_request, spider)
        # Build the scheduler from the crawler's settings
        scheduler = self.scheduler_cls.from_crawler(self.crawler)
        # Let the spider middlewares process the initial requests
        start_requests = yield self.scraper.spidermw.process_start_requests(start_requests, spider)
        slot = Slot(start_requests, close_if_idle, nextcall, scheduler)
        self.slot = slot
        self.spider = spider
        yield scheduler.open(spider)
        yield self.scraper.open_spider(spider)
        self.crawler.stats.open_spider(spider)
        yield self.signals.send_catch_log_deferred(signals.spider_opened, spider=spider)
        # Kick off the first _next_request call and start the 5-second heartbeat
        slot.nextcall.schedule()
        slot.heartbeat.start(5)
  • First, has_capacity() checks that the engine has a free slot; one engine can only run one spider at a time (the check itself is a one-liner, shown below)
  • Then a number of attributes are initialized and started. They fall into three groups: the attributes held by the Slot, the attributes of the crawler, and the attributes that live directly on the engine
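
For reference, has_capacity is a one-liner in the same file (quoted from Scrapy 1.x; it may differ slightly in other versions):

# scrapy/core/engine.py

class ExecutionEngine(object):
    # ...

    def has_capacity(self):
        """Does the engine have capacity to handle more spiders"""
        return not bool(self.slot)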

First, let's understand how the CallLaterOnce used above works:

# scrapy/utils/reactor.py

from twisted.internet import reactor


class CallLaterOnce(object):

    def __init__(self, func, *a, **kw):
        self._func = func
        self._a = a
        self._kw = kw
        self._call = None

    def schedule(self, delay=0):
        if self._call is None:
            self._call = reactor.callLater(delay, self)

    def cancel(self):
        if self._call:
            self._call.cancel()

    def __call__(self):
        self._call = None
        return self._func(*self._a, **self._kw)

The purpose of this class is to wrap reactor.callLater: it can be scheduled any number of times, but within a given window only one task is actually added to the event loop. If callLater were used directly, many duplicate tasks would be added to the event loop, hurting efficiency. Look at an example:

def f(arg):
    pass  # requests and other I/O operations

nextcall = CallLaterOnce(f, 'some arg')
nextcall.schedule(5)
nextcall.schedule(5)
nextcall.schedule(5)

Although nextcall is scheduled three times, only the first schedule takes effect, and in the end only one task is added to the event loop. Scrapy calls _next_request from many places, and its internal concurrency makes it inevitable that the method is invoked several times within a short interval; CallLaterOnce coalesces those calls to reduce the processing pressure.
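
To see this behavior concretely, here is a small self-contained demo (not Scrapy code; it assumes the CallLaterOnce class above is in scope):

from twisted.internet import reactor

def f(arg):
    print('f called with', arg)

nextcall = CallLaterOnce(f, 'some arg')
nextcall.schedule(1)
nextcall.schedule(1)
nextcall.schedule(1)

reactor.callLater(2, reactor.stop)
reactor.run()   # prints "f called with some arg" exactly once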

Now look at the Slot constructor:

# scrapy/core/engine.py

class Slot(object):

    def __init__(self, start_requests, close_if_idle, nextcall, scheduler):
        self.closing = False
        self.inprogress = set() # requests in progress
        self.start_requests = iter(start_requests)
        self.close_if_idle = close_if_idle
        self.nextcall = nextcall
        self.scheduler = scheduler
        self.heartbeat = task.LoopingCall(nextcall.schedule)
  • inprogress, the set of requests currently being processed
  • start_requests, the crawler's initial requests
  • nextcall, the CallLaterOnce wrapping the _next_request method
  • scheduler, which stores the requests waiting to be crawled
  • heartbeat, a LoopingCall that schedules _next_request periodically (every 5 seconds, given the heartbeat.start(5) in open_spider); see the small demo after this list
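
The heartbeat is built on twisted.internet.task.LoopingCall. A minimal illustration of that primitive (a standalone demo, not Scrapy code):

from twisted.internet import task, reactor

def beat():
    print('tick')

heartbeat = task.LoopingCall(beat)
heartbeat.start(5)                   # runs beat() now, then every 5 seconds
reactor.callLater(16, reactor.stop)
reactor.run()                        # prints "tick" four times

In the engine, the callable is nextcall.schedule, so _next_request gets a periodic nudge even when no other event triggers it.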

Take a brief look at the Scheduler constructor:

# scrapy/core/scheduler.py

class Scheduler(object):

    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None, crawler=None):
        self.df = dupefilter
        self.dqdir = self._dqdir(jobdir)
        self.pqclass = pqclass
        self.dqclass = dqclass
        self.mqclass = mqclass
        self.logunser = logunser
        self.stats = stats
        self.crawler = crawler
  • df, the duplicate-request filter (dupefilter)
  • dqdir, the persistence directory derived from the JOBDIR we specify when starting the crawler
  • pqclass, the priority-queue class
  • dqclass and mqclass, the disk-queue and memory-queue classes respectively; the disk queue is used mainly when JOBDIR is specified
  • stats, the stats collector; it is created in the Crawler constructor and then passed down here level by level

Which concrete classes these attributes are bound to can be seen from the Scheduler's from_crawler method, sketched below. We won't expand here on more detailed operations such as how requests are enqueued and dequeued; those interested can read through the class.
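
For reference, a paraphrased sketch of that factory method (based on Scrapy 1.x/2.x; the setting names are real Scrapy settings, but the exact code varies by version):

# scrapy/core/scheduler.py (abridged)

from scrapy.utils.misc import load_object
from scrapy.utils.job import job_dir

class Scheduler(object):
    # ...

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        dupefilter = dupefilter_cls.from_settings(settings)
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
        logunser = settings.getbool('SCHEDULER_DEBUG')
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass,
                   mqclass=mqclass, crawler=crawler)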
