2021SC@SDUSC
3.3 ExecutionEngine
Let's first look at open_spider:
```python
# scrapy/core/engine.py
class ExecutionEngine(object):
    # ...
    @defer.inlineCallbacks
    def open_spider(self, spider, start_requests=(), close_if_idle=True):
        assert self.has_capacity(), "No free spider slot when opening %r" % \
            spider.name
        logger.info("Spider opened", extra={'spider': spider})
        nextcall = CallLaterOnce(self._next_request, spider)
        scheduler = self.scheduler_cls.from_crawler(self.crawler)
        start_requests = yield self.scraper.spidermw.process_start_requests(start_requests, spider)
        slot = Slot(start_requests, close_if_idle, nextcall, scheduler)
        self.slot = slot
        self.spider = spider
        yield scheduler.open(spider)
        yield self.scraper.open_spider(spider)
        self.crawler.stats.open_spider(spider)
        yield self.signals.send_catch_log_deferred(signals.spider_opened, spider=spider)
        slot.nextcall.schedule()
        slot.heartbeat.start(5)
```
- First, check whether the engine has a free slot: one engine can only run one spider at a time.
- Then come the initialization and startup of several attributes, which fall into three groups: attributes on the Slot, attributes on the Crawler, and attributes directly on the engine.
First, let's understand the CallLaterOnce class used above:
```python
# scrapy/utils/reactor.py
class CallLaterOnce(object):
    def __init__(self, func, *a, **kw):
        self._func = func
        self._a = a
        self._kw = kw
        self._call = None

    def schedule(self, delay=0):
        if self._call is None:
            self._call = reactor.callLater(delay, self)

    def cancel(self):
        if self._call:
            self._call.cancel()

    def __call__(self):
        self._call = None
        return self._func(*self._a, **self._kw)
```
The purpose of this class is to wrap reactor.callLater. The difference is that it can be scheduled multiple times, yet only one task will actually be queued on the event loop within a given window. If callLater were used directly, the same task could be added to the event loop many times, hurting execution efficiency. Consider an example:
```python
def f(arg):
    ...  # Request and other IO operations

nextcall = CallLaterOnce(f, 'some arg')
nextcall.schedule(5)
nextcall.schedule(5)
nextcall.schedule(5)
```
Although nextcall is scheduled several times here, only the first schedule takes effect, so only one task is ultimately added to the event loop. Scrapy calls _next_request from many places, and since there are concurrent operations inside, it is inevitable that this method is invoked several times within a short period. CallLaterOnce is used here to reduce that processing pressure.
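To see this dedup behaviour without running a real Twisted reactor, here is a minimal self-contained sketch; the FakeReactor class below is a stand-in for twisted.internet.reactor, introduced purely for illustration:

```python
# Stand-in for twisted.internet.reactor: records queued tasks so we
# can inspect how many actually got scheduled.
class FakeReactor:
    def __init__(self):
        self.pending = []

    def callLater(self, delay, func):
        self.pending.append(func)
        return func

    def run_all(self):
        tasks, self.pending = self.pending, []
        for task in tasks:
            task()

reactor = FakeReactor()

# Same logic as scrapy/utils/reactor.py's CallLaterOnce, minus cancel().
class CallLaterOnce:
    def __init__(self, func, *a, **kw):
        self._func = func
        self._a = a
        self._kw = kw
        self._call = None

    def schedule(self, delay=0):
        # Only queue a new task if none is already pending.
        if self._call is None:
            self._call = reactor.callLater(delay, self)

    def __call__(self):
        self._call = None
        return self._func(*self._a, **self._kw)

calls = []
nextcall = CallLaterOnce(calls.append, 'some arg')
nextcall.schedule(5)
nextcall.schedule(5)
nextcall.schedule(5)
reactor.run_all()
print(calls)  # only one call ran, despite three schedules
```

After the pending task fires, `self._call` is reset to None, so the next schedule() is accepted again; that is how _next_request keeps getting re-scheduled throughout a crawl.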
Look at the Slot constructor
```python
# scrapy/core/engine.py
class Slot(object):
    def __init__(self, start_requests, close_if_idle, nextcall, scheduler):
        self.closing = False
        self.inprogress = set()  # requests in progress
        self.start_requests = iter(start_requests)
        self.close_if_idle = close_if_idle
        self.nextcall = nextcall
        self.scheduler = scheduler
        self.heartbeat = task.LoopingCall(nextcall.schedule)
```
- inprogress: the set of requests currently being processed
- start_requests: the crawler's initial requests
- nextcall: the CallLaterOnce wrapper around _next_request
- scheduler: stores the requests waiting to be crawled
- heartbeat: a LoopingCall that executes _next_request at a fixed interval
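The heartbeat is just a twisted.internet.task.LoopingCall that keeps nudging nextcall.schedule(), started in open_spider with slot.heartbeat.start(5). A rough synchronous analogue, where MiniLoopingCall is an illustrative stand-in (not Twisted's implementation, which runs on the reactor clock):

```python
# Illustrative synchronous stand-in for twisted.internet.task.LoopingCall:
# invokes `func` once per simulated interval, mimicking how Slot.heartbeat
# re-triggers nextcall.schedule() every 5 seconds.
class MiniLoopingCall:
    def __init__(self, func):
        self._func = func
        self._running = False

    def start(self, interval, ticks):
        # The real LoopingCall takes only the interval and runs forever
        # on the reactor; `ticks` here just bounds the simulation.
        self._running = True
        for _ in range(ticks):
            if not self._running:
                break
            self._func()

    def stop(self):
        self._running = False

scheduled = []
heartbeat = MiniLoopingCall(lambda: scheduled.append('tick'))
heartbeat.start(5, ticks=3)  # pretend three 5-second intervals elapsed
print(scheduled)             # ['tick', 'tick', 'tick']
```

This periodic nudge is what keeps the engine polling the scheduler even when no downloader callback happens to re-schedule _next_request.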
Take a brief look at the construction method of Scheduler
```python
# scrapy/core/scheduler.py
class Scheduler(object):
    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None, crawler=None):
        self.df = dupefilter
        self.dqdir = self._dqdir(jobdir)
        self.pqclass = pqclass
        self.dqclass = dqclass
        self.mqclass = mqclass
        self.logunser = logunser
        self.stats = stats
        self.crawler = crawler
```
- df: the duplicate filter (dupefilter), which drops requests that have already been seen
- dqdir: the persistence directory (jobdir) specified when starting the crawler
- pqclass: the priority-queue class
- dqclass and mqclass: the disk-queue and memory-queue classes respectively; the disk queue is mainly used when jobdir is specified
- stats: used to collect crawl statistics; it is created in the Crawler constructor and then passed down here level by level
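To make the dupefilter's role concrete, here is a simplified sketch of the idea; Scrapy's real RFPDupeFilter is more elaborate (canonicalized URLs, optional header fingerprinting, persistence to jobdir), and the hashing scheme below is an assumption for illustration:

```python
import hashlib

# Simplified duplicate filter: fingerprint each request by method and
# URL, and report whether that fingerprint was already seen. The
# scheduler consults this before enqueueing a request.
class SimpleDupeFilter:
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        fp = hashlib.sha1(f"{method} {url}".encode()).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

df = SimpleDupeFilter()
print(df.request_seen("GET", "http://example.com/"))  # False: first time
print(df.request_seen("GET", "http://example.com/"))  # True: duplicate, dropped
```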
Which concrete classes these attributes refer to can be seen in the Scheduler's from_crawler method. We won't expand on more detailed operations such as enqueueing and dequeueing here; those interested can read this class.
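The essence of from_crawler is reading dotted class paths out of the crawler settings and resolving them to real classes. A hedged sketch of that pattern, where the load_object helper and the settings dict are simplified stand-ins for Scrapy's own scrapy.utils.misc.load_object and Settings object:

```python
from importlib import import_module

# Resolve a class from its dotted path, the same pattern from_crawler
# uses to turn settings such as the scheduler queue classes into
# actual classes.
def load_object(path):
    module_path, _, name = path.rpartition('.')
    return getattr(import_module(module_path), name)

# Illustration: resolve a stdlib class the way from_crawler resolves
# the memory-queue class (the settings dict here is a stand-in).
settings = {'SCHEDULER_MEMORY_QUEUE': 'collections.deque'}
mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
q = mqclass()
q.append('request-1')
print(q.popleft())  # request-1
```

Because the classes arrive as settings strings, users can swap in their own dupefilter or queue implementations without touching the Scheduler code itself.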