day14~Incremental and Distributed
Yesterday's review:
1. Installation of redis
1. Unzip the installation package into a folder, such as D:\redis; all the Redis files will be in that folder.
2. Add this folder to the system environment variables (PATH).
3. Type cmd in the address bar of the unzipped directory, run redis-server ./redis.windows.conf in the cmd window, and press Enter. If the Redis startup banner appears, the installation succeeded.
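Optionally, the running server can also be checked from Python with redis-py (a minimal sketch, assuming redis-py is installed and Redis listens on the default port):

```python
from redis import Redis

# Connect to the local Redis server on the default port (assumed: 127.0.0.1:6379)
conn = Redis('127.0.0.1', 6379)
print(conn.ping())  # prints True when the server is up and reachable
```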
2. CrawlSpider-based data crawling
```
# Project creation:
scrapy startproject projectname
scrapy genspider -t crawl spidername www.baidu.com
```
```
# CrawlSpider-based data crawling:
    - CrawlSpider is a crawler class, a subclass of scrapy.Spider, and more powerful than Spider
    - CrawlSpider's mechanism:
        - Link extractor (LinkExtractor): extracts links from responses according to specified rules
        - Rule parser (Rule): parses the responses of the extracted links according to specified rules
```
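A minimal skeleton illustrating the two mechanisms above; this is only a sketch, and the spider name, URL pattern, and callback are placeholders rather than part of the case below:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SkeletonSpider(CrawlSpider):
    name = 'skeleton'                      # hypothetical spider name
    start_urls = ['https://example.com/']  # placeholder start page

    rules = (
        # LinkExtractor pulls every link matching the regex from each response;
        # Rule sends each extracted page to the callback, and follow=True keeps
        # feeding those pages back through the extractor (deep crawling).
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Parse the matched page here and yield items or further requests
        yield {'url': response.url}
```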
# Case study: use CrawlSpider to deep-crawl the joke site (xiaohua.zol.com.cn), grabbing each joke's title and content and storing them in MongoDB
```python
# item encoding:
import scrapy


class JokeItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
```
```python
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import JokeItem


class ZSpider(CrawlSpider):
    name = 'z'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://xiaohua.zol.com.cn/lengxiaohua/']

    # List pages and joke detail pages
    link = LinkExtractor(allow=r'/lengxiaohua/\d+\.html')
    link_detail = LinkExtractor(allow=r'.*?\d+\.html')

    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_item(self, response):
        pass

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="article-title"]/text()').extract_first()
        content = response.xpath('//div[@class="article-text"]//text()').extract()
        content = ''.join(content)
        if title and content:
            item = JokeItem()
            item["title"] = title
            item["content"] = content
            print(dict(item))
            yield item
```
```python
# pipeline coding:
import pymongo


class JokePipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert() is deprecated in pymongo; insert_one() is the current API
        self.db["joke"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```
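The pipeline reads MONGO_URI and MONGO_DB from the Scrapy settings and must itself be enabled, so settings.py needs entries roughly like the following (the URI, database name, and pipeline path are placeholders based on the project name used above):

```python
# settings.py (sketch; values are placeholders)
ITEM_PIPELINES = {
    'projectname.pipelines.JokePipeline': 300,  # assumes the default project layout
}
MONGO_URI = 'mongodb://127.0.0.1:27017'  # your MongoDB server
MONGO_DB = 'spider_db'                   # hypothetical database name
```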
# Film Paradise (ygdy8.net): deep-crawl the whole site for movie names and download links
```python
# item encoding (defines the storage fields):
import scrapy


class BossItem(scrapy.Item):
    title = scrapy.Field()
    downloadlink = scrapy.Field()
```
```python
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import BossItem


class BSpider(CrawlSpider):
    name = 'mv'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.ygdy8.net/html/gndy/oumei/index.html']

    # List pages and movie detail pages
    link = LinkExtractor(allow=r'list.*?html')
    link_detail = LinkExtractor(allow=r'.*?/\d+\.html')

    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail', follow=True),
    )

    def parse_item(self, response):
        pass

    def parse_detail(self, response):
        title = response.xpath('//h1//text()').extract_first()
        downloadlink = response.xpath('//tbody/tr/td/a/text()').extract_first()
        if title and downloadlink and 'ftp' in downloadlink:
            item = BossItem()
            item['title'] = title
            item['downloadlink'] = downloadlink
            yield item
```
```python
# pipelines coding:
import pymongo


class MvPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db["mv"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```
3. Distributed Crawling
```
# Distributed concept:
    - Multiple machines form a cluster, each running the same crawler program, and they crawl the data jointly.
# Why native scrapy cannot be distributed:
    - The scheduler in native scrapy cannot be shared across machines
    - The pipelines in native scrapy cannot be shared across machines
# How to make scrapy distributed:
    - Give the native scrapy framework a shared scheduler and shared pipelines
    - pip install scrapy_redis
```
```
- 1. Create the project:        scrapy startproject projectname
- 2. Create the crawler file:   scrapy genspider -t crawl spidername www.baidu.com
- 3. Modify the crawler file:
    - 3.1 Import the package: from scrapy_redis.spiders import RedisCrawlSpider
    - 3.2 Change the spider's parent class to RedisCrawlSpider
    - 3.3 Comment out allowed_domains and start_urls
    - 3.4 Add the new attribute redis_key = 'xxx', i.e. the name of the shared scheduler queue
    - 3.5 Parse the data, encapsulate it into an item, and submit it to the pipeline
- 4. Prepare the configuration file:
    - 4.1 Specify the pipeline:
        ITEM_PIPELINES = {
            'scrapy_redis.pipelines.RedisPipeline': 400
        }
    - 4.2 Specify the scheduler:
        # De-duplication container class that stores request fingerprints in a Redis set, so request de-duplication is persisted
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        # Use the scheduler provided by the scrapy-redis component
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        # Scheduler persistence: if True, the request queue and fingerprint set in Redis are not emptied when the crawler finishes
        SCHEDULER_PERSIST = True
    - 4.3 Specify the Redis server:
        REDIS_HOST = 'ip address of the redis service'
        REDIS_PORT = 6379
- 5. Modify the Redis configuration and start Redis with it:
    - #bind 127.0.0.1   (comment this line out)
    - protected-mode no
    - Start the redis service with the configuration file (redis-server ./redis.windows.conf), then the client (redis-cli)
- 6. Start the program: scrapy runspider xxx.py  (run from inside the spiders folder)
- 7. Push a start url into the scheduler queue from the redis client: lpush xxx www.xxx.com  (xxx is the value of redis_key), as in the example after this list
```
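Step 7 can be done from redis-cli, or from Python with redis-py; a minimal sketch, using the redis_key 'urlscheduler' and the start URL from the case below (the Redis address is the one from the settings example):

```python
from redis import Redis

# Connect to the Redis server shared by all crawler machines (address is a placeholder)
conn = Redis('192.168.12.198', 6379)

# lpush the first URL into the queue named after redis_key; every idle
# RedisCrawlSpider instance pops from this queue and starts crawling
conn.lpush('urlscheduler', 'http://wz.sun0769.com/index.php/question/questionType?type=4')
```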
# Case: crawl complaint information from the Sunshine Hotline politics platform
# Website: http://wz.sun0769.com/index.php/question/questionType?type=4
```python
# items encoding:
import scrapy


class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
```
```python
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider

from fbspro.items import FbsproItem


class TestSpider(RedisCrawlSpider):
    name = 'test'
    # allowed_domains = ['ww.baidu.com']
    # start_urls = ['http://ww.baidu.com/']

    # Name of the shared scheduler queue in Redis
    redis_key = 'urlscheduler'

    link = LinkExtractor(allow=r'.*?&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        a_lst = response.xpath('//a[@class="news14"]')
        for a in a_lst:
            title = a.xpath('./text()').extract_first()
            # print(title)
            item = FbsproItem()
            item['title'] = title
            yield item
```
```python
# settings configuration code:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 3

ITEM_PIPELINES = {
    # 'fbspro.pipelines.FbsproPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# De-duplication container class that stores request fingerprints in a Redis set, so request de-duplication is persisted
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Scheduler persistence: if True, the request queue and fingerprint set in Redis are not emptied when the crawler finishes
SCHEDULER_PERSIST = True

# redis configuration
REDIS_HOST = '192.168.12.198'
REDIS_PORT = 6379
```
4. Incremental Crawling
```
# Concept:
    - Detect website data updates and crawl only the newly added or updated content
    - Core: de-duplication
        - by url
        - by data fingerprint
```
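Both forms of de-duplication rely on the return value of Redis sadd: 1 means the value is new (crawl it), 0 means it has been seen before (skip it). A minimal sketch, assuming a local Redis server; the key and function names are illustrative:

```python
import hashlib

from redis import Redis

conn = Redis('127.0.0.1', 6379)


# URL-based de-duplication: record every detail-page URL in a set
def is_new_url(url):
    return conn.sadd('seen_urls', url) == 1  # 1 -> first time this url is seen


# Fingerprint-based de-duplication: hash the record content itself,
# useful when the same URL can serve changing content
def is_new_record(content):
    fp = hashlib.md5(content.encode('utf-8')).hexdigest()
    return conn.sadd('seen_fps', fp) == 1
```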
# Incremental crawler: crawl movie names and movie types
# url: https://www.4567tv.co/list/index1.html
```python
# items encoding:
import scrapy


class MvproItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
```
```python
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis

from mvpro.items import MvproItem


class MoveSpider(CrawlSpider):
    # Redis connection used for url-based de-duplication
    conn = Redis('127.0.0.1', 6379)

    name = 'move'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.4567tv.co/list/index1.html']

    link = LinkExtractor(allow=r'/list/index1-\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        li_list = response.xpath('//div[contains(@class, "index-area")]/ul/li')
        for li in li_list:
            mv_link = 'https://www.4567tv.co' + li.xpath('./a/@href').extract_first()
            # sadd returns 1 only if this url was not in the set yet
            ex = self.conn.sadd('mv_link', mv_link)
            if ex:
                print('New data can be crawled..........................')
                yield scrapy.Request(url=mv_link, callback=self.parse_detail)
            else:
                print('No new data to crawl!!!!!!!!!!!!!!!!!!!!!!!!!')

    def parse_detail(self, response):
        title = response.xpath('//dt[@class="name"]/text()').extract_first()
        pro = response.xpath('//div[@class="ee"]/text()').extract_first()
        item = MvproItem()
        item['title'] = title
        item['position'] = pro
        yield item
```
# Requirement: incremental crawling based on data fingerprints, grabbing the text posts of qiushibaike
```python
# spider coding:
import hashlib

import scrapy
from redis import Redis

from qiubai.items import QiubaiItem


class QbSpider(scrapy.Spider):
    # Redis connection used for fingerprint-based de-duplication
    conn = Redis('127.0.0.1', 6379)

    name = 'qb'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            content = div.xpath('./a[1]/div[@class="content"]/span[1]/text()').extract_first()
            # Fingerprint of the post content; sadd returns 1 only for unseen fingerprints
            fp = hashlib.md5(content.encode('utf-8')).hexdigest()
            ex = self.conn.sadd('fp', fp)
            if ex:
                print('Update data needs to be crawled........................')
                item = QiubaiItem()
                item['content'] = content
                yield item
            else:
                print('No data updates!!!!!!!!!!!!!!!!!!!!!!!!')
```
5. Improving scrapy's data crawling efficiency
1. Increase concurrency: Scrapy's default number of concurrent requests (CONCURRENT_REQUESTS) is 16, which can be increased appropriately. In the settings file set, for example, CONCURRENT_REQUESTS = 100 for 100 concurrent requests.
2. Reduce the log level: running scrapy produces a lot of log output; to reduce CPU usage, set the log level to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'
3. Disable cookies: if cookies are not actually needed, disable them while crawling to reduce CPU usage and improve efficiency. In the settings file: COOKIES_ENABLED = False
4. Disable retries: re-requesting (retrying) failed HTTP requests slows crawling down, so retries can be disabled. In the settings file: RETRY_ENABLED = False
5. Reduce the download timeout: when crawling very slow links, a smaller download timeout makes stuck requests get abandoned quickly, improving efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 (timeout of 10 s)
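Collected in one place, the corresponding settings.py entries would look roughly like this (example values taken from the list above; tune them per project):

```python
# settings.py - efficiency-related options (example values from the notes above)
CONCURRENT_REQUESTS = 100   # raise concurrency from the default of 16
LOG_LEVEL = 'INFO'          # or 'ERROR'; less logging means less CPU spent on output
COOKIES_ENABLED = False     # skip cookie handling when it is not needed
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 10       # give up on slow responses after 10 seconds
```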
6. Virtual Environment
```
# Installation:
pip install virtualenvwrapper-win
```
```
# Common commands:
mkvirtualenv envname                              # create a virtual environment and switch to it automatically
workon envname                                    # switch to a virtual environment
pip list                                          # list the packages installed in the current environment
rmvirtualenv envname                              # delete a virtual environment
deactivate                                        # exit the current virtual environment
lsvirtualenv                                      # list all virtual environments
mkvirtualenv --python=C:\...\python.exe envname   # create a virtual environment with a specific Python interpreter
```