Incremental and Distributed

Keywords: Redis encoding Windows pip

Day 14: Incremental and Distributed

Yesterday's review:

1. Installation of redis

1. Unzip the installation package into a folder, such as D:\redis, where you will see all the Redis files
 2. Add this folder to the system PATH environment variable
 3. Type cmd in the address bar of the extracted directory to open a command window there, then run redis-server ./redis.windows.conf and press Enter. If the Redis startup banner appears, the installation is successful (you can also verify it with redis-cli, as sketched below).

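A quick check that the server is reachable, assuming Redis was started as above and redis-cli is on the PATH:

# in a second cmd window
redis-cli ping       # should print PONG
redis-cli set k1 v1  # should print OK
redis-cli get k1     # should print "v1"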

2. CrawlSpider-based data crawling

# Project creation
scrapy startproject projectname
scrapy genspider -t crawl spidername www.baidu.com
# crawlspider data crawling:
- CrawlSpider is a crawler class, a subclass of scrapy.Spider, with more functionality than a plain Spider.
- CrawlSpider's mechanisms:
    - Link extractor (LinkExtractor): extracts links from responses according to the specified rules
    - Rule: binds a link extractor to a callback and decides whether the extracted links are followed and how the responses are parsed
# Case study: use CrawlSpider to do a depth crawl of the joke site, grab each joke's title and content, and store them in MongoDB
# item encoding:
import scrapy
class JokeItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import JokeItem


class ZSpider(CrawlSpider):
    name = 'z'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://xiaohua.zol.com.cn/lengxiaohua/']
    link = LinkExtractor(allow=r'/lengxiaohua/\d+\.html')
    link_detail = LinkExtractor(allow=r'.*?\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_item(self, response):
        # the list pages are only used for link extraction; details are parsed in parse_detail
        pass

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="article-title"]/text()').extract_first()
        content = response.xpath('//div[@class="article-text"]//text()').extract()
        content = ''.join(content)

        if title and content:
            item = JokeItem()
            item["title"] = title
            item["content"] = content
            print(dict(item))
            yield item
# pipeline coding:
import pymongo


class JokePipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert_one replaces the deprecated insert() in current pymongo versions
        self.db["joke"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
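For the pipeline above to actually receive items, it must be registered, and the settings it reads in from_crawler must exist. A minimal settings.py sketch (the pipeline path, URI and database name are assumptions):

ITEM_PIPELINES = {
    'joke.pipelines.JokePipeline': 300,  # assumed project/module name
}
MONGO_URI = 'mongodb://127.0.0.1:27017'  # assumed local MongoDB
MONGO_DB = 'spider'                      # assumed database name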
# Film Paradise: full-site depth crawl of movie names and download links:
# item coding (defines the storage fields):
import scrapy


class BossItem(scrapy.Item):
    title = scrapy.Field()
    downloadlink = scrapy.Field()
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import BossItem


class BSpider(CrawlSpider):
    name = 'mv'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.ygdy8.net/html/gndy/oumei/index.html']
    link = LinkExtractor(allow=r'list.*?html')
    link_detail = LinkExtractor(allow=r'.*?/\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True,),
        Rule(link_detail, callback='parse_detail', follow=True,),
    )

    def parse_item(self, response):
        # the list pages are only used for link extraction; details are parsed in parse_detail
        pass

    def parse_detail(self, response):
        title = response.xpath('//h1//text()').extract_first()
        downloadlink = response.xpath('//tbody/tr/td/a/text()').extract_first()
        if title and downloadlink and 'ftp' in downloadlink:
            item = BossItem()
            item['title'] = title
            item['downloadlink'] = downloadlink
            yield item
# pipelines coding:
import pymongo


class MvPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db["mv"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

3. Distributed

# Distributed concepts:
Multiple machines form a distributed cluster; every machine in the cluster runs the same crawler program, and together they crawl the data jointly.

# Why native scrapy cannot be distributed:
	- The scheduler in native scrapy cannot be shared across machines
	- The pipelines in native scrapy cannot be shared across machines

# How to implement distributed crawling with scrapy_redis:
- Idea: provide a shared scheduler and a shared pipeline for the native scrapy framework
- pip install scrapy_redis
- 1. Create the project: scrapy startproject projectname
- 2. Create the crawler file: scrapy genspider -t crawl spidername www.baidu.com
- 3. Modify the crawler file:
	- 3.1 Import: from scrapy_redis.spiders import RedisCrawlSpider
	- 3.2 Change the parent class of the spider to RedisCrawlSpider
	- 3.3 Comment out allowed_domains and start_urls, and add a new attribute redis_key = 'qn' (the name of the shared scheduler queue)
	- 3.4 Do the data parsing, encapsulate the parsed data into an item, and submit it to the pipeline
- 4. Configuration file preparation:
	- 4.1 Specify the pipeline:
		ITEM_PIPELINES = {
			'scrapy_redis.pipelines.RedisPipeline': 400
		}
	- 4.2 Specify a scheduler:
		# Add a dedup filter configuration that stores request fingerprints in a Redis set, making request deduplication persistent.
		DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
		# Use the scheduler provided by the scrapy-redis component
		SCHEDULER = "scrapy_redis.scheduler.Scheduler"
		# Scheduler persistence: when the crawler finishes, do not clear the request queue and fingerprint set in Redis. True means persist (do not clear the data); False means clear them.
		SCHEDULER_PERSIST = True
	- 4.3 Specify the Redis server:
		REDIS_HOST = 'ip address of the Redis server'
		REDIS_PORT = 6379
- 5. Modify the Redis configuration and start Redis with that configuration (see the command sketch after this list):
	- Comment out: #bind 127.0.0.1
	- Set: protected-mode no
	- Start the Redis server with the configuration file (redis-server ./redis.windows.conf) and then the client (redis-cli)

- 6. Start the program: scrapy runspider xxx.py (run from inside the spiders folder)
- 7. Push a start url into the scheduler queue from the redis client: lpush xxx www.xxx.com
	(where xxx is the redis_key value)
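Put together, steps 5-7 look roughly like this on a Windows machine (a sketch; the spider file name and seed url are placeholders, and the queue name must match the spider's redis_key):

redis-server ./redis.windows.conf   # start the Redis server with the modified configuration
redis-cli                           # open the Redis client in another window
scrapy runspider spidername.py      # run the spider from inside the project's spiders folder
lpush qn http://www.xxx.com         # in redis-cli: seed the shared queue (qn = redis_key)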
# Case: crawl complaint information from the Sunshine Hotline politics platform
# Website: http://wz.sun0769.com/index.php/question/questionType?type=4
# items encoding:
import scrapy
class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fbspro.items import FbsproItem  
class TestSpider(RedisCrawlSpider):
    name = 'test'  
    # allowed_domains = ['ww.baidu.com']
    # start_urls = ['http://ww.baidu.com/']
    redis_key = 'urlscheduler'
    link = LinkExtractor(allow=r'.*?&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        a_lst = response.xpath('//a[@class="news14"]')
        for a in a_lst:
            title = a.xpath('./text()').extract_first()
            # print(title)
            item = FbsproItem()
            item['title'] = title
            yield item

# settings configuration code:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 3
ITEM_PIPELINES = {
   # 'fbspro.pipelines.FbsproPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# Add a dedup filter configuration that stores request fingerprints in a Redis set, making request deduplication persistent.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Scheduler persistence: when the crawler finishes, do not clear the request queue and fingerprint set in Redis. True means persist (do not clear the data); False means clear them.
SCHEDULER_PERSIST = True

# redis configuration
REDIS_HOST = '192.168.12.198'
REDIS_PORT = 6379
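With these settings in place, the crawl is started the same way as in step 7 above; for this spider the queue name is its redis_key, 'urlscheduler'. From the redis-cli window:

lpush urlscheduler http://wz.sun0769.com/index.php/question/questionType?type=4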

4. Incremental Crawling

# Concept:
	- Detecting website data updates, crawling only updated content
	- Core: deduplication
        - by url
        - by data fingerprint
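Both incremental cases below rely on the return value of Redis's sadd: it returns the number of elements actually added to the set, so 1 means the value is new and 0 means it has been seen before. A minimal sketch of that check on its own (assuming a local Redis and the redis-py package; the key and url are examples):

from redis import Redis

conn = Redis('127.0.0.1', 6379)
# sadd returns 1 the first time a value is added, 0 for a duplicate
if conn.sadd('seen_urls', 'https://www.example.com/1.html'):
    print('new url, crawl it')
else:
    print('already crawled, skip it')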
# Incremental Crawler: Crawling of Film Names and Film Types
# url: https://www.4567tv.co/list/index1.html
# items encoding:
import scrapy
class MvproItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
# spider coding:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from mvpro.items import MvproItem


class MoveSpider(CrawlSpider):
    conn = Redis('127.0.0.1', 6379)
    name = 'move'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.4567tv.co/list/index1.html']
    link = LinkExtractor(allow=r'/list/index1-\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        li_list = response.xpath('//div[contains(@class, "index-area")]/ul/li')
        for li in li_list:
            mv_link = 'https://www.4567tv.co' + li.xpath('./a/@href').extract_first()
            ex = self.conn.sadd('mv_link', mv_link)
            if ex:
                print('New data can be crawled..........................')
                yield scrapy.Request(url=mv_link, callback=self.parse_detail)
            else:
                print('No new data to crawl!!!!!!!!!!!!!!!!!!!!!!!!!')

    def parse_detail(self, response):
        title = response.xpath('//dt[@class="name"]/text()').extract_first()
        pro = response.xpath('//div[@class="ee"]/text()').extract_first()
        item = MvproItem()
        item['title'] = title
        item['position'] = pro
        yield item
# Requirement: a data-fingerprint-based incremental crawler that grabs the text jokes on qiushibaike
# spider coding:
import scrapy
from qiubai.items import QiubaiItem
import hashlib
from redis import Redis

class QbSpider(scrapy.Spider):
    conn = Redis('127.0.0.1', 6379)
    name = 'qb'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            content = div.xpath('./a[1]/div[@class="content"]/span[1]/text()').extract_first()
            fp = hashlib.md5(content.encode('utf-8')).hexdigest()
            ex = self.conn.sadd('fp', fp)
            if ex:
                print('Update data needs to be crawled........................')
                item = QiubaiItem()
                item['content'] = content
                yield item
            else:
                print('No data updates!!!!!!!!!!!!!!!!!!!!!!!!')
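The spider above imports QiubaiItem, whose definition is not shown; a minimal item sketch matching the single field the spider fills in:

# items coding (sketch):
import scrapy


class QiubaiItem(scrapy.Item):
    content = scrapy.Field()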

5. Improving scrapy's data crawling efficiency

1. Increase concurrency:
    Scrapy's default maximum number of concurrent requests is 16, which can be raised as needed. Set CONCURRENT_REQUESTS = 100 in the settings configuration file to allow 100 concurrent requests.

2. Reduce the log level:
    Scrapy produces a lot of log output while running. To reduce CPU usage, set the log level to INFO or ERROR. Write in the configuration file: LOG_LEVEL = 'INFO'

3. Disable cookies:
    If cookies are not actually needed, disable them while scrapy crawls, which reduces CPU usage and improves crawling efficiency. Write in the configuration file: COOKIES_ENABLED = False

4. Disable retries:
    Retrying failed HTTP requests slows down the crawl, so retries can be turned off. Write in the configuration file: RETRY_ENABLED = False

5. Reduce download timeouts:
    When crawling very slow links, a smaller download timeout lets stuck requests be abandoned quickly, improving efficiency. Write in the configuration file: DOWNLOAD_TIMEOUT = 10 (a 10 second timeout)
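Collected in one place, the five tweaks above amount to a few lines in settings.py (a sketch using the values from this section):

CONCURRENT_REQUESTS = 100   # raise concurrency from the default of 16
LOG_LEVEL = 'ERROR'         # or 'INFO'; less logging means less CPU time
COOKIES_ENABLED = False     # skip cookie handling when cookies are not needed
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 10       # abandon very slow downloads after 10 seconds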

6. Virtual Environment

# Installation:
pip install virtualenvwrapper-win

# Common commands:
mkvirtualenv envname  # Create a virtual environment and switch to it automatically
workon envname  # Switch to a virtual environment
pip list  # List the packages installed in the current environment
rmvirtualenv envname  # Delete a virtual environment
deactivate  # Exit the current virtual environment
lsvirtualenv  # List all virtual environments
mkvirtualenv --python=C:\...\python.exe envname  # Create a virtual environment with a specific Python interpreter
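For example, a typical workflow for the projects in these notes might look like this (the environment name and package list are assumptions):

mkvirtualenv spiderenv                          # create the environment and switch into it
pip install scrapy scrapy_redis pymongo redis   # install the packages used above
deactivate                                      # leave the environment when done
workon spiderenv                                # switch back to it later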

Posted by Corona4456 on Mon, 19 Aug 2019 05:50:17 -0700