Learning Python, day 018 --- Python distributed crawler to build a search engine with Scrapy

Keywords: Python Attribute

Section 341 of the Python distributed crawler / search engine course, Scrapy in detail: writing the spiders crawler file to crawl content in a loop; using the meta attribute of Request() to return specified values to the callback function; Scrapy's built-in image downloader.

Writing the spiders crawler file to crawl content in a loop

The Request() method hands the specified url address to the downloader to download the page. The key parameters are:
Parameters:
  url='url'
  callback = page-processing function (the callback that parses the downloaded response)
A Request() must be yielded when it is used.

The parse.urljoin() method is a method of the urllib library that joins urls automatically: if the url passed as the second argument is a relative path, it is joined onto the first argument; if it is already an absolute url, it is returned unchanged.
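
A quick sketch of the behaviour (the article path here is made up for illustration):

from urllib import parse

#Relative path: joined onto the base url
print(parse.urljoin('http://blog.jobbole.com/all-posts/', '/110287/'))
# http://blog.jobbole.com/110287/

#Already an absolute url: returned unchanged
print(parse.urljoin('http://blog.jobbole.com/all-posts/', 'http://blog.jobbole.com/110287/'))
# http://blog.jobbole.com/110287/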

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request                             #Request hands a url to the downloader and returns the response to a callback
from urllib import parse                                    #Import parse module from urllib Library

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']                  #Allowed crawl domain
    start_urls = ['http://blog.jobbole.com/all-posts/']     #Initial url

    def parse(self, response):
        """
        //Get the url address of the list page and give it to the downloader
        """
        #Get the current page article url
        lb_url = response.xpath('//a[@class="archive-title"]/@href').extract()  #Get article list url
        for i in lb_url:
            # print(parse.urljoin(response.url, i))                                            #parse.urljoin() joins the relative article url onto the current page url
            yield Request(url=parse.urljoin(response.url, i), callback=self.parse_wzhang)      #Hand each article url to the downloader; the downloaded response goes to the parse_wzhang callback

        #Get the next page list url, give it to the downloader, and return it to the parse function loop
        x_lb_url = response.xpath('//a[@class="next page-numbers"]/@href').extract()         #Get next page article list url
        if x_lb_url:
            yield Request(url=parse.urljoin(response.url, x_lb_url[0]), callback=self.parse)     #Get the url of the next page and return it to the downloader, and call back to the parse function


    def parse_wzhang(self,response):
        title = response.xpath('//div[@class="entry-header"]/h1/text()').extract()           #Get article title
        print(title)

Besides the url, the Request() function can also pass a custom dictionary to the callback function through its meta parameter; the callback reads it from response.meta.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request                             #Request hands a url to the downloader and returns the response to a callback
from urllib import parse                                    #Import parse module from urllib Library
from adc.items import AdcItem                               #Import the item container class from the items.py module

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']                  #Allowed crawl domain
    start_urls = ['http://blog.jobbole.com/all-posts/']     #Initial url

    def parse(self, response):
        """
        //Get the url address of the list page and give it to the downloader
        """
        #Get the current page article url
        lb = response.css('div .post.floated-thumb')  #Get article list block, css selector
        # print(lb)
        for i in lb:
            lb_url = i.css('.archive-title ::attr(href)').extract_first('')     #Get the article url in the block
            # print(lb_url)

            lb_img = i.css('.post-thumb img ::attr(src)').extract_first('')     #Get thumbnails of articles in the block
            # print(lb_img)

            yield Request(url=parse.urljoin(response.url, lb_url), meta={'lb_img':parse.urljoin(response.url, lb_img)}, callback=self.parse_wzhang)      #Hand the article url to the downloader with the thumbnail url in meta; the response goes to the parse_wzhang callback

        #Get the next page list url, give it to the downloader, and return it to the parse function loop
        x_lb_url = response.css('.next.page-numbers ::attr(href)').extract_first('')         #Get next page article list url
        if x_lb_url:
            yield Request(url=parse.urljoin(response.url, x_lb_url), callback=self.parse)     #Get the url of the next page and return it to the downloader, and call back to the parse function


    def parse_wzhang(self,response):
        title = response.css('.entry-header h1 ::text').extract()           #Get article title
        # print(title)

        tp_img = response.meta.get('lb_img', '')                            #Read the value passed via meta; .get() with a default avoids a KeyError if the key is missing
        # print(tp_img)

        shjjsh = AdcItem()                          #Instantiate the item container class
        shjjsh['title'] = title                     #Fill the scraped data into the fields defined in items.py
        shjjsh['img'] = tp_img

        yield shjjsh                                #Yield the item so it is passed on to the pipelines.py processing module
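
The AdcItem used above is the item container defined in items.py. A minimal sketch of what this spider needs, assuming only the two fields used here (the full version, which adds a field for the saved image path, appears in the image-downloader section below):

import scrapy

class AdcItem(scrapy.Item):    #Container class that receives the data scraped by the spider
    title = scrapy.Field()     #Article title
    img = scrapy.Field()       #Thumbnail url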

Using Scrapy's built-in image downloader

Scrapy has a built-in image downloader, scrapy.pipelines.images.ImagesPipeline, which downloads images to local disk after the crawler has grabbed their urls.
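
As an aside, the stock downloader can also be enabled purely from settings.py, without subclassing it. A minimal sketch, assuming Pillow is installed (ImagesPipeline depends on it) and that the item stores its image urls in a field named img:

# settings.py -- minimal use of the stock ImagesPipeline, no custom pipeline class
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,   #Enable Scrapy's built-in image downloader directly
}
IMAGES_STORE = 'img'                               #Directory where downloaded images are saved
IMAGES_URLS_FIELD = 'img'                          #Item field holding the image urls (the default field name is 'image_urls')

The steps below instead subclass the downloader so that the save path can be copied back into the item.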

Step 1: after the crawler grabs the image url, fill it into the item container defined in the items.py file

Crawler file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request                             #Request hands a url to the downloader and returns the response to a callback
from urllib import parse                                    #Import parse module from urllib Library
from adc.items import AdcItem                               #Import the item container class from the items.py module

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']                  #Allowed crawl domain
    start_urls = ['http://blog.jobbole.com/all-posts/']     #Initial url

    def parse(self, response):
        """
        //Get the url address of the list page and give it to the downloader
        """
        #Get the current page article url
        lb = response.css('div .post.floated-thumb')  #Get article list block, css selector
        # print(lb)
        for i in lb:
            lb_url = i.css('.archive-title ::attr(href)').extract_first('')     #Get the article url in the block
            # print(lb_url)

            lb_img = i.css('.post-thumb img ::attr(src)').extract_first('')     #Get thumbnails of articles in the block
            # print(lb_img)

            yield Request(url=parse.urljoin(response.url, lb_url), meta={'lb_img':parse.urljoin(response.url, lb_img)}, callback=self.parse_wzhang)      #Hand the article url to the downloader with the thumbnail url in meta; the response goes to the parse_wzhang callback

        #Get the next page list url, give it to the downloader, and return it to the parse function loop
        x_lb_url = response.css('.next.page-numbers ::attr(href)').extract_first('')         #Get next page article list url
        if x_lb_url:
            yield Request(url=parse.urljoin(response.url, x_lb_url), callback=self.parse)     #Get the url of the next page and return it to the downloader, and call back to the parse function


    def parse_wzhang(self,response):
        title = response.css('.entry-header h1 ::text').extract()           #Get article title
        # print(title)

        tp_img = response.meta.get('lb_img', '')                            #Read the value passed via meta; .get() with a default avoids a KeyError if the key is missing
        # print(tp_img)

        shjjsh = AdcItem()                          #Instantiate the item container class
        shjjsh['title'] = title                     #Fill the scraped data into the fields defined in items.py
        shjjsh['img'] = [tp_img]                    #The image downloader expects a list of urls, so wrap the thumbnail url in a list

        yield shjjsh                                #Yield the item so it is passed on to the pipelines.py processing module

Step 2: define the item container in the items.py file to receive the data obtained by the crawler

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

#items.py is used to receive the data scraped by the crawler; it acts as the container file

class AdcItem(scrapy.Item):    #Set the information container class obtained by the crawler
    title = scrapy.Field()     #Receive the title information obtained by the crawler
    img = scrapy.Field()       #Receive thumbnails
    img_tplj = scrapy.Field()  #Save path of the downloaded image, filled in by the image downloader pipeline

Step 3: use Scrapy's built-in image downloader in pipelines.py

1. First, import the built-in image downloader

2. Define a custom image download pipeline class that inherits Scrapy's built-in ImagesPipeline image downloader class

3. Use the item_completed() method of the ImagesPipeline class to get the path where each downloaded image was saved (the shape of results is sketched after this list)

4. In the settings.py settings file, register the custom image downloader class and set the image save path
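
For context, a sketch of what results looks like inside item_completed(): it is a list of (success, info) two-tuples, one per downloaded image, and when success is True the info dict carries the original url, the save path relative to IMAGES_STORE, and a checksum (the values below are illustrative):

results = [
    (True, {'url': 'http://blog.jobbole.com/example.jpg',                 #Original image url (made up)
            'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',  #Save path relative to IMAGES_STORE
            'checksum': '2b00042f7481c7b056c4b410d28f33cf'}),             #Checksum of the downloaded image
]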

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline  #Import Scrapy's built-in image downloader pipeline


class AdcPipeline(object):                      #Regular item-processing pipeline class
    def process_item(self, item, spider):       #process_item() receives each item; it holds the data object yielded by the crawler
        print('The article title is: ' + item['title'][0])
        print('The article thumbnail url is: ' + item['img'][0])
        print('The article thumbnail is saved at: ' + item['img_tplj'])  #Path filled in by the image downloader after the image has been downloaded

        return item

class imgPipeline(ImagesPipeline):                      #Custom image download pipeline that inherits Scrapy's built-in ImagesPipeline class
    def item_completed(self, results, item, info):      #item_completed() of ImagesPipeline reports where each downloaded image was saved
        for ok, value in results:
            img_lj = value['path']     #Save path of the image, relative to IMAGES_STORE
            # print(ok)
            item['img_tplj'] = img_lj  #Fill the image save path into the field defined in items.py
        return item                    #Return the item so it is passed on to the next pipeline

Note: after defining the custom image downloader, register it in the settings.py settings file and set the image save path.

IMAGES_URLS_FIELD sets the item field that holds the urls of the images to download; this is normally the field that receives them in items.py.
IMAGES_STORE sets the path under which downloaded images are saved.

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'adc.pipelines.AdcPipeline': 300,  #Register the adc.pipelines.AdcPipeline class; the number sets the execution order
   'adc.pipelines.imgPipeline': 1,    #Register the custom image downloader; the smaller the number, the earlier the pipeline runs
}

import os                                             #Needed below to build the image save path

IMAGES_URLS_FIELD = 'img'                             #Item field that holds the urls of the images to download, i.e. the img field in items.py
lujin = os.path.abspath(os.path.dirname(__file__))    #Directory of this settings.py file
IMAGES_STORE = os.path.join(lujin, 'img')             #Save downloaded images into an img folder next to settings.py


Posted by yasir_memon on Mon, 17 Feb 2020 02:32:19 -0800