Section 341: Python distributed crawler builds a search engine - Scrapy explained in detail - writing the spider file to crawl content in a loop - passing values to the callback function through the meta attribute - Scrapy's built-in image downloader
Writing the spider file to crawl content in a loop
The Request() method hands the specified url address to the downloader, which downloads the page. It takes two parameters here:
Parameters:
url = 'url'
callback = the function that processes the downloaded page
Request() must be used with yield
The parse.urljoin() method belongs to the urllib library and joins urls automatically: if the url in the second parameter is a relative path, it is joined with the first parameter.
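A quick illustration of the joining behavior (the urls here are illustrative):

from urllib import parse

base = 'http://blog.jobbole.com/all-posts/'
print(parse.urljoin(base, '113678/'))                 # http://blog.jobbole.com/all-posts/113678/
print(parse.urljoin(base, '/113678/'))                # http://blog.jobbole.com/113678/
print(parse.urljoin(base, 'http://example.com/a/'))   # an absolute url is returned unchanged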
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request      # Request wraps a url and hands it to the downloader
from urllib import parse             # Import the parse module from the urllib library

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']                # Allowed domain name
    start_urls = ['http://blog.jobbole.com/all-posts/']   # Initial url

    def parse(self, response):
        """Get the article urls on the list page and hand them to the downloader"""
        # Get the article urls on the current page
        lb_url = response.xpath('//a[@class="archive-title"]/@href').extract()
        for i in lb_url:
            # print(parse.urljoin(response.url, i))
            # parse.urljoin() joins the relative url with the base url of the response
            # Hand each article url to the downloader; parse_wzhang is called after the download
            yield Request(url=parse.urljoin(response.url, i), callback=self.parse_wzhang)

        # Get the url of the next list page and feed it back to parse() in a loop
        x_lb_url = response.xpath('//a[@class="next page-numbers"]/@href').extract()
        if x_lb_url:
            yield Request(url=parse.urljoin(response.url, x_lb_url[0]), callback=self.parse)

    def parse_wzhang(self, response):
        title = response.xpath('//div[@class="entry-header"]/h1/text()').extract()   # Get article title
        print(title)
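With a standard Scrapy project layout, this spider is run by its name attribute:

scrapy crawl pach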
When Request() hands over a url, it can also pass a custom dictionary to the callback function through the meta attribute
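In outline, the hand-off looks like this (a minimal sketch; article_url and img_url are placeholder variables, the key name matches the spider below):

# in parse(): attach a dict to the request via meta
yield Request(url=article_url, meta={'lb_img': img_url}, callback=self.parse_wzhang)

# in the callback: read the dict back; .get() avoids a KeyError if the key is missing
def parse_wzhang(self, response):
    tp_img = response.meta.get('lb_img', '')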
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request      # Request wraps a url and hands it to the downloader
from urllib import parse             # Import the parse module from the urllib library
from adc.items import AdcItem        # Import the receiving class from the items data module

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']                # Allowed domain name
    start_urls = ['http://blog.jobbole.com/all-posts/']   # Initial url

    def parse(self, response):
        """Get the article urls on the list page and hand them to the downloader"""
        # Get the article blocks on the current page with a css selector
        lb = response.css('div .post.floated-thumb')
        # print(lb)
        for i in lb:
            lb_url = i.css('.archive-title ::attr(href)').extract_first('')   # Article url inside the block
            # print(lb_url)
            lb_img = i.css('.post-thumb img ::attr(src)').extract_first('')   # Article thumbnail inside the block
            # print(lb_img)
            # Hand each article url to the downloader, carrying the thumbnail url in meta
            yield Request(url=parse.urljoin(response.url, lb_url),
                          meta={'lb_img': parse.urljoin(response.url, lb_img)},
                          callback=self.parse_wzhang)

        # Get the url of the next list page and feed it back to parse() in a loop
        x_lb_url = response.css('.next.page-numbers ::attr(href)').extract_first('')
        if x_lb_url:
            yield Request(url=parse.urljoin(response.url, x_lb_url), callback=self.parse)

    def parse_wzhang(self, response):
        title = response.css('.entry-header h1 ::text').extract()   # Get article title
        # print(title)
        tp_img = response.meta.get('lb_img', '')   # Read the value from meta; .get() prevents an error
        # print(tp_img)
        shjjsh = AdcItem()                         # Instantiate the data receiving class
        shjjsh['title'] = title                    # Fill the data into the fields of the items module
        shjjsh['img'] = tp_img
        yield shjjsh                               # Hand the item to the pipelines.py processing module
Using Scrapy's built-in image downloader
Scrapy has a built-in image downloader, scrapy.pipelines.images.ImagesPipeline, which is specially used to download images to the local disk after the crawler has grabbed their urls
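For orientation, the built-in pipeline can even be enabled directly in settings.py without subclassing; a minimal sketch (the 'img' field name anticipates the items.py below, the save path is illustrative):

import os

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,   # enable the stock image downloader
}
IMAGES_URLS_FIELD = 'img'    # the items.py field that holds a list of image urls
IMAGES_STORE = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'img')   # where images are saved

The steps below do the same thing, but subclass the pipeline so the save path can be written back into the item.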
Step 1: after the crawler grabs the url address of the image, it fills it into the container class of the items.py file
Crawler file
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request      # Request wraps a url and hands it to the downloader
from urllib import parse             # Import the parse module from the urllib library
from adc.items import AdcItem        # Import the receiving class from the items data module

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']                # Allowed domain name
    start_urls = ['http://blog.jobbole.com/all-posts/']   # Initial url

    def parse(self, response):
        """Get the article urls on the list page and hand them to the downloader"""
        # Get the article blocks on the current page with a css selector
        lb = response.css('div .post.floated-thumb')
        for i in lb:
            lb_url = i.css('.archive-title ::attr(href)').extract_first('')   # Article url inside the block
            lb_img = i.css('.post-thumb img ::attr(src)').extract_first('')   # Article thumbnail inside the block
            # Hand each article url to the downloader, carrying the thumbnail url in meta
            yield Request(url=parse.urljoin(response.url, lb_url),
                          meta={'lb_img': parse.urljoin(response.url, lb_img)},
                          callback=self.parse_wzhang)

        # Get the url of the next list page and feed it back to parse() in a loop
        x_lb_url = response.css('.next.page-numbers ::attr(href)').extract_first('')
        if x_lb_url:
            yield Request(url=parse.urljoin(response.url, x_lb_url), callback=self.parse)

    def parse_wzhang(self, response):
        title = response.css('.entry-header h1 ::text').extract()   # Get article title
        tp_img = response.meta.get('lb_img', '')   # Read the value from meta; .get() prevents an error
        shjjsh = AdcItem()                         # Instantiate the data receiving class
        shjjsh['title'] = title                    # Fill the data into the fields of the items module
        shjjsh['img'] = [tp_img]
        yield shjjsh                               # Hand the item to the pipelines.py processing module
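Note the one change from the meta example above: shjjsh['img'] = [tp_img] wraps the url in a list, because the field named by IMAGES_URLS_FIELD must hold a list of urls for the image downloader to iterate over.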
Step 2: set up the container class of the items.py file to receive the data obtained by the crawler

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# items.py receives the data obtained by the crawler; it acts as the container file

class AdcItem(scrapy.Item):          # Container class for the information obtained by the crawler
    title = scrapy.Field()           # Receives the title obtained by the crawler
    img = scrapy.Field()             # Receives the thumbnail urls
    img_tplj = scrapy.Field()        # Path where the image is saved
Step 3: use the Scrapy built-in image downloader in pipelines.py
1. First, import the built-in image downloader
2. Define a custom image downloader that inherits from Scrapy's built-in ImagesPipeline class
3. Use the item_completed() method of the ImagesPipeline class to get the save path of the downloaded image
4. In the settings.py settings file, register the custom image downloader class and set the image save path
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline   # Import the built-in image downloader module

class AdcPipeline(object):                   # Data processing class
    def process_item(self, item, spider):    # Receives the item yielded at the end of the crawler
        print('The title of the article is: ' + item['title'][0])
        print('The thumbnail url is: ' + item['img'][0])
        print('The path where the thumbnail is saved is: ' + item['img_tplj'])   # Path filled in by the image downloader
        return item

class imgPipeline(ImagesPipeline):           # Custom image downloader, inherits from Scrapy's built-in ImagesPipeline class
    def item_completed(self, results, item, info):   # item_completed() gives access to the save path of the downloaded image
        for ok, value in results:
            img_lj = value['path']           # Save path of the downloaded image
            item['img_tplj'] = img_lj        # Fill the save path into the img_tplj field of items.py
        return item                          # Hand the item back to the container class of the items.py file

# Note: after defining the custom image downloader, register it in settings.py
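For reference, item_completed() receives results as a list of (success, info) two-tuples; when success is True, info is a dict holding the image's url, path (relative to IMAGES_STORE) and checksum. Roughly (the values here are illustrative):

results = [
    (True, {'url': 'http://blog.jobbole.com/xxx.jpg',                      # original image url (hypothetical)
            'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',   # save path relative to IMAGES_STORE
            'checksum': '2f92a3acc0480e93dfb7db0a4a86f71f'}),              # checksum of the image file
]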
In the settings.py settings file, register the custom image downloader class and set the image save path
IMAGES_URLS_FIELD sets the field holding the urls of the images to download; generally this is the receiving field in items.py
IMAGES_STORE sets the path where the images are saved
import os

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'adc.pipelines.AdcPipeline': 300,   # Register the adc.pipelines.AdcPipeline class; the number is the execution priority
    'adc.pipelines.imgPipeline': 1,     # Register the custom image downloader; the smaller the number, the higher the priority
}

IMAGES_URLS_FIELD = 'img'               # The items.py field that holds the urls of the images to download
lujin = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(lujin, 'img')   # Set the image save path
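Two practical notes, assuming a default setup: the built-in image downloader depends on the Pillow library (pip install Pillow if it is missing), and after a crawl the downloaded images land under an img/full/ directory, with item['img_tplj'] holding the path relative to IMAGES_STORE.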