Chapter 5 Using Item Pipeline to Process Data

Keywords: Python Data Mining

  In the previous chapter, we learned how to extract and encapsulate data. In this chapter, we learn how to process the crawled data. In Scrapy, an Item Pipeline is a component that processes data. An Item Pipeline is a class that implements a specific interface and is usually responsible for only one kind of data processing. Multiple Item Pipelines can be enabled at the same time in a project; they are cascaded in a specified order to form a data processing pipeline.
  The following are typical applications of Item Pipeline:
● cleaning data.
● verifying the validity of data.
● filtering out duplicate data.
● storing data in a database.

5.1 Item Pipeline

   Let's explain the use of Item Pipeline through an example. In the example project from Chapter 1, the prices of the books we crawled are in pounds sterling:

$ scrapy crawl books -o books.csv
...
$ head -5 books.csv # View the 5 lines at the beginning of the file
name,price
A Light in the Attic,£51.77
Tipping the Velvet,£53.74
Soumission,£50.10
Sharp Objects,£47.82

    If we expect the book prices to be in RMB, we need to multiply the sterling price by the exchange rate to calculate the RMB price (processing data). We can implement an Item Pipeline for price conversion to do this work. Next, let's implement it in the example project.

5.1.1 Implement Item Pipeline

    When a Scrapy project is created, a pipelines.py file is generated automatically; it is where user-defined Item Pipelines are placed. PriceConverterPipeline is implemented in pipelines.py of the example project. The code is as follows:

class PriceConverterPipeline(object):
	# Exchange rate of sterling to RMB
	exchange_rate = 8.5309
	
	def process_item(self, item, spider):
		# Extract the price field of item (e.g. £ 53.74)
		# Remove the previous pound sign £ and convert it to float type and multiply it by the exchange rate
		price = float(item['price'][1:]) * self.exchange_rate
		# Keep 2 decimal places and assign it back to the price field of item
		item['price'] = '¥%.2f' % price
		return item

The above code is explained as follows:
● An Item Pipeline does not need to inherit a specific base class; it only needs to implement certain methods, such as process_item, open_spider, and close_spider.
● An Item Pipeline must implement a process_item(item, spider) method, which is used to process each item of data crawled by the Spider. It has two parameters:
* item: an item of data (Item or dictionary) crawled by the Spider.
* spider: the Spider object that crawled this data.
  The implementation of the process_item method in the above code is very simple: it converts the book's sterling price to a floating-point number, multiplies it by the exchange rate and keeps two decimal places, assigns the result back to the price field of item, and finally returns the processed item.
  As you can see, process_item is the core method of an Item Pipeline. Two additional points need to be made about this method:
● If process_item returns an item of data (Item or dictionary) when processing an item, the returned data is delivered to the next Item Pipeline (if any) for further processing.
● If process_item raises a DropItem exception (scrapy.exceptions.DropItem) when processing an item, that item is discarded and is not delivered to subsequent Item Pipelines for further processing, nor is it exported to a file. Usually, we raise DropItem when we detect invalid data or want to filter data out, as in the sketch below.
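  For example, a minimal sketch of such a filter (not part of the example project; PriceValidatorPipeline is a hypothetical name) might drop items whose price field is missing or malformed. It would have to run before PriceConverterPipeline, since it assumes prices still start with the pound sign:

from scrapy.exceptions import DropItem

class PriceValidatorPipeline(object):
	def process_item(self, item, spider):
		price = item.get('price')
		if not price or not price.startswith('£'):
			# Raising DropItem discards the item; later pipelines never see it
			raise DropItem('Invalid price in %s' % item)
		return item
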
  In addition to process_item, which must be implemented, there are three other commonly used methods that can be implemented as needed:
● open_spider(self, spider)
   This method is called back when the Spider is opened (before data processing begins). It is usually used for initialization work, such as connecting to a database.
● close_spider(self, spider)
   This method is called back when the Spider is closed (after all data has been processed). It is usually used for cleanup work, such as closing the database.
● from_crawler(cls, crawler)
   This class method is called back when the Item Pipeline object is created. Usually, the configuration is read through crawler.settings in this method, and the Item Pipeline object is created according to the configuration.
  In the examples that follow, we show application scenarios for the above methods.
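  As a preview, here is a minimal sketch of a hypothetical JsonWriterPipeline (not part of the example project) that uses all three methods, writing each item to a JSON Lines file whose name is read from the configuration (JL_FILE is an assumed setting name):

import json

class JsonWriterPipeline(object):
	@classmethod
	def from_crawler(cls, crawler):
		# Read the output file name from settings, falling back to a default
		cls.file_name = crawler.settings.get('JL_FILE', 'items.jl')
		return cls()

	def open_spider(self, spider):
		# Initialization: open the output file before data processing starts
		self.file = open(self.file_name, 'w')

	def close_spider(self, spider):
		# Cleanup: close the file after all data has been processed
		self.file.close()

	def process_item(self, item, spider):
		self.file.write(json.dumps(dict(item)) + '\n')
		return item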

5.1.2 Enable Item Pipeline

  Item Pipeline is an optional component in Scrapy. To enable one (or more) Item Pipelines, you need to configure them in the configuration file settings.py:

ITEM_PIPELINES = {
	'example.pipelines.PriceConverterPipeline': 300,
}

  ITEM_PIPELINES is a dictionary. We add the Item Pipelines we want to enable to this dictionary: the key of each entry is the import path of an Item Pipeline class, and the value is a number from 0 to 1000. When multiple Item Pipelines are enabled at the same time, Scrapy determines their processing order according to these values: the smaller the value, the earlier that pipeline processes the data.
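  For example, when both the price-conversion pipeline and the deduplication pipeline from Section 5.2.1 are enabled, the entry with value 300 processes each item before the entry with value 350:

ITEM_PIPELINES = {
	'example.pipelines.PriceConverterPipeline': 300,	# runs first
	'example.pipelines.DuplicatesPipeline': 350,	# runs second
}
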
   After enabling PriceConverterPipeline, run the crawler again and observe the results:

$ scrapy crawl books -o books.csv
...
$ head -5 books.csv # View the 5 lines at the beginning of the file
name,price
A Light in the Attic,¥441.64
Tipping the Velvet,¥458.45
Soumission,¥427.40
Sharp Objects,¥407.95

5.2 More examples

  We have learned how to use an Item Pipeline to process data through an example. Now let's look at two practical examples.

5.2.1 Filtering duplicate data

  To ensure there are no duplicate entries in the crawled book information, we can implement a deduplication Item Pipeline. Here we use the book name as the primary key for deduplication (actually the ISBN should be used as the primary key, but only the book name and price are crawled; a sketch of an ISBN-based variant follows the explanation below). The code of DuplicatesPipeline is as follows:

from scrapy.exceptions import DropItem
class DuplicatesPipeline(object):
	def __init__(self):
		self.book_set = set() 
		
	def process_item(self, item, spider):
		name = item['name']
		if name in self.book_set:
			raise DropItem("Duplicate book found: %s" % item)
		self.book_set.add(name)
		return item

The above code is explained as follows:
● A constructor is added in which the set used to deduplicate book names is initialized.
● In the process_item method, first take out the name field of item and check whether the book name is already in the set book_set. If it is, the data is a duplicate and a DropItem exception is raised to discard the item; otherwise, the name field of item is added to the set and the item is returned.
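  As noted above, the ISBN would be a more reliable deduplication key than the book name. Here is a sketch of such a variant, assuming a hypothetical isbn field were also crawled (the example project only crawls name and price):

from scrapy.exceptions import DropItem

class IsbnDuplicatesPipeline(object):
	def __init__(self):
		self.isbn_set = set()

	def process_item(self, item, spider):
		# 'isbn' is an assumed field; it is not crawled in the example project
		isbn = item['isbn']
		if isbn in self.isbn_set:
			raise DropItem("Duplicate book found: %s" % item)
		self.isbn_set.add(isbn)
		return item
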
  Next, test DuplicatesPipeline. First, run the crawler without enabling DuplicatesPipeline and view the results:

$ scrapy crawl books -o book1.csv
...
$ cat -n book1.csv
1 price,name
2 ¥441.64,A Light in the Attic
3 ¥458.45,Tipping the Velvet
4 ¥427.40,Soumission
5 ¥407.95,Sharp Objects
6 ¥462.63,Sapiens: A Brief History of Humankind
7 ¥193.22,The Requiem Red
8 ¥284.42,The Dirty Little Secrets of Getting Your Dream J
...
993 ¥317.86,Bounty (Colorado Mountain #7)
994 ¥173.18,Blood Defense (Samantha Brinkman #1)
995 ¥295.60,"Bleach, Vol. 1: Strawberry and the Soul Reapers (Bleach #1)"
996 ¥370.07,Beyond Good and Evil
997 ¥473.72,Alice in Wonderland (Alice's Adventures in Wonderland #1)
998 ¥486.77,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)"
999 ¥144.77,A Spy's Devotion (The Regency Spies of London #1)
1000 ¥460.50,1st to Die (Women's Murder Club #1)
1001 ¥222.49,"1,000 Places to See Before You Die"

  There are 1000 books at this time.
    Then enable DuplicatesPipeline in the configuration file settings.py:

ITEM_PIPELINES = {
	'example.pipelines.PriceConverterPipeline': 300,
	'example.pipelines.DuplicatesPipeline': 350,
}

  Run the crawler and compare the results:

$ scrapy crawl books -o book2.csv
...
$ cat -n book2.csv
1 name,price
2 A Light in the Attic,¥441.64
3 Tipping the Velvet,¥458.45
4 Soumission,¥427.40
5 Sharp Objects,¥407.95
6 Sapiens: A Brief History of Humankind,¥462.63
7 The Requiem Red,¥193.22
8 The Dirty Little Secrets of Getting Your Dream Job,¥284.42
...
993 Blood Defense (Samantha Brinkman #1),¥173.18
994 "Bleach, Vol. 1: Strawberry and the Soul Reapers (Bleach #1)",995 Beyond Good and Evil,¥370.07
996 Alice in Wonderland (Alice's Adventures in Wonderland #1),¥473.72
997 "Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",¥486.77
998 A Spy's Devotion (The Regency Spies of London #1),¥144.77
999 1st to Die (Women's Murder Club #1),¥460.50
1000 "1,000 Places to See Before You Die",¥222.49

   There are only 999 books now, one fewer than before, which means two books had the same name. You can find the duplicate by browsing the crawler's log:

[scrapy.core.scraper] WARNING: Dropped: Duplicate book found:
{'name': 'The Star-Touched Queen', 'price': '¥275.55'}

5.2.2 Store data in MongoDB

    Sometimes we want to store the crawled data in a database, and we can implement an Item Pipeline to complete such a task. The following implements an Item Pipeline that stores data in a MongoDB database. The code is as follows:

from scrapy.item import Item
import pymongo

class MongoDBPipeline(object):
	# Hard-coded URI and name of the MongoDB database
	DB_URI = 'mongodb://localhost:27017/'
	DB_NAME = 'scrapy_data'

	def open_spider(self, spider):
		# Connect to the database when the Spider is opened
		self.client = pymongo.MongoClient(self.DB_URI)
		self.db = self.client[self.DB_NAME]

	def close_spider(self, spider):
		# Close the database connection when the Spider is closed
		self.client.close()

	def process_item(self, item, spider):
		# Use the Spider name as the collection name
		collection = self.db[spider.name]
		# insert_one needs a dict, so convert item if it is an Item object
		post = dict(item) if isinstance(item, Item) else item
		collection.insert_one(post)
		return item

  The above code is explained as follows.
● Two constants are defined as class attributes:
 * DB_URI: the URI of the database.
 * DB_NAME: the name of the database.
● During the whole crawl of a Spider, the database only needs to be connected and closed once: connect before data processing starts and close after all data has been processed. Therefore, the following two methods are implemented (called when the Spider is opened and closed, respectively):
 * open_spider(spider)
 * close_spider(spider)
  The open_spider and close_spider methods implement connecting to and closing the database, respectively.
● process_item implements the write to the MongoDB database. It obtains a collection using self.db and spider.name, then inserts the data into that collection. The collection's insert_one method must be passed a dictionary object (an Item object cannot be passed in), so the type of item is checked before the call; if item is an Item object, it is converted to a dictionary, as the short interactive example below shows.
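  A quick interactive illustration of that conversion (BookItem here is an assumed Item class matching the two crawled fields):

>>> from scrapy import Item, Field
>>> class BookItem(Item):       # assumed definition, not taken from the example project
...     name = Field()
...     price = Field()
...
>>> item = BookItem(name='Sharp Objects', price='¥407.95')
>>> dict(item)                  # this dictionary is what insert_one receives
{'name': 'Sharp Objects', 'price': '¥407.95'}
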
  Next, test MongoDBPipeline. Enable MongoDBPipeline in the configuration file settings.py:

ITEM_PIPELINES = {
	'example.pipelines.PriceConverterPipeline': 300,
	'example.pipelines.MongoDBPipeline': 400,
}

  Run the crawler and view the results in the database:

 $ scrapy crawl books
...
$ mongo
MongoDB shell version: 2.4.9
connecting to: test
> use scrapy_data
switched to db scrapy_data
> db.books.count()
1000
> db.books.find()
{ "_id" : ObjectId("58ae39a89dcd191973cc588f"), "price" : "¥441.64", Attic" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5890"), "price" : "¥458.45", Velvet" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5891"), "price" : "¥427.40", { "_id" : ObjectId("58ae39a89dcd191973cc5892"), "price" : "¥407.95", { "_id" : ObjectId("58ae39a89dcd191973cc5893"), "price" : "¥462.63", History of Humankind" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5894"), "price" : "¥193.22", Red" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5895"), "price" : "¥284.42", Secrets of Getting Your Dream Job" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5896"), "price" : "¥152.96", Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull" { "_id" : ObjectId("58ae39a89dcd191973cc5897"), "price" : "¥192.80", Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin { "_id" : ObjectId("58ae39a89dcd191973cc5898"), "price" : "¥444.89", { "_id" : ObjectId("58ae39a89dcd191973cc5899"), "price" : "¥119.35", (Triangular Trade Trilogy, #1)" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589a"), "price" : "¥176.25", Sonnets" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589b"), "price" : "¥148.95", { "_id" : ObjectId("58ae39a89dcd191973cc589c"), "price" : "¥446.08", Precious Little Life (Scott Pilgrim #1)" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589d"), "price" : "¥298.75", Again" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589e"), "price" : "¥488.39", Your Life: Scenes from the American Indie Underground, 1981-1991" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589f"), "price" : "¥203.72", { "_id" : ObjectId("58ae39a89dcd191973cc58a0"), "price" : "¥320.68", Best Science Fiction Stories 1800-1849" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a1"), "price" : "¥437.89", Beginners" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a2"), "price" : "¥385.34", Himalayas" }
Type "it" for more

   In the implementation above, the URI of the database and the name of the database are hard-coded. If you want to set them through the configuration file, only a small change is needed. The code is as follows:

from scrapy.item import Item
import pymongo
class MongoDBPipeline(object):

	@classmethod
	def from_crawler(cls, crawler):
		cls.DB_URI = crawler.settings.get('MONGO_DB_URI', 'mongodb://localhost:27017/')
		cls.DB_NAME = crawler.settings.get('MONGO_DB_NAME', 'scrapy_data')
		return cls()
		
	def open_spider(self, spider):
		self.client = pymongo.MongoClient(self.DB_URI)
		self.db = self.client[self.DB_NAME]
	
	def close_spider(self, spider):
		self.client.close()
	
	def process_item(self, item, spider):
		collection = self.db[spider.name]
		post = dict(item) if isinstance(item, Item) else item
		collection.insert_one(post)
		return item

  The above changes are explained as follows:
● The class method from_crawler(cls, crawler) is added, replacing the definitions of DB_URI and DB_NAME in the class attributes.
● If an Item Pipeline defines a from_crawler method, Scrapy calls this method to create the Item Pipeline object. The method has two parameters:
* cls: the Item Pipeline class object (here, the MongoDBPipeline class object).
* crawler: Crawler is a core object in Scrapy; the configuration file can be accessed through the settings attribute of crawler.
● In the from_crawler method, MONGO_DB_URI and MONGO_DB_NAME are read from the configuration file (default values are used if they do not exist) and assigned to attributes of cls, that is, to the MongoDBPipeline class attributes.
● No other code needs to change, because only the way the MongoDBPipeline class attributes are set has changed.
  Now we can specify the database to be used in the configuration file settings.py:

 MONGO_DB_URI = 'mongodb://192.168.1.105:27017/'
 MONGO_DB_NAME = 'liushuo_scrapy_data'

5.3 Summary of this chapter

    In this chapter, we learned how to use Item Pipeline to process crawled data. We first explained the application scenarios and basic usage of Item Pipeline with a simple example, and then showed two practical examples of Item Pipeline in action.
  This article is based on the PDF of Proficient in Scrapy Web Crawler (written by Liu Shuo), with the relevant code run by myself. The code content is slightly modified, for reference and note review only.

Posted by MuseiKaze on Mon, 01 Nov 2021 08:28:37 -0700