Scrapy Initial Notes (4) -- Using Pipelines to Save Data

Keywords: JSON MySQL Database SQL

I have recently been learning to write crawlers with the Scrapy framework. In short, a crawler grabs pages from the Internet, parses them, and then stores and analyzes the data, so the work shifts from page parsing to data storage. I am recording the parsing techniques I used along the way, the usage and more advanced features of each Scrapy module, the knowledge needed for distributed crawlers, and the problems I ran into together with their solutions, both as a summary for myself and in the hope that it helps anyone who needs it.

This article focuses on saving data with pipelines: storing data as JSON files, storing data in a MySQL database, and saving images.

Scrapy provides the pipeline module to perform data saving operations. In a newly created Scrapy project, a pipelines.py file is generated automatically with a default Pipeline class. We can customize Pipeline classes as needed and enable them in the settings.py file, as follows:

# Specify the Pipeline classes used to process data. The number after each class is its priority, with values from 0 to 1000;
# Pipeline classes with smaller values are executed first
ITEM_PIPELINES = {
   'StackoverFlowSpider.pipelines.StackoverflowspiderPipeline': 2,
}
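For example, to run the JSON pipeline before the MySQL pipeline (both classes are defined later in this post), you could register both like this; items then flow through StackJsonPipeline first because its number is smaller:

ITEM_PIPELINES = {
   'StackoverFlowSpider.pipelines.StackJsonPipeline': 1,
   'StackoverFlowSpider.pipelines.MysqlPipeline': 2,
}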

Next, we customize a Pipeline class to store Items into a JSON file.

A Pipeline class processes data in its process_item() method and calls the close_spider() method when the spider finishes, so we need to implement these two methods to do the corresponding work; a minimal skeleton follows the tips below.

Two tips

  • After process_item() finishes, return the item so that subsequent Pipeline classes can continue to operate on it.
  • Remember to release resources in close_spider().
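A minimal, hedged skeleton showing these hooks (Scrapy also calls an optional open_spider() method when the spider starts, which is a natural place to acquire resources); the class name and file name below are placeholders:

class ExamplePipeline:
    def open_spider(self, spider):
        # Acquire resources when the spider starts (file handles, DB connections, ...)
        self.file = open('output.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Process each item, then return it so later Pipeline classes receive it
        self.file.write(str(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        # Release resources when the spider closes
        self.file.close()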

1. Custom Pipeline that stores JSON data

import json
import codecs
class StackJsonPipeline:

    # Specify the file to be operated on at initialization
    def __init__(self):
        self.file = codecs.open('questions.json', 'w', encoding='utf-8')

    # Store data and write Item instances to files as json data
    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item
    # Close the file IO stream after processing
    def close_spider(self, spider):
        self.file.close()

2. Store JSON data using the exporter provided by Scrapy

Scrapy provides a JsonItemExporter class for storing JSON data, which is very convenient. Below is a sample custom Pipeline class that uses it for storage.

from scrapy.exporters import JsonItemExporter
class JsonExporterPipeline:
    # Export json files by calling json exporter provided by scrapy
    def __init__(self):
        self.file = open('questions_exporter.json', 'wb')
        # Initialize the exporter instance, specifying the output file and encoding
        self.exporter = JsonItemExporter(self.file,encoding='utf-8',ensure_ascii=False)
        # Start the export
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    # Export Item instance to json file
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

The above are two examples of using custom Pipeline classes to generate JSON data. One thing worth emphasizing is the difference in output: the json-module version writes one JSON object per line, while the exporter version produces a single JSON array. I originally included screenshots of the two files generated by the classes above (omitted here); illustrative excerpts are sketched after the list below:

  • Files generated using json module

  • Files generated using scrapy.exporters.JsonItemExporter
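For example, with two hypothetical items, questions.json (produced by the json module version) contains one JSON object per line:

{"question_title": "How to parse JSON in Python?", "question_votes": "12"}
{"question_title": "What does yield do?", "question_votes": "38"}

while questions_exporter.json (produced by JsonItemExporter) holds roughly a single JSON array:

[{"question_title": "How to parse JSON in Python?", "question_votes": "12"},
{"question_title": "What does yield do?", "question_votes": "38"}]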

3. Save the data to a MySQL database

Here's a Pipeline class that saves our data to a MySQL database:

# Here we use the mysql-connector-python driver, which can be installed with pip
import mysql.connector

class MysqlPipeline:
    def __init__(self):
        self.conn = mysql.connector.connect(user='root', password='root', database='stack_db', )
        self.cursor = self.conn.cursor()


    def process_item(self, item, spider):

        title = item.get('question_title')
        votes = item.get('question_votes')
        answers = item.get('question_answers')
        views = item.get('question_views')
        tags = item.get('tags')
        insert_sql = """
            insert into stack_questions(title, votes, answers, views,tags)
            VALUES (%s, %s, %s, %s,%s);
        """
        self.cursor.execute(insert_sql, (title, votes, answers, views, tags))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

After configuring MysqlPipeline in the settings file, the crawled data is stored in the MySQL database. Writing the SQL statement directly in the process_item() method is convenient here, but in real development it is better to encapsulate the SQL statement in the corresponding Item class, so that the processing method can call a different SQL statement depending on which Item it receives; a brief sketch of this idea follows. This greatly improves the extensibility of the program and the reusability of our crawler code.
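As a sketch of that refactoring, process_item() could ask each Item class for its own statement, assuming every Item implements a get_insert_sql() method like the one shown in the next section:

    def process_item(self, item, spider):
        # Each Item class supplies its own SQL and parameters, so the pipeline
        # no longer hard-codes a statement for a single table
        insert_sql, params = item.get_insert_sql()
        self.cursor.execute(insert_sql, params)
        self.conn.commit()
        return item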

4. Implementing Asynchronous Operation of MySQL Storage

Although the Pipeline class above can write data into MySQL, it processes data synchronously. When the amount of crawled data is large, inserting the data cannot keep up with crawling and parsing pages, and becomes a bottleneck. To solve this, the MySQL storage needs to be made asynchronous. Twisted, the asynchronous networking framework that Scrapy itself is built on, provides a connection pool through which inserts into MySQL can be performed asynchronously.

The following Pipeline class, implemented with the Twisted framework, performs the MySQL operations asynchronously:

The pymysql module is used here. When the Pipeline is initialized, a database connection pool (dbpool) is created from the connection parameters. In process_item() the connection pool is then asked to run the actual insert method with the item data. Instead of writing out the SQL statement as in the examples above, we encapsulate it in the specific Item class, so that this Pipeline class can handle different kinds of data.

import pymysql
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):
    def __init__(self):
        dbparms = dict(
            host='localhost',
            db='stack_db',
            user='root',
            passwd='root',
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,  # Specify the cursor type
            use_unicode=True,
        )
        # Specify the driver module name and the database parameters for the connection pool
        self.dbpool = adbapi.ConnectionPool("pymysql", **dbparms)

    # Use Twisted's connection pool to insert into MySQL asynchronously
    def process_item(self, item, spider):
        # Specify the method to run and the data it operates on
        query = self.dbpool.runInteraction(self.do_insert, item)
        # Specify the exception handler
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert:
        # build the SQL statement from the specific Item and execute it
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
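As an aside, instead of hard-coding the connection parameters in __init__, the pool could be built from values in settings.py via Scrapy's from_crawler() hook. A hedged sketch, in which the MYSQL_* setting names are hypothetical and the remaining methods stay as above:

import pymysql
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_crawler(cls, crawler):
        # Read the database parameters from settings.py
        settings = crawler.settings
        dbparms = dict(
            host=settings.get('MYSQL_HOST', 'localhost'),
            db=settings.get('MYSQL_DBNAME', 'stack_db'),
            user=settings.get('MYSQL_USER', 'root'),
            passwd=settings.get('MYSQL_PASSWORD', 'root'),
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        return cls(adbapi.ConnectionPool("pymysql", **dbparms))

    # process_item(), handle_error() and do_insert() are unchanged from above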

The get_insert_sql() method code in QuestionItem is as follows:

    def get_insert_sql(self):
        insert_sql = """
            insert into stack_questions(title, votes, answers, views, tags)
            VALUES (%s, %s, %s, %s, %s);
        """
        params = (self["question_title"], self["question_votes"], self["question_answers"],
                  self["question_views"], self["tags"])
        return insert_sql, params
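For completeness, here is a hedged sketch of what the QuestionItem class might look like; the field definitions below are assumed from the SQL parameters and were not shown in the original post:

import scrapy

class QuestionItem(scrapy.Item):
    question_title = scrapy.Field()
    question_votes = scrapy.Field()
    question_answers = scrapy.Field()
    question_views = scrapy.Field()
    tags = scrapy.Field()

    # plus the get_insert_sql() method shown above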

5. Use Scrapy's built-in ImagesPipeline to save images

Everything above uses our own custom Pipeline classes to handle data. Now let's briefly look at a Pipeline class that Scrapy provides out of the box, ImagesPipeline. By enabling and configuring this class, image data can be saved locally automatically while crawling (note that ImagesPipeline depends on the Pillow library for image processing). Here is a brief description of its usage:

  • Configure the settings file

Enable the pipeline in ITEM_PIPELINES:

ITEM_PIPELINES = {
  'scrapy.pipelines.images.ImagesPipeline': 1,
}
  • Configure saved fields and local paths

Because our data is encapsulated in Item classes, after enabling ImagesPipeline we need to tell it which Item field holds the image URLs and where to save the downloaded files.
The following settings need to be configured:

# The Item field that holds the image URLs to save; here the field is named image_url
IMAGES_URLS_FIELD = 'image_url'

import os
# Configure the data save path in the image directory under the current project directory
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')

# Optionally drop images smaller than a minimum height/width
# IMAGES_MIN_HEIGHT = 100
# IMAGES_MIN_WIDTH = 100
  • Pass the image URLs as a list

ImagesPipeline requires the image URL field to be passed as a list (even for a single URL), otherwise an error will be reported.

item_loader.add_value("image_url", [image_url])
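A hedged sketch of the corresponding Item: image_url matches the IMAGES_URLS_FIELD setting above and must hold a list of URLs, while images is the default IMAGES_RESULT_FIELD that ImagesPipeline fills with the download results (the class name is hypothetical):

import scrapy

class ImageItem(scrapy.Item):
    image_url = scrapy.Field()   # list of image URLs to download
    images = scrapy.Field()      # populated by ImagesPipeline after downloading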

After completing the three steps above, crawl again: if the crawled content contains image data, the images will be downloaded as described.

That covers the basic usage of Pipeline classes. Scrapy also ships with more built-in Pipeline classes; interested readers can study them further in the documentation.
With this, the basic operations in Scrapy have all been covered, from creating a crawler, crawling and parsing, and encapsulating Items, to saving data with Pipelines. The next post will conclude with a complete example of crawling the Stack Overflow website.
