Recently I have been learning to write crawlers with the Scrapy framework. In short, a crawler fetches pages from the Internet, parses them, and then stores and analyzes the resulting data, turning page parsing into data storage. The parsing techniques I used while learning, the usage and more advanced features of each Scrapy module, what I picked up about distributed crawling, and the problems I ran into together with their solutions are all recorded here as a summary and memorandum, in the hope that they will also help anyone who needs them.
This article mainly explains the pipeline module for saving data, including storing data as Json files, storing data in a MySQL database, and saving images.
Scrapy provides the pipeline mechanism for saving data. In a newly created Scrapy project, a pipelines.py file is generated automatically with a default Pipeline class. We can define our own Pipeline classes as needed and enable them in the settings.py file, as follows:
# Enable the Pipeline classes used to process data; the number after each class
# is its priority, ranging from 0 to 1000. Classes with smaller values run first.
ITEM_PIPELINES = {
    'StackoverFlowSpider.pipelines.StackoverflowspiderPipeline': 2,
}
Next, we customize a Pipeline class to store Items in a Json file.
A Pipeline class processes data in its process_item() method, and its close_spider() method is called when the spider finishes, so these are the two methods we need to implement; a minimal skeleton is sketched after the tips below.
Two tips:
- After process_item() has finished, return the item so that subsequent Pipeline classes can continue to operate on it.
- Remember to release resources in close_spider().
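A bare-bones sketch of such a class (the class name here is just a placeholder) could look like this; the concrete examples in the following sections all follow the same shape:

class ExamplePipeline:
    def process_item(self, item, spider):
        # ... process or store the item here ...
        return item  # hand the item on to the next Pipeline class

    def close_spider(self, spider):
        # ... release files, connections and other resources here ...
        pass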
1. Custom Pipeline Stores Json Data
import json
import codecs

class StackJsonPipeline:
    # Specify the file to be operated on at initialization
    def __init__(self):
        self.file = codecs.open('questions.json', 'w', encoding='utf-8')

    # Store data: write each Item instance to the file as json data
    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    # Close the file IO stream after processing
    def close_spider(self, spider):
        self.file.close()
2. Store Json data using exporter provided by Scrapy
Scrapy provides a JsonItemExporter class for exporting Json data, which is very convenient. Below is a custom Pipeline class that uses it for storage.
from scrapy.exporters import JsonItemExporter

class JsonExporterPipeline:
    # Export a json file using the exporter provided by Scrapy
    def __init__(self):
        self.file = open('questions_exporter.json', 'wb')
        # Initialize the exporter instance, passing the output file and encoding
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        # Start exporting
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    # Export each Item instance to the json file
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
The above are two examples of custom Pipeline classes that generate Json data. One thing to emphasize is that the file produced by the exporter is actually a single Json array, whereas the json-module version writes one Json object per line. Below are screenshots of the two files generated with the classes above: the first contains many separate pieces of Json data, and the second is one array of Json data:
- Files generated using json module
- Files generated using scrapy.exporters.JsonItemExporter
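For reference, the pipelines in this article assume an Item with the question fields used below; a minimal sketch (the class name QuestionItem is taken from section 4, the field names from the MySQL pipelines, and image_url from section 5):

import scrapy

class QuestionItem(scrapy.Item):
    question_title = scrapy.Field()
    question_votes = scrapy.Field()
    question_answers = scrapy.Field()
    question_views = scrapy.Field()
    tags = scrapy.Field()
    image_url = scrapy.Field()  # only needed for the ImagesPipeline example in section 5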
3. Save the data to MySQL database
Here's a Pipeline class that saves our data to the MySQL database
# Here we use the mysql-connector-python driver, which can be installed with pip
import mysql.connector

class MysqlPipeline:
    def __init__(self):
        self.conn = mysql.connector.connect(user='root', password='root', database='stack_db')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        title = item.get('question_title')
        votes = item.get('question_votes')
        answers = item.get('question_answers')
        views = item.get('question_views')
        tags = item.get('tags')
        insert_sql = """
            insert into stack_questions(title, votes, answers, views, tags)
            VALUES (%s, %s, %s, %s, %s);
        """
        self.cursor.execute(insert_sql, (title, votes, answers, views, tags))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
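The pipeline assumes that a stack_questions table already exists in stack_db. A rough sketch of creating it with the same mysql-connector-python driver (the column names match the insert above, but the column types are assumptions you may want to adjust):

import mysql.connector

conn = mysql.connector.connect(user='root', password='root', database='stack_db')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS stack_questions (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        votes INT,
        answers INT,
        views INT,
        tags VARCHAR(255)
    )
""")
conn.commit()
cursor.close()
conn.close()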
After configuring MysqlPipeline in the settings file, the crawled data is stored in the MySQL database. Writing the SQL statement directly in process_item(), as above, is convenient, but in real projects it is better to encapsulate the SQL inside each specific Item class. The pipeline can then call a different SQL statement depending on which Item it receives, which greatly improves the extensibility of the program and the reusability of our crawler code.
4. Implementing Asynchronous Operation of MySQL Storage
Although the Pipeline class above can write data to MySQL, it processes the data synchronously. When a large amount of data is crawled, the inserts cannot keep up with the speed at which pages are crawled and parsed, and the pipeline becomes a bottleneck. To solve this, the MySQL storage needs to be made asynchronous. The Twisted framework, on which Scrapy is built, provides a connection pool through which data can be inserted into MySQL asynchronously.
Below is a Pipeline class implemented with Twisted that performs the MySQL operations asynchronously. The pymysql module is used here. When the Pipeline is initialized, the connection parameters are used to create the database connection pool dbpool; in process_item() the pool is then told which method to run and which data to pass to it. Instead of writing the SQL statement out as in the examples above, we encapsulate it in the specific Item class, so that this Pipeline class can handle different kinds of data.
import pymysql
from twisted.enterprise import adbapi

class MysqlTwistedPipline(object):
    def __init__(self):
        dbparms = dict(
            host='localhost',
            db='stack_db',
            user='root',
            passwd='root',
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,  # Specify the cursor type
            use_unicode=True,
        )
        # Specify the database driver module name and the connection parameters
        self.dbpool = adbapi.ConnectionPool("pymysql", **dbparms)

    # Use twisted to make the mysql insert asynchronous
    def process_item(self, item, spider):
        # Specify the method to run and the data to operate on
        query = self.dbpool.runInteraction(self.do_insert, item)
        # Specify the exception handler
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert:
        # build a different sql statement depending on the item and insert it into mysql
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
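In a real project the connection parameters are usually read from settings.py rather than hard-coded in __init__. A sketch using Scrapy's from_crawler hook (the MYSQL_* setting names are assumptions you would define yourself in settings.py; process_item, do_insert and handle_error stay the same as above):

import pymysql
from twisted.enterprise import adbapi

class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_crawler(cls, crawler):
        # Build the connection pool from values defined in settings.py
        settings = crawler.settings
        dbparms = dict(
            host=settings.get('MYSQL_HOST', 'localhost'),
            db=settings.get('MYSQL_DB'),
            user=settings.get('MYSQL_USER'),
            passwd=settings.get('MYSQL_PASSWORD'),
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparms)
        return cls(dbpool)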
The get_insert_sql() method code in QuestionItem is as follows:
def get_insert_sql(self):
    insert_sql = """
        insert into stack_questions(title, votes, answers, views, tags)
        VALUES (%s, %s, %s, %s, %s);
    """
    params = (self["question_title"], self["question_votes"], self["question_answers"],
              self["question_views"], self["tags"])
    return insert_sql, params
5. Use Scrapy's own Images Pipeline to save pictures
Basically everything above used our own custom Pipeline classes to handle the data. To finish, here is a brief introduction to a Pipeline class that Scrapy provides out of the box, ImagesPipeline.
By enabling and configuring this class, image data can be saved locally automatically while crawling. Its usage is briefly described below:
- Configure the settings file
Enable the class in ITEM_PIPELINES:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
- Configure the field to save and the local path
Because our data is encapsulated in the Item class, after enabling the ImagesPipeline class we need to tell it which field holds the image URLs and where to save the downloaded files.
The following variables need to be configured in settings:
# The field that holds the image URLs; here the field in the Item class is named image_url
IMAGES_URLS_FIELD = 'image_url'

import os
# Save the downloaded images to an images directory under the current project directory
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')

# Optionally filter out images smaller than a minimum height/width
# IMAGES_MIN_HEIGHT = 100
# IMAGES_MIN_WIDTH = 100
- Pass the image URLs as a list
ImagesPipeline expects the configured field to contain a list of URLs, even if there is only one; otherwise an error will be reported.
item_loader.add_value("image_url", [image_url])
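The same rule applies if you fill the item directly instead of going through an ItemLoader; a short sketch (QuestionItem and image_url match the settings above):

item = QuestionItem()
item['image_url'] = [image_url]  # always a list, even for a single URL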
After these three steps, run the crawl again; if the crawled content contains image data, the images are downloaded as described above.
That concludes the introduction to Pipeline classes. Scrapy ships with several more built-in Pipeline classes, and interested readers can consult the documentation to learn more about them.
With this, the basic operations in Scrapy have all been covered, from creating the crawler, crawling and parsing, and encapsulating Items, to saving data with Pipelines. The next article concludes with a complete example that crawls the Stack Overflow website.