Application of Scrapy and MongoDB

Keywords: MongoDB, database, Scrapy, Python

Link to the original text: http://www.cnblogs.com/JackQ/p/4843701.html

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl websites and extract structured data from their pages. Its most attractive feature is that anyone can easily modify it to suit their needs.

MongoDB is a very popular open-source NoSQL (non-relational) database that stores data as key-value documents. It offers great advantages in scenarios with large data volumes, high concurrency, and weak transaction requirements.
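
Each MongoDB record is a JSON-like document of key-value pairs; a chapter scraped by the spider below would end up stored roughly like this (the values are illustrative, the field names come from items.py further down):

{
    "bookName": "The Novel",
    "bookTitle": "Volume 1",
    "chapterNum": "Chapter 1",
    "chapterName": "Opening",
    "chapterURL": "http://www.daomubiji.com/..."
}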

What kind of spark will fly when Scrapy and MongoDB collide? Let's find out with a simple test: crawling a novel site.

1. Install Scrapy

pip install scrapy
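
To verify the installation, Scrapy's command-line tool can print its version:

scrapy version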

2. Download and install MongoDB and its visualization tool MongoVUE

[MongoDB download address](https://www.mongodb.org/)

The download and installation steps are omitted here; after installing, create a data folder under the bin directory to store the data.

[MongoVUE download address](http://www.mongovue.com/)

After installation, we need to create a database.
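
MongoDB creates databases and collections lazily, on first write, so the zzl database used below can also be created from Python. A minimal sketch, assuming MongoDB is running locally on the default port:

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
db = client['zzl']                              # not created on the server yet
db['Book'].insert_one({'placeholder': True})    # first write creates zzl/Book
db['Book'].delete_one({'placeholder': True})    # clean up the placeholder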

3. Create a Scrapy project

scrapy startproject novelspider

Directory structure: novspider.py is created manually (contrloDB can be ignored).
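
For reference, a typical layout after scrapy startproject plus the manually added spider looks roughly like this (newer Scrapy versions also generate a middlewares.py):

novelspider/
    scrapy.cfg
    novelspider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            novspider.py    # created manually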

4. Coding

Target website: http://www.daomubiji.com/

settings.py

BOT_NAME = 'novelspider'

SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'

ITEM_PIPELINES = {'novelspider.pipelines.NovelspiderPipeline': 300}  # enable the pipeline from pipelines.py; newer Scrapy expects a dict mapping class path to priority

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0'
COOKIES_ENABLED = True

MONGODB_HOST = '127.0.0.1'   
MONGODB_PORT = 27017
MONGODB_DBNAME = 'zzl'    # database name
MONGODB_DOCNAME = 'Book'   # collection name (MongoDB's equivalent of a table)

pipelines.py

import pymongo

class NovelspiderPipeline(object):
    # scrapy.conf was removed in newer Scrapy versions; read the settings
    # from the crawler instead
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbName = settings['MONGODB_DBNAME']
        self.client = pymongo.MongoClient(host=host, port=port)
        tdb = self.client[dbName]
        self.post = tdb[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        bookInfo = dict(item)
        self.post.insert_one(bookInfo)  # insert() is deprecated in pymongo 3+
        return item

items.py

from scrapy import Item, Field
class NovelspiderItem(Item):
	# define the fields for your item here like:
	# name = scrapy.Field()
	bookName = Field()
	bookTitle = Field()
	chapterNum = Field()
	chapterName = Field()
	chapterURL = Field()
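
Since each Field just declares a key, a NovelspiderItem behaves like a dict, which is exactly what lets the pipeline call dict(item). A quick illustration with made-up values:

from novelspider.items import NovelspiderItem

item = NovelspiderItem()
item['bookName'] = u'Some book'
item['chapterNum'] = u'Chapter 1'
print(dict(item))   # {'bookName': 'Some book', 'chapterNum': 'Chapter 1'}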

Create novspider.py in the spiders directory

from scrapy.spiders import Spider
from scrapy.selector import Selector
from novelspider.items import NovelspiderItem

# a plain Spider is sufficient here: CrawlSpider's rule-based crawling is not
# used, and overriding parse() on a CrawlSpider would break its rules anyway
class novSpider(Spider):
    name = "novspider"
    # redis_key is only used by scrapy-redis for distributed crawling;
    # it is inert in a plain Scrapy run
    redis_key = 'novspider:start_urls'
    start_urls = ['http://www.daomubiji.com/']

    def parse(self, response):
        selector = Selector(response)
        # each book on the index page is laid out as an HTML table
        table = selector.xpath('//table')
        for each in table:
            bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
            content = each.xpath('tr/td/a/text()').extract()
            url = each.xpath('tr/td/a/@href').extract()
            for i in range(len(url)):
                item = NovelspiderItem()
                item['bookName'] = bookName
                item['chapterURL'] = url[i]
                try:
                    # anchor text looks like "volume chapter-number chapter-name",
                    # separated by spaces
                    item['bookTitle'] = content[i].split(' ')[0]
                    item['chapterNum'] = content[i].split(' ')[1]
                except Exception:
                    continue
                try:
                    item['chapterName'] = content[i].split(' ')[2]
                except Exception:
                    item['chapterName'] = content[i].split(' ')[1][-3:]
                yield item
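
The try/except blocks guard against anchor texts that have fewer space-separated parts than expected. The splitting itself can be illustrated standalone (the sample string is invented; the real anchor texts on the site are Chinese):

content = u'Volume-1 Chapter-1 Corpse'
parts = content.split(' ')
print(parts[0])   # bookTitle:   'Volume-1'
print(parts[1])   # chapterNum:  'Chapter-1'
print(parts[2])   # chapterName: 'Corpse'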

5. Start the project with the command:

scrapy crawl novspider
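
Scrapy's feed exports can additionally dump the scraped items to a file for a quick look, alongside the MongoDB pipeline (the output file name is arbitrary):

scrapy crawl novspider -o chapters.json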

The crawl results can then be browsed in MongoVUE.
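
To confirm from Python that the documents were stored, a quick pymongo check (count_documents requires pymongo 3.7+; the database and collection names match settings.py):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['zzl']['Book']
print(collection.count_documents({}))    # total chapters stored
for doc in collection.find().limit(3):   # peek at a few documents
    print(doc)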
