Crawling Da Le Tou (the big lotto) data with the Scrapy framework


github project address: https://github.com/v587xpt/lottery_spider

Last time I crawled the Shuangseqiu (double-color ball) lottery data; crawling Da Le Tou (the big lotto) is actually just as simple. You could do it with plain requests (a rough sketch is shown below for comparison), but to get more practice, this crawler uses the Scrapy framework.
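For comparison, here is a minimal requests-based sketch. It assumes the same history-page URL and table structure that the Scrapy spider below uses, so treat it as illustrative only:

import requests
from lxml import etree

# URL and XPath are assumed to match the Scrapy spider further down.
url = "http://www.lottery.gov.cn/historykj/history.jspx?_ltype=dlt"
headers = {"User-Agent": "Mozilla/5.0"}

html = etree.HTML(requests.get(url, headers=headers, timeout=10).text)
for row in html.xpath("//div[@class='yylMain']//div[@class='result']//tbody//tr"):
    cells = [c.strip() for c in row.xpath(".//td//text()") if c.strip()]
    print(cells[:8])    # draw number, 5 front-area balls, 2 back-area balls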

How the Scrapy framework works internally is not covered here. If you are not familiar with it, look it up (Google it) first.


I. Create the project

I develop on Windows, so the framework needs to be installed on Windows first (e.g. with pip install scrapy).

1. Open cmd and run

scrapy startproject lottery_spider

This command generates a lottery_spider project in the directory where it is run.
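The generated layout follows Scrapy's default template and should look roughly like this (file names can vary slightly between Scrapy versions):

lottery_spider/
    scrapy.cfg
    lottery_spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py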
2. Run cd lottery_spider to enter the lottery_spider project, then execute

scrapy genspider lottery "www.lottery.gov.cn"

lottery is the name of the crawler (spider) file;

www.lottery.gov.cn is the target website;

After it is created, a crawler file named lottery.py will be generated under the project's spiders folder.
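At this point lottery.py only contains Scrapy's default skeleton, roughly as follows (the exact start_urls that genspider fills in may differ):

# -*- coding: utf-8 -*-
import scrapy

class LotterySpider(scrapy.Spider):
    name = 'lottery'
    allowed_domains = ['www.lottery.gov.cn']
    start_urls = ['http://www.lottery.gov.cn/']

    def parse(self, response):
        pass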

II. Code for each file in the project

1. items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class LotterySpiderItem(scrapy.Item):
    qihao = scrapy.Field()
    bule_ball = scrapy.Field()
    red_ball = scrapy.Field()

This file defines the data model, i.e. the fields of the item that will be scraped (draw number, blue balls, red balls).
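For illustration, an item behaves like a dict and can be built field by field (the values here are made up):

from lottery_spider.items import LotterySpiderItem

item = LotterySpiderItem(qihao="19123",
                         bule_ball=["01", "05", "12", "23", "30"],
                         red_ball=["02", "11"])
print(dict(item))   # {'qihao': '19123', 'bule_ball': [...], 'red_ball': [...]}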
2. lottery.py

# -*- coding: utf-8 -*-
import scrapy
from lottery_spider.items import LotterySpiderItem

class LotterySpider(scrapy.Spider):
    name = 'lottery'
    allowed_domains = ['gov.cn']        #Domain the spider is allowed to crawl; URLs outside this domain are ignored;
    start_urls = ['http://www.lottery.gov.cn/historykj/history.jspx?_ltype=dlt']    #Start page; crawling begins from this URL;

    def parse(self, response):
        #Use XPath to select the table rows that contain the data; this returns a list of selectors;
        results = response.xpath("//div[@class='yylMain']//div[@class='result']//tbody//tr")
        for result in results:  #results is a list and needs to be traversed with a for loop;
            qihao = result.xpath(".//td[1]//text()").get()
            bule_ball_1 = result.xpath(".//td[2]//text()").get()
            bule_ball_2 = result.xpath(".//td[3]//text()").get()
            bule_ball_3 = result.xpath(".//td[4]//text()").get()
            bule_ball_4 = result.xpath(".//td[5]//text()").get()
            bule_ball_5 = result.xpath(".//td[6]//text()").get()
            red_ball_1 = result.xpath(".//td[7]//text()").get()
            red_ball_2 = result.xpath(".//td[8]//text()").get()

            bule_ball_list = []     #Define a list to store the five blue balls
            bule_ball_list.append(bule_ball_1)
            bule_ball_list.append(bule_ball_2)
            bule_ball_list.append(bule_ball_3)
            bule_ball_list.append(bule_ball_4)
            bule_ball_list.append(bule_ball_5)

            red_ball_list = []      #Define a list to store 2 red balls
            red_ball_list.append(red_ball_1)
            red_ball_list.append(red_ball_2)

            print("===================================================")
            print("❤Issue number:"+ str(qihao) + " ❤" + "Basketball:"+ str(bule_ball_list) + " ❤" + "Red ball" + str(red_ball_list))

            item = LotterySpiderItem(qihao = qihao,bule_ball = bule_ball_list,red_ball = red_ball_list)
            yield item

        next_url = response.xpath("//div[@class='page']/div/a[3]/@href").get()
        if not next_url:
            return
        else:
            last_url = "http://www.lottery.gov.cn/historykj/" + next_url
            yield scrapy.Request(last_url, callback=self.parse)  #Do not add () when passing the parse method as a callback.

This file is the spider itself, i.e. the code that actually runs the crawl;
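Once settings.py below is configured, the spider can be run from the project root with Scrapy's crawl command; Scrapy's built-in feed export (-o) can also dump the items to a file without any pipeline (the file name here is just an example):

scrapy crawl lottery
scrapy crawl lottery -o daletou_test.json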

3. pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

class LotterySpiderPipeline(object):
    def __init__(self):
        print("Crawler begins......")
        self.fp = open("daletou.json", 'w', encoding='utf-8')  # Open a json file

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), ensure_ascii=False)      #Note: the item must be converted to a dict before JSON serialization;
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self,spider):
        self.fp.close()
        print("Reptiles end......")

This file is responsible for saving the data; here each item is written to daletou.json as one line of JSON.
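Each line of daletou.json will then look roughly like this (the numbers are made-up examples):

{"qihao": "19123", "bule_ball": ["01", "05", "12", "23", "30"], "red_ball": ["02", "11"]}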
4. settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for lottery_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'lottery_spider'

SPIDER_MODULES = ['lottery_spider.spiders']
NEWSPIDER_MODULE = 'lottery_spider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'lottery_spider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False    #False: do not check the site's robots.txt rules;

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1      #Throttle the crawler to roughly one request per second
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {        #Default request headers, used to make requests look like they come from a browser;
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'lottery_spider.middlewares.LotterySpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'lottery_spider.middlewares.LotterySpiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {    #Uncomment this configuration so that pipelines.py can run;
   'lottery_spider.pipelines.LotterySpiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

This file is the configuration file for the whole crawler project;
5. start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl lottery".split())
#Equivalent to ↓
# cmdline.execute(["scrapy", "crawl", "lottery"])

This file is newly added. Once it is in place, you can run the project without typing the command in cmd.
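Run it from the project root like any other Python script (or via your IDE's run button):

python start.py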
