Scrapy Crawler in Practice: Scraping Sports Lottery "Permutation 5" Historical Data


Website address: http://www.17500.cn/p5/all.php

1. Create a new crawler project

scrapy startproject pfive
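For orientation, the command generates roughly this layout (Scrapy 1.x file names; the files edited in later steps are noted):

pfive/
    scrapy.cfg            # deploy configuration
    pfive/
        __init__.py
        items.py          # item definition (step 4)
        middlewares.py    # downloader middleware (step 9)
        pipelines.py      # item pipelines
        settings.py       # project settings (steps 6 and 7)
        spiders/
            __init__.py   # the spider from step 2 lives here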

2. Create a new spider in the spiders directory

scrapy genspider pfive_spider www.17500.cn
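genspider only produces a skeleton; steps 3 to 5 fill it in. The generated spiders/pfive_spider.py looks roughly like this:

# -*- coding: utf-8 -*-
import scrapy


class PfiveSpiderSpider(scrapy.Spider):
    name = 'pfive_spider'
    allowed_domains = ['www.17500.cn']
    start_urls = ['http://www.17500.cn/']

    def parse(self, response):
        pass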

3. Modify the start URL in the spider file

start_urls = ['http://www.17500.cn/p5/all.php']

4. Define the item to crawl (items.py)

import scrapy


class PfiveItem(scrapy.Item):
    # Draw number
    awardID = scrapy.Field()
    # Draw date
    awardDate = scrapy.Field()
    # Winning numbers
    awardNum = scrapy.Field()
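The spider in the next step yields these items; out of the box Scrapy only logs them. A minimal pipeline sketch (my addition, not part of the original walkthrough) that appends each draw to a JSON-lines file via pipelines.py:

import json


class PfivePipeline(object):
    def open_spider(self, spider):
        self.file = open('pfive.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; keep non-ASCII text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

Enabling it requires ITEM_PIPELINES = {'pfive.pipelines.PfivePipeline': 300} in settings.py.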

5. Write the spider to parse the page with XPath

import scrapy

from pfive.items import PfiveItem


class PfiveSpiderSpider(scrapy.Spider):
    name = 'pfive_spider'
    allowed_domains = ['www.17500.cn']
    start_urls = ['http://www.17500.cn/p5/all.php']

    def parse(self, response):
        # One <tr bgcolor="#ffffff"> per draw, nested several tables deep
        rows = response.xpath("//table/tbody/tr/td/table/tbody/tr[3]/td[@class='normal']/table/tbody/tr[@bgcolor='#ffffff']")
        for row in rows:
            item = PfiveItem()
            item['awardID'] = row.xpath('./td[1]/text()').extract_first()
            item['awardDate'] = row.xpath('./td[2]/text()').extract_first()
            item['awardNum'] = row.xpath('./td[3]/text()').extract_first()
            yield item
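Before launching a full crawl, the XPath can be sanity-checked interactively with scrapy shell. A hypothetical session (the empty result foreshadows the problem discovered below):

$ scrapy shell "http://www.17500.cn/p5/all.php"
>>> response.xpath("//table/tbody/tr/td/table/tbody/tr[3]/td[@class='normal']/table/tbody/tr[@bgcolor='#ffffff']")
[]
>>> response.text  # inspect the raw HTML the server actually returned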

6. Disable robots.txt compliance in settings.py (for learning purposes only)

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

7. Enable a user agent in settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

8. Write the startup file main.py

from scrapy import cmdline
cmdline.execute('scrapy crawl pfive_spider'.split())
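Running python main.py is equivalent to invoking scrapy crawl pfive_spider from the project root. As a no-pipeline alternative for saving results, Scrapy's -o feed-export flag could be passed the same way (my variation, not in the original):

from scrapy import cmdline

# Same crawl, but additionally export all scraped items to a CSV file
cmdline.execute('scrapy crawl pfive_spider -o pfive.csv'.split())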

Everything runs, but nothing gets scraped!

Checking response.text shows that the table data is loaded asynchronously. A quick Baidu search for how to handle this kind of page turns up:

https://blog.csdn.net/dangsh_/article/details/78633566

The blogger there solves this problem with the Selenium browser-automation package.

9. Write the downloader middleware and register it in the configuration (registration shown after the code).

import time

from scrapy.http import HtmlResponse
from selenium import webdriver


class JavaScriptMiddleware(object):

    def process_request(self, request, spider):
        if spider.name == "pfive_spider":
            # Path to the local chromedriver binary
            driver = webdriver.Chrome("G:\\Crawler\\chromedriver.exe")
            driver.get(request.url)
            time.sleep(1)
            # Execute JS to scroll to the bottom, simulating a user browsing the page
            js = "var q=document.documentElement.scrollTop=10000"
            driver.execute_script(js)
            time.sleep(3)  # give the asynchronous table time to render
            body = driver.page_source
            print("Visit " + request.url)
            # Hand the rendered HTML back to Scrapy in place of the raw download
            return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
        else:
            return None
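The "register it in the configuration" part means adding the middleware to DOWNLOADER_MIDDLEWARES in settings.py; assuming the class was saved in pfive/middlewares.py, the entry looks like this:

DOWNLOADER_MIDDLEWARES = {
    'pfive.middlewares.JavaScriptMiddleware': 543,
}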

Note: the chromedriver.exe version must match the version of the locally installed Chrome browser.

http://chromedriver.storage.googleapis.com/index.html

OK, that's it. It's done.

 

Not quite: this only captures the first page of data... I'll fix that in a follow-up.
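One possible direction for that follow-up (an untested sketch: the '下一页' next-page link text and the page cap are assumptions about the site): have the middleware click through the paginator and collect every rendered page before building the response.

import time

from selenium.common.exceptions import NoSuchElementException


def fetch_all_pages(driver, max_pages=500):
    # Collect the rendered HTML of every result page by clicking "next" repeatedly
    pages = [driver.page_source]
    for _ in range(max_pages):
        try:
            driver.find_element_by_link_text('下一页').click()  # assumed link text
        except NoSuchElementException:
            break  # no next-page link left, so this was the last page
        time.sleep(1)  # give the asynchronous table time to re-render
        pages.append(driver.page_source)
    return pages

The collected bodies would then need to be merged, or wrapped in separate HtmlResponse objects, before the spider's parse method can see more than the first page.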

Posted by gid on Fri, 01 Nov 2019 04:32:47 -0700