Scrapy crawler in practice - crawling historical data for the sports lottery Permutation 5

Keywords: PHP Windows Selenium encoding

Website address:

1. Create a new crawler project

scrapy startproject pfive

2. Create a new spider in the spiders directory

scrapy genspider pfive_spider <domain>

3. Set the start URL in the spider file

start_urls = ['']

4. Define the item fields to crawl

import scrapy


class PfiveItem(scrapy.Item):
    # Lottery draw ID
    awardID = scrapy.Field()
    # Lottery date
    awardDate = scrapy.Field()
    # Lottery number
    awardNum = scrapy.Field()

5. Write the spider and parse the page with XPath

import scrapy
from pfive.items import PfiveItem


class PfiveSpiderSpider(scrapy.Spider):
    name = 'pfive_spider'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        # Select the white-background data rows of the nested results table
        rows = response.xpath("//table/tbody/tr/td/table/tbody/tr[3]/td[@class='normal']/table/tbody/tr[@bgcolor='#ffffff']")
        for row in rows:
            pfiveItem = PfiveItem()
            # Columns: draw ID, draw date, winning number
            pfiveItem['awardID'] = row.xpath('./td[1]/text()').extract_first()
            pfiveItem['awardDate'] = row.xpath('./td[2]/text()').extract_first()
            pfiveItem['awardNum'] = row.xpath('./td[3]/text()').extract_first()
            yield pfiveItem
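
The row-by-row extraction in parse can be tried offline. The following is only a sketch: it uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors, and the SAMPLE markup, its values and the parse_rows helper are hypothetical stand-ins, not the real page. It shows the same idea of selecting rows by their bgcolor attribute and then reading cells by position relative to each row.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the rendered results table.
SAMPLE = """
<table>
  <tbody>
    <tr bgcolor="#cccccc"><td>ID</td><td>Date</td><td>Number</td></tr>
    <tr bgcolor="#ffffff"><td>19001</td><td>2019-01-01</td><td>1 2 3 4 5</td></tr>
    <tr bgcolor="#ffffff"><td>19002</td><td>2019-01-02</td><td>6 7 8 9 0</td></tr>
  </tbody>
</table>
"""


def parse_rows(markup):
    """Return one dict per data row, keyed like the PfiveItem fields."""
    root = ET.fromstring(markup)
    items = []
    # Keep only the white-background rows, skipping the header row.
    for row in root.findall(".//tr[@bgcolor='#ffffff']"):
        items.append({
            # Positions are relative to the current row, as in './td[1]'.
            "awardID": row.find("./td[1]").text,
            "awardDate": row.find("./td[2]").text,
            "awardNum": row.find("./td[3]").text,
        })
    return items


ITEMS = parse_rows(SAMPLE)
for item in ITEMS:
    print(item)
```

ElementTree supports only a subset of XPath (no text() step, for example), so the cell text is read through the .text attribute instead.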

6. Disable robots.txt compliance in the configuration file (for learning purposes only)

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

7. Uncomment the user agent setting in the configuration file.

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

8. Write the startup file

from scrapy import cmdline
cmdline.execute('scrapy crawl pfive_spider'.split())

It runs, but nothing is scraped!

Checking response.text shows that the table data is loaded asynchronously, so it is missing from the HTML that Scrapy downloads. I searched Baidu for how to deal with this kind of page.

I solved this problem with Selenium, a browser automation testing package.
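
The check on response.text mentioned above can be sketched roughly as follows. The helper name contains_data_rows and both HTML samples are hypothetical. The idea is simply to look for a marker of the data rows (here the bgcolor value used in the XPath) in the raw markup and compare it with what a browser shows after the scripts have run.

```python
def contains_data_rows(html, marker="#ffffff"):
    """Return True if the markup already contains the white data rows."""
    return marker in html.lower()


# Hypothetical raw HTML as a plain HTTP download would see it:
# the table is empty and is filled in later by a script.
RAW_HTML = (
    "<html><body><table id='result'></table>"
    "<script src='load.js'></script></body></html>"
)

# Hypothetical HTML after a browser has run the script.
RENDERED_HTML = (
    "<html><body><table id='result'>"
    "<tr bgcolor='#FFFFFF'><td>19001</td></tr>"
    "</table></body></html>"
)

print(contains_data_rows(RAW_HTML))       # rows missing from the raw download
print(contains_data_rows(RENDERED_HTML))  # rows present once rendered
```

If the marker is absent from response.text but visible in the browser's developer tools, the rows are being added by JavaScript, and the XPath in parse has nothing to match.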

9. Write a downloader middleware and enable it in the configuration.

from scrapy.http import HtmlResponse
from selenium import webdriver


class JavaScriptMiddleware(object):

    def process_request(self, request, spider):
        if spider.name == "pfive_spider":
            # Specify the browser to use (Selenium 3 style: path to chromedriver)
            driver = webdriver.Chrome(r"G:\Crawler\chromedriver.exe")
            driver.get(request.url)  # Load the page so that its JavaScript runs
            # Simulate a user action: scroll the page to the bottom
            js = "var q=document.documentElement.scrollTop=10000"
            driver.execute_script(js)
            body = driver.page_source
            url = driver.current_url
            driver.quit()  # Close the browser once the HTML has been captured
            print("Visit " + request.url)
            return HtmlResponse(url, body=body, encoding='utf-8', request=request)
        # Other spiders fall through to Scrapy's default downloader
        return None
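
Step 9 also calls for adding the middleware to the configuration, which the post does not show. A minimal sketch for settings.py, assuming the JavaScriptMiddleware class above was saved in pfive/middlewares.py (the module path and the priority value are assumptions, not taken from the original project):

```python
# settings.py (sketch): enable the Selenium-based downloader middleware.
# Assumed: JavaScriptMiddleware lives in pfive/middlewares.py.
DOWNLOADER_MIDDLEWARES = {
    # 543 is the priority Scrapy's project template uses for its
    # generated middleware; any free value in the usual range works.
    'pfive.middlewares.JavaScriptMiddleware': 543,
}
```

Without an entry like this, Scrapy never calls process_request on the class, and the spider keeps receiving the raw HTML without the table rows.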

Note: the version of chromedriver.exe must match the version of the Chrome browser installed locally.
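
The version requirement in the note can be checked with a few lines of Python. This sketch assumes that agreement on the major version (the first number) is what matters, which goes slightly beyond what the note says. The helper names and the version strings are hypothetical.

```python
def major_version(version):
    """Return the leading (major) component of a dotted version string."""
    return int(version.split(".")[0])


def versions_match(chrome_version, driver_version):
    """Return True when Chrome and chromedriver share a major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Hypothetical version strings, for illustration only.
MATCH = versions_match("78.0.3904.70", "78.0.3904.105")
MISMATCH = versions_match("78.0.3904.70", "77.0.3865.40")
print(MATCH, MISMATCH)
```

The Chrome version is shown on the browser's About page, and running chromedriver.exe with --version prints the driver's version.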

OK, that's it. It's done.


Well, not quite: this only gets the data on the first page... I'll fill in the rest later.

Posted by gid on Fri, 01 Nov 2019 04:32:47 -0700