Website address: http://www.17500.cn/p5/all.php
1. Create a new crawler project
scrapy startproject pfive
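If the command succeeds, Scrapy generates a project skeleton roughly like this (file names can differ slightly between Scrapy versions); items.py, settings.py and middlewares.py are the files edited in the steps below:

pfive/
    scrapy.cfg
    pfive/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py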
2. Create a new crawler in the spiders directory
scrapy genspider pfive_spider www.17500.cn
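genspider drops a spider template into the spiders directory that looks roughly like this (the exact stub depends on the Scrapy version); the start URL it guesses from the domain is replaced in the next step:

import scrapy


class PfiveSpiderSpider(scrapy.Spider):
    name = 'pfive_spider'
    allowed_domains = ['www.17500.cn']
    start_urls = ['http://www.17500.cn/']

    def parse(self, response):
        pass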
3. Modify the entry URL in the crawler file
start_urls = ['http://www.17500.cn/p5/all.php']
4. Define the item to crawl in items.py

import scrapy


class PfiveItem(scrapy.Item):
    # Draw number
    awardID = scrapy.Field()
    # Draw date
    awardDate = scrapy.Field()
    # Winning numbers
    awardNum = scrapy.Field()
5. Write the crawler and parse the page with XPath

import scrapy

from pfive.items import PfiveItem


class PfiveSpiderSpider(scrapy.Spider):
    name = 'pfive_spider'
    allowed_domains = ['www.17500.cn']
    start_urls = ['http://www.17500.cn/p5/all.php']

    def parse(self, response):
        # Each draw is one table row with bgcolor="#ffffff"
        rows = response.xpath("//table/tbody/tr/td/table/tbody/tr[3]/td[@class='normal']/table/tbody/tr[@bgcolor='#ffffff']")
        for row in rows:
            item = PfiveItem()
            item['awardID'] = row.xpath('./td[1]/text()').extract_first()
            item['awardDate'] = row.xpath('./td[2]/text()').extract_first()
            item['awardNum'] = row.xpath('./td[3]/text()').extract_first()
            yield item
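At this point the spider can already be run from the project root; the optional -o flag just dumps the yielded items to a file so the result is easy to inspect:

scrapy crawl pfive_spider -o result.json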
6. Ignore the robots.txt file in the configuration file (for learning only)
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
7. Set the user agent in the configuration file.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
8. Write the startup file main.py
from scrapy import cmdline

cmdline.execute('scrapy crawl pfive_spider'.split())
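With main.py saved in the project root, starting the crawl from an IDE or a terminal is just:

python main.py

which is equivalent to running the scrapy crawl command by hand.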
Everything runs fine, but nothing gets scraped!
Checking response.text shows that the table data is loaded asynchronously. A quick Baidu search for how to handle this kind of page turned up:
https://blog.csdn.net/dangsh_/article/details/78633566
That blogger solved the problem with the Selenium automated testing package.
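One way to confirm the asynchronous loading yourself is the Scrapy shell: fetch the page and run the same XPath used in step 5; if the rows are injected by JavaScript after the page loads, the selector on the raw response comes back empty:

scrapy shell "http://www.17500.cn/p5/all.php"
>>> response.xpath("//table/tbody/tr/td/table/tbody/tr[3]/td[@class='normal']/table/tbody/tr[@bgcolor='#ffffff']")
[]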
9. First write the downloader middleware, then register it in the configuration file.
import time

from scrapy.http import HtmlResponse
from selenium import webdriver


class JavaScriptMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == "pfive_spider":
            driver = webdriver.Chrome("G:\\Crawler\\chromedriver.exe")  # Path to the chromedriver executable
            driver.get(request.url)
            time.sleep(1)
            js = "var q=document.documentElement.scrollTop=10000"
            driver.execute_script(js)  # Execute JS to simulate a user scrolling the page to the bottom
            time.sleep(3)
            body = driver.page_source
            url = driver.current_url
            driver.quit()  # Close the browser so a new instance is not leaked on every request
            print("Visit " + request.url)
            return HtmlResponse(url, body=body, encoding='utf-8', request=request)
        else:
            return None
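The registration mentioned above goes in settings.py; assuming the class was saved in pfive/middlewares.py, it would look roughly like this (543 is just the conventional priority used in the Scrapy docs):

DOWNLOADER_MIDDLEWARES = {
    'pfive.middlewares.JavaScriptMiddleware': 543,
}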
Note: the chromedriver.exe version must match the version of the locally installed Chrome browser.
http://chromedriver.storage.googleapis.com/index.html
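To check that the versions line up, run the downloaded driver with --version and compare it with what chrome://version reports in the browser's address bar:

chromedriver --version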
OK, that's it. It's done.
No, wait: this only gets the data on the first page... I'll come back to that later.