Because of Python's outstanding performance in artificial intelligence, I have been exploring the language recently, and I hope to build up my understanding of it step by step. I also look forward to discussing all kinds of Python questions with you. Over a weekend at home I wrote a small crawler program, and this post summarizes the problems I ran into along the way.
(1) Programming environment
System: Ubuntu 18.04
Python: 2.7.15
Scrapy: 1.5.0
lxml: 4.2.3.0
IDE: PyCharm
(2) Install Scrapy
Check your running environment with the following commands:

pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ python
Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ pip --version
pip 9.0.1 from /usr/lib/python2.7/dist-packages (python 2.7)
Install Scrapy
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ pip install scrapy
Collecting scrapy
  Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
    100% |████████████████████████████████| 256kB 188kB/s
(long installation output omitted)
Successfully installed Twisted-17.9.0 scrapy-1.4.0
During the installation of Scrapy, pip reported that the installation failed because it did not have permission to write files, so I reran the installation as superuser by adding sudo before pip, i.e. sudo pip install scrapy.
(3) Create Scrapy Project
After installing Scrapy, you can create a project using the command-line tool with "scrapy startproject projectname". This is a global command and does not need to be run inside a project. The output is as follows:
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ scrapy startproject SpiderDemo
New Scrapy project 'SpiderDemo', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/pythoner/Desktop/douban/SpiderDemo

You can start your first spider with:
    cd SpiderDemo
    scrapy genspider example example.com
Creating a project adds a new project directory at the location where you ran the command, with the following structure:
SpiderDemo/
    scrapy.cfg            # deployment configuration file
    SpiderDemo/           # the project's Python module
        __init__.py
        items.py          # data container (item definitions)
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # Spider classes defining how to crawl one or more sites
            __init__.py
Create a douban_spider.py spider in SpiderDemo/spiders to perform the crawl, with the following code:
# -*- coding: utf-8 -*-
import scrapy
import urlparse


class DoubanSpider(scrapy.Spider):
    # The spider's name defines how Scrapy locates (and initializes) it, so it must be unique
    name = 'douban_spider'
    # URLs whose domain names are not in this list will not be crawled
    allowed_domains = ['www.imooc.com']
    # list of starting URLs
    start_urls = ['http://www.imooc.com/course/list']

    def parse(self, response):
        learn_nodes = response.css('a.course-card')
        for learn_node in learn_nodes:
            learn_url = learn_node.css("::attr(href)").extract_first()
            yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url),
                                 callback=self.parse_learn)

    def parse_learn(self, response):
        title = response.xpath('//h2[@class="l"]/text()').extract_first()
        content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
        url = response.url
        print('Title:' + title)
        print('Address:' + url)
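The parse method joins each relative course link onto the page URL with urlparse.urljoin before following it. A minimal sketch of what that join does, using Python 3's urllib.parse (where Python 2's urlparse functions now live) and a hypothetical root-relative href like the ones the course cards contain:

```python
# Python 3 moved urlparse.urljoin into urllib.parse; the semantics are the same
from urllib.parse import urljoin

page_url = 'http://www.imooc.com/course/list'
relative_href = '/learn/994'  # example root-relative href from a course card

# A root-relative path replaces the path portion of the base URL
full_url = urljoin(page_url, relative_href)
print(full_url)  # http://www.imooc.com/learn/994
```

This is why the spider can yield absolute Request URLs even though the page's anchors only carry relative paths.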
Execute the crawler:
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ scrapy crawl douban_spider
During execution, a character encoding error was reported as follows:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/pythoner/Desktop/douban/douban/spiders/douban_spider.py", line 21, in parse_learn
    print ('Title:' + title)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
This problem occurs because the default Python interpreter on Ubuntu 18.04 is version 2.7, and Python 2.x uses ASCII as its default string encoding. The scraped course titles contain UTF-8-encoded Chinese characters, so concatenating them with a byte string triggers an implicit ASCII decode, which fails. The workaround is to set the default encoding to UTF-8.
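The byte 0xe6 named in the traceback is simply the lead byte of a UTF-8-encoded CJK character, which is why ASCII (limited to bytes 0-127) cannot decode it. A quick illustration, run under Python 3 here where encoding is explicit (the sample character is my own, not taken from the actual page):

```python
# Many common Chinese characters encode to three UTF-8 bytes starting with 0xe6
s = u'数'  # e.g. a character that might appear in a course title
encoded = s.encode('utf-8')
print(encoded)     # b'\xe6\x95\xb0'
print(encoded[0])  # 230, i.e. 0xe6 -- outside ASCII's 0-127 range
```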
Open the file that raised the error and add the following lines after the imports:
# -*- coding: utf-8 -*-
import scrapy
import urlparse
import sys

if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')


class DoubanSpider(scrapy.Spider):
    # The spider's name defines how Scrapy locates (and initializes) it, so it must be unique
    name = 'douban_spider'
    # URLs whose domain names are not in this list will not be crawled
    allowed_domains = ['www.imooc.com']
    # list of starting URLs
    start_urls = ['http://www.imooc.com/course/list']

    def parse(self, response):
        learn_nodes = response.css('a.course-card')
        for learn_node in learn_nodes:
            learn_url = learn_node.css("::attr(href)").extract_first()
            yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url),
                                 callback=self.parse_learn)

    def parse_learn(self, response):
        title = response.xpath('//h2[@class="l"]/text()').extract_first()
        content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
        url = response.url
        print('Title:' + title)
        print('Address:' + url)
After adding these lines, run the spider again; the crawl now succeeds, as shown below.
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/994> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/995> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/997> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/984> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/998> (referer: http://www.imooc.com/course/list)
Title: C4D Ground Polygon Modeling
Address: http://www.imooc.com/learn/987
Title: Combining TensorFlow with Flask for Handwritten Number Recognition
Address: http://www.imooc.com/learn/994
Title: Modeling C4D Cosmetics Set
Address: http://www.imooc.com/learn/995
Title: Module System for Java9
Address: http://www.imooc.com/learn/997
Title: MAYA - Mapping Basis
Address: http://www.imooc.com/learn/984
Title: Unity 3D Reversal Game Development
Address: http://www.imooc.com/learn/998
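As a side note, the reload(sys)/setdefaultencoding trick is widely discouraged because it changes interpreter-wide behavior. A sketch of an alternative that avoids it is to encode Unicode text explicitly before mixing it with byte strings; the helper name below is my own, not part of the original spider:

```python
import sys


def to_printable(text):
    # On Python 2, encode unicode to UTF-8 bytes before concatenating with byte strings;
    # on Python 3 every str is already Unicode, so pass it through unchanged.
    if sys.version_info[0] == 2 and isinstance(text, unicode):  # noqa: F821
        return text.encode('utf-8')
    return text


line = 'Title:' + to_printable(u'数学')
print(line)
```

Under this approach, parse_learn would call to_printable(title) instead of relying on a changed default codec.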
I will continue to study Scrapy in more depth later, and I hope we can discuss it together.
Reprinted at: https://my.oschina.net/GloryMK/blog/1842367