Scrapy: Writing a First Crawler, a Problem Summary

Keywords: Python pip Ubuntu encoding

Original Link: https://my.oschina.net/GloryMK/blog/1842367

Because of Python's strong showing in artificial intelligence, I have been exploring it recently and hope to deepen my understanding step by step; I also look forward to discussing all kinds of Python questions with you. Over a weekend at home I wrote a small crawler, and the problems I ran into along the way are summarized here.

(1) Programming environment

System: Ubuntu 18.04

Python: 2.7.15

Scrapy: 1.5.0

lxml: 4.2.3.0

IDE: PyCharm

(2) Install Scrapy

Check your environment with the following commands:

pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ python
Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ pip --version
pip 9.0.1 from /usr/lib/python2.7/dist-packages (python 2.7)

Install Scrapy

pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ pip install scrapy
Collecting scrapy
  Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
    100% |████████████████████████████████| 256kB 188kB/s
    // Long installation process
Successfully installed Twisted-17.9.0 scrapy-1.4.0

During the installation of Scrapy, the install failed with a permissions error (no permission to write the package files), so I re-ran it as the superuser, that is, by prefixing the pip command with sudo.

(3) Create Scrapy Project

After installing Scrapy, you can create a project with the command-line tool "scrapy startproject projectname". This is a global command and does not need to be run inside a project. The session looks like this:

pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ scrapy startproject SpiderDemo
New Scrapy project 'SpiderDemo', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/pythoner/Desktop/douban/SpiderDemo

You can start your first spider with:
    cd SpiderDemo
    scrapy genspider example example.com

When you create a project, a new project directory is created under the directory you ran the command in, with the following structure:

SpiderDemo/
    scrapy.cfg            # deployment configuration file

    SpiderDemo/           # the project's Python module
        __init__.py

        items.py          # item definitions (data containers)

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # spider classes defining how to crawl one or more sites
            __init__.py

Create a DoubanSpider class to perform the crawl, in a douban_spider.py file under SpiderDemo/spiders, with the following code:

# -*- coding: utf-8 -*-
import scrapy
import urlparse


class DoubanSpider(scrapy.Spider):
    # The spider's name defines how Scrapy locates (and instantiates) the spider, so it must be unique
    name = 'douban_spider'
    # URLs whose domains are not in this list will not be crawled
    allowed_domains = ['www.imooc.com']
    # starting URL list
    start_urls = ['http://www.imooc.com/course/list']

    def parse(self, response):
        learn_nodes = response.css('a.course-card')
        for learn_node in learn_nodes:
            learn_url = learn_node.css("::attr(href)").extract_first()
            yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url), callback=self.parse_learn)

    def parse_learn(self, response):
        title = response.xpath('//h2[@class="l"]/text()').extract_first()
        content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
        url = response.url
        print ('Title:' + title)
        print ('Address:' + url)
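The parse method joins each extracted href with the page URL via urlparse.urljoin, so relative links resolve correctly. A minimal sketch of that behaviour, using urllib.parse, where Python 3 now keeps the Python 2 urlparse module:

```python
# Python 2's urlparse.urljoin is urllib.parse.urljoin in Python 3;
# the behaviour is the same.
from urllib.parse import urljoin

base = 'http://www.imooc.com/course/list'

# A relative href, as extracted from a course card node, resolves against base:
print(urljoin(base, '/learn/994'))   # http://www.imooc.com/learn/994

# An absolute href passes through unchanged:
print(urljoin(base, 'http://www.imooc.com/learn/997'))
```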

Execute crawler

pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ scrapy crawl douban_spider

During execution, a character encoding error was reported as follows:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/pythoner/Desktop/douban/douban/spiders/douban_spider.py", line 21, in parse_learn
    print ('Title:' + title)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

This problem arises because Ubuntu's default Python is version 2.7. After some searching, I found that Python 2.x's default string encoding is ASCII, while the crawled text contains UTF-8 characters; mixing the two triggers an implicit decode with the ASCII codec, which fails. The fix applied here is to set the default encoding to UTF-8.
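The error can be reproduced outside Scrapy: decoding UTF-8 bytes with the ASCII codec fails as soon as a non-ASCII byte appears. A small sketch, written for Python 3 where the decode is explicit (the byte 0xe6 from the traceback is the first byte of many UTF-8-encoded Chinese characters):

```python
# UTF-8 bytes for a Chinese word ("title"); the first byte is 0xe6, as in the traceback.
raw = '标题'.encode('utf-8')
assert raw[0] == 0xe6

# Decoding with the ASCII codec fails, mirroring the UnicodeDecodeError above:
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)

# Decoding with the correct codec succeeds:
print(raw.decode('utf-8'))
```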

Open the file that raised the error and add the following lines after the imports:

# -*- coding: utf-8 -*-
import scrapy
import urlparse
import sys

if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')

class DoubanSpider(scrapy.Spider):
    # The spider's name defines how Scrapy locates (and instantiates) the spider, so it must be unique
    name = 'douban_spider'
    # URLs whose domains are not in this list will not be crawled
    allowed_domains = ['www.imooc.com']
    # starting URL list
    start_urls = ['http://www.imooc.com/course/list']

    def parse(self, response):
        learn_nodes = response.css('a.course-card')
        for learn_node in learn_nodes:
            learn_url = learn_node.css("::attr(href)").extract_first()
            yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url), callback=self.parse_learn)

    def parse_learn(self, response):
        title = response.xpath('//h2[@class="l"]/text()').extract_first()
        content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
        url = response.url
        print ('Title:' + title)
        print ('Address:' + url)

After adding these lines, run the spider again; the crawl now succeeds, as shown below.

2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/994> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/995> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/997> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/984> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/998> (referer: http://www.imooc.com/course/list)
Title: C4D Ground Polygon Modeling
 Address: http://www.imooc.com/learn/987
 Title: Combining TensorFlow with Flask for Handwritten Number Recognition
 Address: http://www.imooc.com/learn/994
 Title: Modeling C4D Cosmetics Set
 Address: http://www.imooc.com/learn/995
 Title: Module System for Java9
 Address: http://www.imooc.com/learn/997
 Title: MAYA - Mapping Basis
 Address: http://www.imooc.com/learn/984
 Title: Unity 3D Reversal Game Development
 Address: http://www.imooc.com/learn/998
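As an aside, the reload(sys)/setdefaultencoding workaround changes an interpreter-wide setting. An alternative I find cleaner (my own variation, not part of the fix above, with a placeholder title) is to encode the unicode value explicitly at the print site:

```python
# -*- coding: utf-8 -*-
# A unicode title, as extract_first() returns it (placeholder value for illustration):
title = u'示例标题'

# Python 2 alternative to setdefaultencoding: encode explicitly when printing,
#     print('Title:' + title.encode('utf-8'))
# Under Python 3 every str is already unicode, so no workaround is needed:
print('Title:' + title)
```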

I will continue to study Scrapy in more depth later, and I look forward to discussing it with you.


Posted by rReLmy on Sun, 08 Sep 2019 20:05:37 -0700