Python crawler: crawling comics with Scrapy framework (with source code)

Keywords: Programming Attribute zlib pip Python

Scrapy is an application framework for crawling websites and extracting structured data. More details about the framework can be found in the official documentation. This article walks through the general process of crawling comic images with it.

Scrapy environment configuration

The first step is installing Scrapy. The blogger uses macOS, so it is enough to run on the command line:

pip install Scrapy

The Beautiful Soup library is used to extract information from HTML nodes; see the previous article for its general usage. Install it with:

pip install beautifulsoup4

The html5lib parser is required when initializing the Beautiful Soup object for the target page. The install command is:

pip install html5lib

After installation, run the command directly on the command line:

scrapy

If you see output like the following, Scrapy has been installed successfully.

Scrapy 1.2.1 - no active project
​
Usage:
  scrapy <command> [options] [args]
​
Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  ...

Project creation

Create a project named Comics in the current directory from the command line:

scrapy startproject Comics

After creation, the project folder appears in the current directory. The generated Comics file structure is:

|____Comics
| |______init__.py
| |______pycache__
| |____items.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py
| | |______pycache__
|____scrapy.cfg

P.S. The command used to print the file structure above is:

find . -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'

The specific purpose of each file can be found in the official documentation. This implementation does not involve most of these files, so they are not described here.

1. Create the Spider class

Create a class to implement the specific crawling logic. All of our processing will live in this class, and it must be a subclass of scrapy.Spider.

Create a file named comics.py under the Comics/spiders directory.

The implementation of comics.py:

#coding:utf-8

import scrapy

class Comics(scrapy.Spider):

    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)

The custom class is a subclass of scrapy.Spider. Its name attribute is the crawler's unique identifier and is the argument passed to the crawl command. The other attributes and methods are explained below.

2. Run the crawler

After creating the custom class, switch to the Comics directory and run the following command to start the crawler task and begin crawling the page.

scrapy crawl comics

The output shows information printed while the crawler runs, along with the HTML source of the target page.

2016-11-26 22:04:35 [scrapy] INFO: Scrapy 1.2.1 started (bot: Comics)
2016-11-26 22:04:35 [scrapy] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'Comics', 'NEWSPIDER_MODULE': 'Comics.spiders', 'SPIDER_MODULES': ['Comics.spiders']}
2016-11-26 22:04:35 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
 ...

At this point, a basic crawler is created. The following is the implementation of the specific process.

Crawling cartoon pictures

1. Starting address

The starting address of the crawler is:

http://www.xeall.com/shenshi

Our main focus is the comic list in the middle of the page, with the page-number controls below it, as shown in the figure below.

The crawler's main task is to crawl the images of every comic in the list. After the current page has been crawled, it moves on to the next page of the comic list and keeps crawling, looping until all the comics have been processed.

The starting url is placed in the urls array inside the start_requests function. start_requests overrides a method of the parent class and is executed when the crawler task begins.

The core of the start_requests method is this line of code: it specifies the url to request and the callback function, self.parse, to invoke once the request completes.

scrapy.Request(url=url, callback=self.parse)

There is actually another way to implement the previous code:

#coding:utf-8

import scrapy

class Comics(scrapy.Spider):

    name = "comics"
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        self.log(response.body)

start_urls is an attribute provided by the framework: an array containing the urls of the target pages. Once start_urls is set, there is no need to override the start_requests method; the crawler requests each url in start_urls in turn and automatically calls parse as the callback after each request completes.

However, in order to conveniently call other callback functions along the way, the demo uses the former implementation.

2. Crawl the comic URLs

Starting from the first page, we first need to obtain the url of each comic.

Current page comic list

The starting page is the first page of the comic list. We extract the required information from the current page inside the parse callback.

Import the BeautifulSoup library at the top of the file:

from bs4 import BeautifulSoup

The HTML source returned by the request is used to initialize the BeautifulSoup object.

def parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")

The initialization explicitly specifies the html5lib parser; if it is not installed, an error will be reported.

If no parser is specified when initializing BeautifulSoup, the library automatically picks the parser it considers the best match. This is a pitfall: for the source of this particular page the default pick is lxml, and in that case the parsed result has problems that prevent the data we need from being extracted. So whenever extraction results look wrong, print the soup to check whether the parsed document is correct.
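
As a quick illustration, here is a minimal sketch (using a made-up HTML fragment, not part of the spider) of explicitly specifying the parser and printing the soup to verify the parsed tree:

from bs4 import BeautifulSoup

# Hypothetical fragment, just for demonstration
html = "<ul class='listcon'><li><a href='/shenshi/1.html'>some comic</a></li></ul>"

# Explicitly request html5lib, as the spider does; leaving the parser out
# lets BeautifulSoup guess, which may silently pick lxml instead
soup = BeautifulSoup(html, "html5lib")

# Print the soup to check whether the document was parsed as expected
print(soup.prettify())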

Looking at the HTML source, we can see that the comic list displayed on the page lives in a ul tag with class listcon, so the listcon class uniquely identifies the tag we need.

Extract the tag containing the comic list:

listcon_tag = soup.find('ul', class_='listcon')

The find call above locates the ul tag whose class is listcon and returns the whole tag with its contents. We then find all the a tags with an href attribute inside that list tag; each of these a tags corresponds to one comic.

com_a_list = listcon_tag.find_all('a', attrs={'href': True})

After that, the href attribute of each comic is combined into a complete, accessible url and saved to an array.

comics_url_list = []
base = 'http://www.xeall.com'
for tag_a in com_a_list:
    url = base + tag_a['href']
    comics_url_list.append(url)

Now the comics_url_list array contains the url of each comic on the current page.
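
As a side note, an alternative sketch (not how the demo does it) is to build the absolute urls with urllib.parse.urljoin, which also copes with hrefs that are already absolute:

from urllib.parse import urljoin

comics_url_list = [urljoin('http://www.xeall.com/', tag_a['href'])
                   for tag_a in com_a_list]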

3. Next page list

Look at the page-selection control at the bottom of the list; this is where we get the url of the next page.

Get all the a tags with an href attribute inside the page-selection tag:

page_tag = soup.find('ul', class_='pagelist')
page_a_list = page_tag.find_all('a', attrs={'href': True})

This part of the source is shown in the figure below. Among all these a tags, the last one holds the url of the last page and the second-to-last one holds the url of the next page, so page_a_list[-2] gives us the url of the next page.

Note, however, that if the current page is already the last one, there is no next page to fetch. So how do we tell whether the current page is the last page?

We can tell from the select control. Looking at the source, the option tag corresponding to the current page carries the selected attribute. The following figure shows the case where the current page is the first page.

The following figure shows that the current page is the last page

By comparing the current page number with the last page number: if they are the same, the current page is the last page.

select_tag = soup.find('select', attrs={'name': 'sldd'})
option_list = select_tag.find_all('option')

last_option = option_list[-1]
current_option = select_tag.find('option', attrs={'selected': True})

is_last = (last_option.string == current_option.string)

If the current page is not the last one, we apply the same processing to the next page; the request again uses parse as the callback.

if not is_last:
    next_page = 'http://www.xeall.com/shenshi/' + page_a_list[-2]['href']
    if next_page is not None:
        print('\n------ parse next page --------')
        print(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

Process each page in turn in the same way until all page processing is complete.

4. Crawling cartoon pictures

Once all the comic urls on the current page have been extracted in the parse method, each comic can be processed.

After the comics_url_list array is obtained, add the following code below it:

for url in comics_url_list:
    yield scrapy.Request(url=url, callback=self.comics_parse)

This requests the url of each comic, with self.comics_parse as the callback. The comics_parse method handles an individual comic; its implementation is covered in the following sections.
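
Before moving on, putting the snippets from the sections above together, the complete parse method looks roughly like the sketch below (assembled from those pieces to match the logic described here, not copied verbatim from the demo):

def parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")

    # Extract the url of every comic on the current list page
    listcon_tag = soup.find('ul', class_='listcon')
    com_a_list = listcon_tag.find_all('a', attrs={'href': True})
    comics_url_list = []
    base = 'http://www.xeall.com'
    for tag_a in com_a_list:
        comics_url_list.append(base + tag_a['href'])

    # Request each comic; comics_parse handles the individual comic pages
    for url in comics_url_list:
        yield scrapy.Request(url=url, callback=self.comics_parse)

    # Work out whether there is a next list page
    page_tag = soup.find('ul', class_='pagelist')
    page_a_list = page_tag.find_all('a', attrs={'href': True})
    select_tag = soup.find('select', attrs={'name': 'sldd'})
    option_list = select_tag.find_all('option')
    last_option = option_list[-1]
    current_option = select_tag.find('option', attrs={'selected': True})
    is_last = (last_option.string == current_option.string)

    # If not on the last page, request the next list page with parse as the callback
    if not is_last:
        next_page = 'http://www.xeall.com/shenshi/' + page_a_list[-2]['href']
        yield scrapy.Request(next_page, callback=self.parse)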

5. Image on the current page

First, a BeautifulSoup object is constructed from the source returned by the request, in basically the same way as before.

def comics_parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")

Extract the tag of the page-selection control. The page display and the corresponding source are shown below.

Extract the ul tag whose class is pagelist:

page_list_tag = soup.find('ul', class_='pagelist')

Looking at the source, the li tag of the current page has the class thisclass, from which we can obtain the current page number.

current_li = page_list_tag.find('li', class_='thisclass')
page_num = current_li.a.string

The tag of the current page's image and its corresponding source code:

Get the url of the image on the current page and the title of the comic. The title is later used as the name of the folder in which the comic is stored.

li_tag = soup.find('li', id='imgshow')
img_tag = li_tag.find('img')
​
img_url = img_tag['src']
title = img_tag['alt']

6. Save images locally

Once the image url has been extracted, the image can be requested through that url and saved locally.

self.save_img(page_num, title, img_url)

A dedicated method, save_img, is defined for saving images. It is implemented as follows:

# Imports used for saving images
import os
import urllib.request
import zlib

def save_img(self, img_num, title, img_url):
    # Save the image locally
    self.log('saving pic: ' + img_url)

    # Folder in which comics are saved
    document = '/Users/moshuqi/Desktop/cartoon'

    # Each comic gets a folder named after its title
    comics_path = document + '/' + title
    exists = os.path.exists(comics_path)
    if not exists:
        self.log('create document: ' + title)
        os.makedirs(comics_path)

    # Each image is named after its page number
    pic_name = comics_path + '/' + img_num + '.jpg'

    # Skip the download if the image already exists locally
    exists = os.path.exists(pic_name)
    if exists:
        self.log('pic exists: ' + pic_name)
        return

    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }

        req = urllib.request.Request(img_url, headers=headers)
        response = urllib.request.urlopen(req, timeout=30)

        # Data returned by the request
        data = response.read()

        # If the returned data is gzip-compressed, decompress it first
        if response.info().get('Content-Encoding') == 'gzip':
            data = zlib.decompress(data, 16 + zlib.MAX_WBITS)

        # Save the image to the local file
        fp = open(pic_name, "wb")
        fp.write(data)
        fp.close()

        self.log('save image finished:' + pic_name)

    except Exception as e:
        self.log('save image error.')
        self.log(e)

The function takes three parameters: the page number of the current image, the title of the comic, and the url of the image.

The image is saved in a folder named after the comic. If the corresponding folder does not exist yet, one is created; in practice this happens when the first image of a comic is fetched.

document is the local destination folder and can be customized.

Each image is named after its page number with a .jpg extension. If an image with the same name already exists locally, it is not downloaded again; this check mainly matters when the task is restarted, so that images that already exist are not requested again.

The image data returned by the request may be compressed; the compression type can be checked with response.info().get('Content-Encoding'). Gzip-compressed data must first be decompressed with zlib.decompress before being saved locally, otherwise the image cannot be opened.

The general implementation idea is as above, and comments are attached to the code.

7. Next image

Similar to the handling of the comic list pages, within a comic we also need to keep fetching the image on the next page, traversing until the last page is reached.

When the href attribute of the next-page link is '#', the comic has reached its last page.

a_tag_list = page_list_tag.find_all('a')
next_page = a_tag_list[-1]['href']
if next_page == '#':
    self.log('parse comics:' + title + 'finished.')
else:
    next_page = 'http://www.xeall.com/shenshi/' + next_page
    yield scrapy.Request(next_page, callback=self.comics_parse)

If the current page is the last one, traversal of this comic is complete; otherwise the next page is processed in the same way.

yield scrapy.Request(next_page, callback=self.comics_parse)
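
For reference, assembling the snippets from sections 5 to 7 gives a comics_parse roughly like the sketch below (again assembled from the pieces above, not copied verbatim from the demo):

def comics_parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")

    # Current page number from the pagination control
    page_list_tag = soup.find('ul', class_='pagelist')
    current_li = page_list_tag.find('li', class_='thisclass')
    page_num = current_li.a.string

    # Image url and comic title on the current page
    li_tag = soup.find('li', id='imgshow')
    img_tag = li_tag.find('img')
    img_url = img_tag['src']
    title = img_tag['alt']

    # Save the image locally
    self.save_img(page_num, title, img_url)

    # Follow the next page unless its href is '#' (last page)
    a_tag_list = page_list_tag.find_all('a')
    next_page = a_tag_list[-1]['href']
    if next_page == '#':
        self.log('parse comics: ' + title + ' finished.')
    else:
        next_page = 'http://www.xeall.com/shenshi/' + next_page
        yield scrapy.Request(next_page, callback=self.comics_parse)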

8. Operation results

The implementation is now essentially complete. While it runs, you can watch the console output.

Pictures saved to local folder

Scrapy handles requests concurrently while running, so you can see several comics being crawled at the same time.

The target site's resource server seems to be slow, and requests often time out; be patient when running the crawler.
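
If the timeouts become a problem, the request concurrency and timeout behavior can be tuned in the project's settings.py. The snippet below is only a sketch of standard Scrapy settings with arbitrary example values, not values taken from the demo:

# settings.py (example values only)
CONCURRENT_REQUESTS = 8    # lower the default concurrency (16) to go easier on the server
DOWNLOAD_DELAY = 0.5       # pause between requests to the same site
DOWNLOAD_TIMEOUT = 60      # give up on a request after 60 seconds instead of the default 180
RETRY_TIMES = 3            # retry failed requests a few more times than the default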

Finally

This article only covers the most basic usage of the Scrapy framework; there are many more detailed features and configuration options.

For example, FilesPipeline and ImagesPipeline can be used to save downloaded files or images.
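
A typical ImagesPipeline setup is sketched below; this is standard Scrapy configuration rather than part of this demo, and the storage path is just a placeholder:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/Users/moshuqi/Desktop/cartoon'   # where downloaded images are stored

With this enabled, the spider would yield items carrying an image_urls field, and the pipeline downloads those urls and records the results in an images field (ImagesPipeline additionally requires the Pillow library).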

The framework also has built-in XPath selectors for extracting information from pages, which are more efficient than Beautiful Soup, and the crawled results can be modeled with a dedicated Item class. See the official documentation for details.
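
As a quick sketch of the built-in selectors (the XPath expression is an assumption based on the page structure described earlier, not taken from the demo):

# Inside parse(), using Scrapy's own selectors instead of BeautifulSoup
for href in response.xpath("//ul[@class='listcon']//a/@href").extract():
    yield scrapy.Request(response.urljoin(href), callback=self.comics_parse)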

Finally, the complete Demo source code is attached https://github.com/moshuqi/DemoCodes/tree/master/Comics

Original link: https://www.jianshu.com/p/c1704b4dc04d
