Introduction to Python Crawler [18]: 36 Kr (36 kr) data grabbing scrapy

1. 36 Kr data - written in front

Today, grabbing a news media, 36kr article content, is also to prepare for the subsequent data analysis.

36kr allows some people to see the future first, and what you are going to do today is really to grasp its past.

2. 36 Kr data - Data Analysis

36kr page is a waterfall flow effect, when you constantly drop down the page, the data from the background is added, based on this, we can basically judge that it is ajax asynchronous data, just open the developer tools, you can quickly locate the desired data, let's try!

The capture links are as follows

https://36kr.com/api/search-column/mainsite?per_page=20&page=1&_=1543840108547
https://36kr.com/api/search-column/mainsite?per_page=20&page=2&_=1543840108547
https://36kr.com/api/search-column/mainsite?per_page=20&page=3&_=1543840108547
https://36kr.com/api/search-column/mainsite?per_page=20&page=4&_=1543840108547

After many attempts, we found that per_page can be extended to 300, but when the data is greater than 100, the returned data is not very ideal, so we plan to set it as 100, page is the page number, which can be repeated.

The above parameter also has a more important value, called total_count, for the total number of articles. With this parameter, we can quickly stitch out the desired page number.

3. 36 Kr data - creating scrapy project

scrapy startproject kr36

4. 36 Kr data - creating crawler entry pages

scrapy genspider Kr36 "www.gaokaopai.com"
Python Resource sharing qun 784758214 ,Installation packages are included. PDF，Learning videos, here is Python The gathering place of learners, zero foundation and advanced level are all welcomed.

5. 36 Kr data - writing url generator

The start_urls of the page is the first page of data, and then the parse function is called. In the content of the function, we get the parameter total_count.
Here, you need to note that the return data of yield is Request(), for a detailed description of him, see
https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/request-response.html

All parameter lists and parameter names are well-named and basically represent all the meanings. More importantly, url and callback

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

class Kr36Spider(scrapy.Spider):
    name = 'Kr36'
    allowed_domains = ['36kr.com']

    start_urls = ['https://36kr.com/api/search-column/mainsite?per_page=100&page=1&_=']
    def parse(self, response):
        data = json.loads(response.body_as_unicode())
        totle = int(data["data"]["total_count"])
        #totle = 201

        for page in range(2,int(totle/100)+2):
            print("Crawling{}page".format(page),end="")
            yield Request("https://36kr.com/api/search-column/mainsite?per_page=100&page={}&_=".format(str(page)), callback=self.parse_item)

6. 36 Kr data - parsing data

In the process of parsing data, it is found that sometimes there are missing data, so it is necessary to judge whether app_views_count, mobile_views_count, views_count, favorite_num appear in the dictionary.

Notice the Kr36Item class in the following code, which needs to be created in advance

Kr36Item

class Kr36Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    app_views_count = scrapy.Field() # APP View Quantity
    mobile_views_count = scrapy.Field() # Number of mobile viewers
    views_count = scrapy.Field() # PC View Quantity
    column_name = scrapy.Field() # category
    favourite_num = scrapy.Field() # Collection Quantity
    title = scrapy.Field() # Title
    published_at = scrapy.Field() # Release time
    is_free = scrapy.Field() # Free or not
    username = scrapy.Field()

    def parse_item(self,response):

        data = json.loads(response.body_as_unicode())
        item = Kr36Item()
        for one_item in data["data"]["items"]:
            print(one_item)
            item["app_views_count"] = one_item["app_views_count"] if "app_views_count" in one_item else 0# APP View Quantity
            item["mobile_views_count"] = one_item["mobile_views_count"]  if "mobile_views_count" in one_item else 0 # Number of mobile viewers
            item["views_count"] = one_item["views_count"]  if "views_count" in one_item else 0  # PC View Quantity
            item["column_name"] = one_item["column_name"]  # category
            item["favourite_num"] = one_item["favourite_num"]  if "favourite_num" in one_item else 0  # Collection Quantity
            item["title"] = one_item["title"] # Title
            item["published_at"] = one_item["published_at"]  # Release time
            item["is_free"] = one_item["is_free"] if "is_free" in one_item else 0# Free or not
            item["username"] = json.loads(one_item["user_info"])["name"]
            yield item

Finally, open pipelines in settings.py to write data persistence code

ITEM_PIPELINES = {
   'kr36.pipelines.Kr36Pipeline': 300,
}

import os
import csv

class Kr36Pipeline(object):
    def __init__(self):
        store_file = os.path.dirname(__file__)+'/spiders/36kr.csv'
        self.file = open(store_file,"a+",newline="",encoding="utf_8_sig")
        self.writer = csv.writer(self.file)
    def process_item(self, item, spider):
        try:
            self.writer.writerow((
                item["title"],
                item["app_views_count"],
                item["mobile_views_count"],
                item["views_count"],
                item["column_name"],
                item["favourite_num"],
                item["published_at"],
                item["is_free"],
                item["username"]
            ))
            print("Data Storage Completed")
        except Exception as e:
            print(e.args)

    def close_spider(self,spider):
        self.file.close()
Python Resource sharing qun 784758214 ,Installation packages are included. PDF，Learning videos, here is Python The gathering place of learners, zero foundation and advanced level are all welcomed.

7. 36 krypton (36kr) data - Data Acquisition

Running the above code has not done too much processing, nor has it adjusted the concurrency speed, nor has it done anti-crawling measures. Running for a while, we got about 69936 pieces of data, which is more than 300 pieces worse than the forecast. It's not a big problem. The reason is not carefully investigated.

Posted by TheHaloEffect on Fri, 26 Jul 2019 02:14:24 -0700

Programmer Group