Scrapy Tutorial: A List of Articles by the Top 3000 Bloggers on cnblogs (Blog Garden)

Keywords: Python MongoDB pip github

I. The Top 3000 Bloggers List Page

1) Go to the cnblogs home page and find the blog score ranking list (screenshot omitted here). From it we can get the blog addresses of the top 3,000 bloggers. Incidentally, the analysis shows that many of these top bloggers have already migrated to their own personal blog sites.

2) Analyse the page structure: every td element is one blogger.

The first small tag holds the ranking.

The first a tag holds the nickname; its href is the blogger's home page address, and the username is obtained by slicing that address.

The second small tag holds the number of posts and the score, which can be picked out one by one after splitting the string.

3) Code: use XPath to extract the tags and their content, get each blogger's home page address, and send a request for it.

def parse(self, response):
        for i in response.xpath("//table[@width='90%']//td"):
            top = i.xpath(
                "./small[1]/text()").extract()[0].split('.')[-2].strip()
            nickName = i.xpath("./a[1]//text()").extract()[0].strip()
            userName = i.xpath(
                "./a[1]/@href").extract()[0].split('/')[-2].strip()
            totalAndScore = i.xpath(
                "./small[2]//text()").extract()[0].lstrip('(').rstrip(')').split(',')
            total = totalAndScore[0].strip()
            score = totalAndScore[2].strip()
#             print(top)
#             print(nickName)
#             print(userName)
#             print(total)
#             print(score)
#             return
            yield scrapy.Request(i.xpath("./a[1]/@href").extract()[0], meta={'page': 1, 'top': top, 'nickName': nickName, 'userName': userName, 'score': score},
                                 callback=self.parse_page)
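
For context, parse (and parse_page below) are methods of a Scrapy spider class. A minimal skeleton might look like the sketch below; the spider name and the ranking-page URL are assumptions here, and the real values are in the repository linked at the end.

import scrapy

from cnblogs.items import CnblogsItem


class CnblogsSpider(scrapy.Spider):
    # Placeholder name and start URL; take the real ranking-page URL from the repository
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/AllBloggers.aspx']

    # The parse() and parse_page() methods shown in this article go here.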

II. Each Blogger's Post List Pages

1) Page structure: through analysis, the id of each post's a tag contains "TitleUrl", which gives the address of every post. To page through the list, append default.aspx?page=2 to the blog address and increment the page number.

2) Code: the '[置顶]' ("pinned") marker is stripped from the post titles.

def parse_page(self, response):
        #         print(response.meta['nickName'])
        #//a[contains(@id,'TitleUrl')]
        urlArr = response.url.split('default.aspx?')
        if len(urlArr) > 1:
            baseUrl = urlArr[-2]
        else:
            baseUrl = response.url
        list = response.xpath("//a[contains(@id,'TitleUrl')]")
        for i in list:
            item = CnblogsItem()
            item['top'] = int(response.meta['top'])
            item['nickName'] = response.meta['nickName']
            item['userName'] = response.meta['userName']
            item['score'] = int(response.meta['score'])
            item['pageLink'] = response.url
            # Strip the '[置顶]' (pinned) marker from the title
            item['title'] = i.xpath(
                "./text()").extract()[0].replace(u'[置顶]', '').strip()
            item['articleLink'] = i.xpath("./@href").extract()[0]
            yield item
        if len(list) > 0:
            response.meta['page'] += 1
            yield scrapy.Request(baseUrl + 'default.aspx?page=' + str(response.meta['page']), meta={'page': response.meta['page'], 'top': response.meta['top'], 'nickName': response.meta['nickName'], 'userName': response.meta['userName'],  'score': response.meta['score']}, callback=self.parse_page)

3) The content of each post is not crawled here. It would be simple, too: analyse the page, send one more request per article, and extract the div with id cnblogs_post_body.
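
If you did want to crawl the article bodies, a minimal sketch (not part of the original project) would yield one more request per article from parse_page and extract that div in an extra callback; parse_body and the content field below are hypothetical names.

# Sketch only: in parse_page, replace 'yield item' with
#     yield scrapy.Request(item['articleLink'], meta={'item': item},
#                          callback=self.parse_body)
def parse_body(self, response):
    item = response.meta['item']
    # Join all text nodes inside the post body div
    item['content'] = ''.join(
        response.xpath("//div[@id='cnblogs_post_body']//text()").extract())
    yield item

The hypothetical content field would also have to be added to CnblogsItem.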

III. Data Storage in MongoDB

This part is not difficult; just remember to install pymongo first (pip install pymongo). In total, over 800,000 articles were stored.
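
A quick way to check that count afterwards, as a sketch using the same connection settings as the pipeline below:

import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
# Count all documents stored in the cnblogs.articles collection
print(client['cnblogs']['articles'].count_documents({}))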

from cnblogs.items import CnblogsItem
import pymongo


class CnblogsPipeline(object):

    def __init__(self):
        # Connect to the local MongoDB instance and use the cnblogs.articles collection
        client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        dbName = client['cnblogs']
        self.table = dbName['articles']

    def process_item(self, item, spider):
        if isinstance(item, CnblogsItem):
            # Store each article as one document in MongoDB
            self.table.insert_one(dict(item))
            return item
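
For the pipeline to actually run, it also has to be registered in the project's settings.py. A minimal sketch, assuming the default Scrapy layout where the class above lives in cnblogs/pipelines.py:

# settings.py (sketch): register the MongoDB pipeline
ITEM_PIPELINES = {
    'cnblogs.pipelines.CnblogsPipeline': 300,
}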

IV. Proxies and the Item (Model) Class

Using a proxy in Scrapy is very simple: customize a downloader middleware and set the proxy IP and port on each request.

def process_request(self, request, spider):
        request.meta['proxy'] = 'http://117.143.109.173:80'
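
The method above belongs to a custom downloader middleware class, which then has to be enabled in settings.py. A minimal sketch; the class name ProxyMiddleware and the module path are assumptions, not taken from the project, and the proxy address is just the example from the article.

# middlewares.py (sketch): wrap the method above in a downloader middleware class
class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Route every outgoing request through the fixed HTTP proxy
        request.meta['proxy'] = 'http://117.143.109.173:80'


# settings.py (sketch): enable the custom proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'cnblogs.middlewares.ProxyMiddleware': 543,
}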

The model (Item) class stores the corresponding fields.

class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Ranking
    top = scrapy.Field()
    nickName = scrapy.Field()
    userName = scrapy.Field()
    # Score (points)
    score = scrapy.Field()
    # URL of the list page the article was found on
    pageLink = scrapy.Field()
    # Article title
    title = scrapy.Field()
    # Link to the article
    articleLink = scrapy.Field()

V. Word Cloud Analysis with wordcloud

Each blogger's article titles are analysed with a word cloud and saved as an image. For how to use wordcloud, see other articles on cnblogs.

Multithreading is used here: one thread generates the jieba-segmented text, and another thread generates the word cloud images. Each word cloud takes about one second to generate. Note that the script below is Python 2 code (it relies on reload(sys), the Queue module and apply()).

# coding=utf-8
import sys
import jieba
from wordcloud import WordCloud
import pymongo
import threading
from Queue import Queue
import datetime
import os
reload(sys)
sys.setdefaultencoding('utf-8')


class MyThread(threading.Thread):

    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        apply(self.func, self.args)
# Thread that fetches titles from MongoDB and puts jieba-segmented text on the queue


def getTitle(queue, table):
    for j in range(1, 3001):
        #         start = datetime.datetime.now()
        list = table.find({'top': j}, {'title': 1, 'top': 1, 'nickName': 1})
        if list.count() == 0:
            continue
        txt = ''
        for i in list:
            txt += str(i['title']) + '\n'
            name = i['nickName']
            top = i['top']
        txt = ' '.join(jieba.cut(txt))
        queue.put((txt, name, top), 1)
#         print((datetime.datetime.now() - start).seconds)


def getImg(queue, word):
    for i in range(1, 3001):
        #         start = datetime.datetime.now()
        get = queue.get(1)
        word.generate(get[0])
        name = get[1].replace('<', '').replace('>', '').replace('/', '').replace('\\', '').replace(
            '|', '').replace(':', '').replace('"', '').replace('*', '').replace('?', '')
        word.to_file(
            'wordcloudimgs/' + str(get[2]) + '-' + str(name).decode('utf-8') + '.jpg')
        print(str(get[1]).decode('utf-8') + '\t Generation Success')
#         print((datetime.datetime.now() - start).seconds)


def main():
    client = pymongo.MongoClient(host='127.0.0.1', port=27017)
    dbName = client['cnblogs']
    table = dbName['articles']
    wc = WordCloud(
        font_path='msyh.ttc', background_color='#ccc', width=600, height=600)
    if not os.path.exists('wordcloudimgs'):
        os.mkdir('wordcloudimgs')
    threads = []
    queue = Queue()
    titleThread = MyThread(getTitle, (queue, table))
    imgThread = MyThread(getImg, (queue, wc))
    threads.append(imgThread)
    threads.append(titleThread)

    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()

VI. Complete Source Address

  https://github.com/hao15239129517/cnblogs

Posted by Notoriouswow on Sat, 22 Jun 2019 15:50:40 -0700