I. Top 3000 Personnel List Pages
1) Go to the cnblogs (Blog Park) home page and find the score ranking list. From it we get the blog addresses of the top 3,000 bloggers. (The word cloud analysis later shows that many of these top bloggers have since migrated to personal blogs.)
2) Analyse the page structure: every td in the ranking table is one person.
The first small tag holds the rank.
The a tag holds the nickname and the home page blog address; the user name is cut out of that address.
The second small tag holds the number of posts and the score, which are obtained by splitting its text. (A small standalone sketch of this extraction follows below.)
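To make this concrete, here is a small standalone sketch. The HTML fragment is hypothetical (the real ranking page markup may differ in detail), but it shows how the XPath expressions used in the spider below take one cell apart:

from scrapy import Selector

# Hypothetical markup for a single ranking cell, for illustration only.
SAMPLE_TD = u"""
<table width="90%"><tr><td>
  <small>1.</small>
  <a href="http://www.cnblogs.com/someuser/">SomeNick</a>
  <small>(1234, 5678, 910111)</small>
</td></tr></table>
"""

cell = Selector(text=SAMPLE_TD).xpath("//table[@width='90%']//td")[0]
rank = cell.xpath("./small[1]/text()").extract()[0].split('.')[-2].strip()   # '1'
nick = cell.xpath("./a[1]//text()").extract()[0].strip()                     # 'SomeNick'
user = cell.xpath("./a[1]/@href").extract()[0].split('/')[-2].strip()        # 'someuser'
parts = cell.xpath("./small[2]//text()").extract()[0].lstrip('(').rstrip(')').split(',')
total, score = parts[0].strip(), parts[2].strip()                            # '1234', '910111'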
3) Code: use XPath to pick out the tags and their content, get each blogger's home page address, and send a request for it.
def parse(self, response):
    for i in response.xpath("//table[@width='90%']//td"):
        # rank, e.g. "1."
        top = i.xpath("./small[1]/text()").extract()[0].split('.')[-2].strip()
        nickName = i.xpath("./a[1]//text()").extract()[0].strip()
        # the user name is the last path segment of the home page address
        userName = i.xpath("./a[1]/@href").extract()[0].split('/')[-2].strip()
        # the second small tag holds "(number of posts, ..., score)"
        totalAndScore = i.xpath("./small[2]//text()").extract()[0].lstrip('(').rstrip(')').split(',')
        total = totalAndScore[0].strip()
        score = totalAndScore[2].strip()
        # print(top)
        # print(nickName)
        # print(userName)
        # print(total)
        # print(score)
        # return
        yield scrapy.Request(i.xpath("./a[1]/@href").extract()[0],
                             meta={'page': 1, 'top': top, 'nickName': nickName,
                                   'userName': userName, 'score': score},
                             callback=self.parse_page)
II. Personnel Blog List Pages
1) Page structure: each post's a tag has an id containing "TitleUrl", which gives the address of every post. Paging is done by appending a page parameter to the blog's home page address, e.g. default.aspx?page=2 (as in the code below); changing the number moves through the pages.
2) Code: the pinned-post marker ("[置顶]") is stripped from the title text.
def parse_page(self, response):
    # print(response.meta['nickName'])
    # //a[contains(@id,'TitleUrl')]
    urlArr = response.url.split('default.aspx?')
    if len(urlArr) > 1:
        baseUrl = urlArr[-2]
    else:
        baseUrl = response.url
    list = response.xpath("//a[contains(@id,'TitleUrl')]")
    for i in list:
        item = CnblogsItem()
        item['top'] = int(response.meta['top'])
        item['nickName'] = response.meta['nickName']
        item['userName'] = response.meta['userName']
        item['score'] = int(response.meta['score'])
        item['pageLink'] = response.url
        # strip the pinned-post marker from the title
        item['title'] = i.xpath("./text()").extract()[0].replace(u'[置顶]', '').strip()
        item['articleLink'] = i.xpath("./@href").extract()[0]
        yield item
    # keep paging as long as the current page still has post links
    if len(list) > 0:
        response.meta['page'] += 1
        yield scrapy.Request(baseUrl + 'default.aspx?page=' + str(response.meta['page']),
                             meta={'page': response.meta['page'],
                                   'top': response.meta['top'],
                                   'nickName': response.meta['nickName'],
                                   'userName': response.meta['userName'],
                                   'score': response.meta['score']},
                             callback=self.parse_page)
3) The body of each post is not crawled here, but doing so would be simple: analyse the post page, send one more request for the article link, and take the div with id cnblogs_post_body.
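If you did want the bodies as well, a minimal sketch could look like the method below. This is not part of the original spider: parse_body and the body field are assumptions, and the item class would need an extra body = scrapy.Field().

# Hypothetical extension: instead of yielding the item directly, parse_page
# would yield
#     scrapy.Request(item['articleLink'], meta={'item': item}, callback=self.parse_body)
def parse_body(self, response):
    item = response.meta['item']
    # the post content sits in the div with id "cnblogs_post_body"
    item['body'] = ''.join(
        response.xpath("//div[@id='cnblogs_post_body']//text()").extract()).strip()
    yield item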
III. Data Storage in MongoDB
This part is not difficult. Remember to install pymongo first: pip install pymongo. In total there are over 800,000 articles.
from cnblogs.items import CnblogsItem
import pymongo


class CnblogsPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        dbName = client['cnblogs']
        self.table = dbName['articles']

    def process_item(self, item, spider):
        if isinstance(item, CnblogsItem):
            # insert() works on older pymongo; newer versions prefer insert_one()
            self.table.insert(dict(item))
        return item
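For the pipeline to actually run, it also has to be enabled in the project's settings.py. Assuming the default Scrapy layout, with the class above in cnblogs/pipelines.py, the entry would look roughly like this:

# settings.py (sketch; the module path cnblogs.pipelines is an assumption)
ITEM_PIPELINES = {
    'cnblogs.pipelines.CnblogsPipeline': 300,
}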
IV. Proxies and the Model Class
Using a proxy in Scrapy is very simple: write a custom downloader middleware and set the proxy IP and port on the request.
def process_request(self, request, spider):
    request.meta['proxy'] = 'http://117.143.109.173:80'
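Wrapped into a complete downloader middleware and registered in settings.py, it might look like the sketch below; the class name ProxyMiddleware and the module path cnblogs.middlewares are assumptions, and the proxy address is the example one from above.

# middlewares.py (sketch)
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # route every outgoing request through the proxy
        request.meta['proxy'] = 'http://117.143.109.173:80'

# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'cnblogs.middlewares.ProxyMiddleware': 543,
}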
The model (item) class, which declares the fields to be stored.
class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # rank
    top = scrapy.Field()
    nickName = scrapy.Field()
    userName = scrapy.Field()
    # score
    score = scrapy.Field()
    # address of the list page the post was found on
    pageLink = scrapy.Field()
    # article title
    title = scrapy.Field()
    # link to the article
    articleLink = scrapy.Field()
V. Word Cloud Analysis with wordcloud
Each blogger's article titles are analysed into a word cloud and saved as an image. For how to use wordcloud, refer to other articles on cnblogs.
Two threads are used here: one produces the jieba-segmented text, the other renders the word cloud images. Each word cloud takes about one second to generate.
# coding=utf-8
import sys
import jieba
from wordcloud import WordCloud
import pymongo
import threading
from Queue import Queue
import datetime
import os

reload(sys)
sys.setdefaultencoding('utf-8')


class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        apply(self.func, self.args)


# content thread: fetch titles from MongoDB and segment them with jieba
def getTitle(queue, table):
    for j in range(1, 3001):
        # start = datetime.datetime.now()
        list = table.find({'top': j}, {'title': 1, 'top': 1, 'nickName': 1})
        if list.count() == 0:
            continue
        txt = ''
        for i in list:
            txt += str(i['title']) + '\n'
            name = i['nickName']
            top = i['top']
        txt = ' '.join(jieba.cut(txt))
        queue.put((txt, name, top), 1)
        # print((datetime.datetime.now() - start).seconds)


# image thread: take segmented text off the queue and render word cloud images
def getImg(queue, word):
    for i in range(1, 3001):
        # start = datetime.datetime.now()
        get = queue.get(1)
        word.generate(get[0])
        # drop characters that are not allowed in file names
        name = get[1].replace('<', '').replace('>', '').replace('/', '').replace('\\', '').replace(
            '|', '').replace(':', '').replace('"', '').replace('*', '').replace('?', '')
        word.to_file(
            'wordcloudimgs/' + str(get[2]) + '-' + str(name).decode('utf-8') + '.jpg')
        print(str(get[1]).decode('utf-8') + '\t Generation Success')
        # print((datetime.datetime.now() - start).seconds)


def main():
    client = pymongo.MongoClient(host='127.0.0.1', port=27017)
    dbName = client['cnblogs']
    table = dbName['articles']
    wc = WordCloud(
        font_path='msyh.ttc', background_color='#ccc', width=600, height=600)
    if not os.path.exists('wordcloudimgs'):
        os.mkdir('wordcloudimgs')
    threads = []
    queue = Queue()
    titleThread = MyThread(getTitle, (queue, table))
    imgThread = MyThread(getImg, (queue, wc))
    threads.append(imgThread)
    threads.append(titleThread)
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    main()
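Note that the script above is Python 2 only (the Queue module, apply(), reload(sys)/sys.setdefaultencoding, and the .decode('utf-8') calls). A minimal sketch of the Python 3 replacements for just those pieces, with everything else unchanged:

import threading
from queue import Queue        # Python 3: lower-case module name


class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        # apply() no longer exists in Python 3
        self.func(*self.args)

# reload(sys) / sys.setdefaultencoding('utf-8') are unnecessary in Python 3,
# and strings are already unicode, so the .decode('utf-8') calls can simply be dropped.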
VI. Complete Source Address
https://github.com/hao15239129517/cnblogs