Keyword extraction. The pynlpir library implements keyword extraction.
```python
# coding:utf-8
import pynlpir

pynlpir.open()
# pynlpir is designed for Chinese text; in practice s would be a Chinese query.
s = 'How to delete the junk files in the computer'
# weighted=True returns (keyword, weight) pairs
key_words = pynlpir.get_key_words(s, weighted=True)
for key_word in key_words:
    print(key_word[0], '\t', key_word[1])
pynlpir.close()
```
Baidu's search interface is a simple URL: https://www.baidu.com/s?wd= followed by the query keywords, for example https://www.baidu.com/s?wd=machine learning data mining information retrieval.
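As a small illustration (not part of the original scripts), the keywords can be joined and URL-encoded with the standard library before being handed to the crawler; the keyword list below is just a placeholder:

```python
# coding:utf-8
from urllib.parse import urlencode

# Placeholder keywords; in practice they come from the keyword-extraction step above.
keywords = ['machine learning', 'data mining', 'information retrieval']
search_url = 'https://www.baidu.com/s?' + urlencode({'wd': ' '.join(keywords)})
print(search_url)  # https://www.baidu.com/s?wd=machine+learning+data+mining+information+retrieval
```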
Install scrapy: pip install scrapy. Create a scrapy project: scrapy startproject baidu_search. Then create the spider file baidu_search/baidu_search/spiders/baidu_search.py.
```python
# coding:utf-8
import scrapy


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=computer junk file deletion"
    ]

    def parse(self, response):
        # Save the raw search-result page so its structure can be inspected.
        filename = "result.html"
        with open(filename, 'wb') as f:
            f.write(response.body)
```
Modify the settings.py file: set ROBOTSTXT_OBEY = False, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', and DOWNLOAD_TIMEOUT = 5.
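In settings.py those changes look like the following sketch (the user-agent string is just the example above; any modern browser UA works):

```python
# settings.py -- only the options changed for this crawler
ROBOTSTXT_OBEY = False        # Baidu's robots.txt would otherwise block the spider
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/50.0.2661.102 Safari/537.36')
DOWNLOAD_TIMEOUT = 5          # give up quickly on slow result pages
```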
Go to the baidu_search/baidu_search/ directory and run scrapy crawl baidu_search. It generates result.html, confirming that the page is grabbed correctly.
Corpus extraction. The search results are only an index; the real content sits behind the result links, so the fetched page has to be parsed further. Each result link is embedded in the href attribute of the a tag inside the h3 of a div with class="c-container". Those URLs are added to the crawl queue. While the link is extracted, the title and abstract are also extracted (tags removed) and passed along in request.meta to the parse_url callback, which receives both values after the linked page is fetched and extracts its text content. The complete record for each result is: url, title, abstract, content.
```python
# coding:utf-8
import scrapy
from scrapy.utils.markup import remove_tags


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=computer junk file deletion"
    ]

    def parse(self, response):
        # Each search result sits in a div with class "c-container";
        # the link is the href of the a tag under its h3.
        containers = response.selector.xpath('//div[contains(@class, "c-container")]')
        for container in containers:
            href = container.xpath('h3/a/@href').extract()[0]
            title = remove_tags(container.xpath('h3/a').extract()[0])
            c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract()
            abstract = ""
            if len(c_abstract) > 0:
                abstract = remove_tags(c_abstract[0])
            # Follow the result link; pass title and abstract along in meta
            # so parse_url can put together the complete record.
            request = scrapy.Request(href, callback=self.parse_url)
            request.meta['title'] = title
            request.meta['abstract'] = abstract
            yield request

    def parse_url(self, response):
        print(len(response.body))
        print("url:", response.url)
        print("title:", response.meta['title'])
        print("abstract:", response.meta['abstract'])
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        print("content_len:", len(content))
```
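The parse_url callback above only prints what it finds. As a sketch (an assumption, not shown in the original post), it could instead yield the complete record as an item so Scrapy can export it, e.g. with scrapy crawl baidu_search -o corpus.json; the method below would replace parse_url in the spider above:

```python
    def parse_url(self, response):
        # Strip tags from the whole body to obtain the plain-text content.
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        # Yield the complete record: url, title, abstract, content.
        yield {
            'url': response.url,
            'title': response.meta['title'],
            'abstract': response.meta['abstract'],
            'content': content,
        }
```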
Recommendations for machine learning jobs in Shanghai are welcome; my WeChat: qingxingfengzi.