Keyword extraction. The pynlpir library implements keyword extraction.
```python
# coding:utf-8
import pynlpir

pynlpir.open()
# pynlpir is designed for Chinese text; in practice s would be a Chinese query.
s = 'How to delete the junk files in the computer'
# weighted=True returns (keyword, weight) pairs
key_words = pynlpir.get_key_words(s, weighted=True)
for key_word in key_words:
    print(key_word[0], '\t', key_word[1])
pynlpir.close()
```
Baidu's search interface is a simple URL: https://www.baidu.com/s?wd= followed by the query keywords, for example https://www.baidu.com/s?wd=machine learning data mining information retrieval.
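As a small illustration (not part of the original scripts), the keywords can be joined and URL-encoded with the standard library before being handed to the crawler; the keyword list below is just a placeholder:

```python
# coding:utf-8
from urllib.parse import urlencode

# Placeholder keywords; in practice they come from the keyword-extraction step above.
keywords = ['machine learning', 'data mining', 'information retrieval']
search_url = 'https://www.baidu.com/s?' + urlencode({'wd': ' '.join(keywords)})
print(search_url)  # https://www.baidu.com/s?wd=machine+learning+data+mining+information+retrieval
```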
Install scrapy: pip install scrapy. Create a scrapy project: scrapy startproject baidu_search. Then create the spider file baidu_search/baidu_search/spiders/baidu_search.py.
```python
# coding:utf-8
import scrapy


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=computer junk file deletion"
    ]

    def parse(self, response):
        # Save the raw search-result page so its structure can be inspected.
        filename = "result.html"
        with open(filename, 'wb') as f:
            f.write(response.body)
```
Modify the settings.py file: set ROBOTSTXT_OBEY = False, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', and DOWNLOAD_TIMEOUT = 5.
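In settings.py those changes look like the following sketch (the user-agent string is just the example above; any modern browser UA works):

```python
# settings.py -- only the options changed for this crawler
ROBOTSTXT_OBEY = False        # Baidu's robots.txt would otherwise block the spider
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/50.0.2661.102 Safari/537.36')
DOWNLOAD_TIMEOUT = 5          # give up quickly on slow result pages
```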
Go to the baidu_search/baidu_search/ directory and run scrapy crawl baidu_search. It generates result.html, confirming that the page is grabbed correctly.
Corpus extraction. The search results are only an index; the real content sits behind the result links, so the fetched page has to be parsed further. Each result link is embedded in the href attribute of the a tag inside the h3 of a div with class="c-container". Those URLs are added to the crawl queue. While the link is extracted, the title and abstract are also extracted (tags removed) and passed along in request.meta to the parse_url callback, which receives both values after the linked page is fetched and extracts its text content. The complete record for each result is: url, title, abstract, content.
```python
# coding:utf-8
import scrapy
from scrapy.utils.markup import remove_tags


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=computer junk file deletion"
    ]

    def parse(self, response):
        # Each search result sits in a div with class "c-container";
        # the link is the href of the a tag under its h3.
        containers = response.selector.xpath('//div[contains(@class, "c-container")]')
        for container in containers:
            href = container.xpath('h3/a/@href').extract()[0]
            title = remove_tags(container.xpath('h3/a').extract()[0])
            c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract()
            abstract = ""
            if len(c_abstract) > 0:
                abstract = remove_tags(c_abstract[0])
            # Follow the result link; pass title and abstract along in meta
            # so parse_url can put together the complete record.
            request = scrapy.Request(href, callback=self.parse_url)
            request.meta['title'] = title
            request.meta['abstract'] = abstract
            yield request

    def parse_url(self, response):
        print(len(response.body))
        print("url:", response.url)
        print("title:", response.meta['title'])
        print("abstract:", response.meta['abstract'])
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        print("content_len:", len(content))
```
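The parse_url callback above only prints what it finds. As a sketch (an assumption, not shown in the original post), it could instead yield the complete record as an item so Scrapy can export it, e.g. with scrapy crawl baidu_search -o corpus.json; the method below would replace parse_url in the spider above:

```python
    def parse_url(self, response):
        # Strip tags from the whole body to obtain the plain-text content.
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        # Yield the complete record: url, title, abstract, content.
        yield {
            'url': response.url,
            'title': response.meta['title'],
            'abstract': response.meta['abstract'],
            'content': content,
        }
```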
Recommendations for machine learning jobs in Shanghai are welcome; my WeChat: qingxingfengzi.