Using the Newspaper3k framework to quickly extract article information

Keywords: Python, GitHub, Google

Introduction to the Framework

Newspaper is a Python 3 library for extracting articles from news sites. It is not well suited to production-grade news crawling: the framework is unstable, and various bugs show up during crawling, such as URLs that cannot be accessed or news content that cannot be retrieved. But for anyone who just wants to collect a news corpus, it is worth a try: it is easy to use and does not require much knowledge about crawlers.

This is Newspaper's GitHub link:

https://github.com/codelucas/newspaper

This is a link to the Newspaper documentation:

https://newspaper.readthedocs.io/en/latest/

This is the link to Newspaper Quick Start:

https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html

Installation method:

pip3 install newspaper3k

Features

The main features are:

Multi-threaded article download framework
News URL identification
Text extraction from HTML
Top image extraction from HTML
All image extraction from HTML
Keyword extraction from text
Summary extraction from text
Author extraction from text
Google trending terms extraction
Works in more than 10 languages (English, Chinese, German, Arabic...)
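
The first feature in the list, multi-threaded downloading, is easiest to picture with a small sketch. Newspaper ships its own thread pool, described in its documentation; the version below deliberately uses only the standard library to show the general idea. The names fetch_all and fetch_title are hypothetical helpers, not part of Newspaper, and fetch_title assumes the Article API shown later in this post.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_one, max_workers=4):
    """Apply fetch_one to every URL in parallel and map each URL to its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input URLs
        results = list(pool.map(fetch_one, urls))
    return dict(zip(urls, results))


def fetch_title(url):
    """Hypothetical worker: download one article and return its title."""
    # Imported here so fetch_all stays usable without Newspaper installed.
    from newspaper import Article

    article = Article(url, language="zh")
    article.download()
    article.parse()
    return article.title
```

With a list of article URLs, fetch_all(urls, fetch_title) would return a dict of titles keyed by URL. An exception in any worker is re-raised when the results are collected, so a real crawler would probably want error handling inside the worker.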

Usage:

1. Building a News Source

import newspaper
web_paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh", memoize_articles=False)

Note on article caching: by default, newspaper caches all previously extracted articles and skips any article it has already extracted. This prevents duplicate articles and speeds up extraction. You can opt out of this behavior with the memoize_articles parameter, as done above with memoize_articles=False.
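
To make the caching idea concrete, here is a minimal sketch of what "remember what was already extracted" can look like. This only illustrates the concept and is not Newspaper's internal implementation; the function name filter_new_urls and the JSON cache file are assumptions made for the example.

```python
import json
from pathlib import Path


def filter_new_urls(urls, cache_file):
    """Return the URLs not seen in earlier runs, then record them in cache_file."""
    path = Path(cache_file)
    # Load the set of URLs remembered from previous runs, if any
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    # dict.fromkeys removes duplicates within this run while keeping order
    fresh = [u for u in dict.fromkeys(urls) if u not in seen]
    path.write_text(json.dumps(sorted(seen.union(fresh))))
    return fresh
```

Calling it twice with overlapping URL lists returns only the new URLs on the second call, which is roughly the effect you get when memoize_articles is left switched on.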

2. Extracting Article URLs

for article in web_paper.articles:
    print(article.url)
output:
http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019101119998.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100919989.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100819980.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919940.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919933.html
....
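
If the list also contains links outside the section you care about, a small filter can keep only URLs under the section path. This is a sketch using the standard library; urls_under is a hypothetical helper, not a Newspaper function.

```python
from urllib.parse import urlparse


def urls_under(urls, base_url):
    """Keep only URLs on the same host as base_url and below its path."""
    base = urlparse(base_url)
    kept = []
    for url in urls:
        parts = urlparse(url)
        if parts.netloc == base.netloc and parts.path.startswith(base.path):
            kept.append(url)
    return kept
```

For example, urls_under([a.url for a in web_paper.articles], "http://www.sxdi.gov.cn/gzdt/jlsc/") would drop anything that is not under /gzdt/jlsc/.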

3. Extracting Source Categories

for category in web_paper.category_urls():
    print(category)
output:
http://www.sxdi.gov.cn/gzdt/jlsc/....

4. Extracting Source Feeds

for feed_url in web_paper.feed_urls():
    print(feed_url)

5. Extracting Source Brand and Description

print(web_paper.brand)  # brand
print(web_paper.description)  # description
print("Total articles: %s" % web_paper.size())  # number of articles

6. Downloading an Article

from newspaper import Article
article = Article("http://www.sol.com.cn/", language='zh')  # language='zh' selects Chinese
article.download()

7. Parsing the Article and Extracting Information

article.parse()  # parse the downloaded page
print("title=", article.title)  # article title
print("author=", article.authors)  # list of authors
print("publish_date=", article.publish_date)  # publication date
print("top_image=", article.top_image)  # URL of the article's top image
print("movies=", article.movies)  # video links in the article
print("text=", article.text, "\n")  # body text
article.nlp()  # keyword and summary extraction; must run after parse()
print('keywords=', article.keywords)  # keywords extracted from the text
print("summary=", article.summary)  # summary of the article
print("images=", article.images)  # all image URLs extracted from the HTML
print("imgs=", article.imgs)  # alias of article.images
print("html=", article.html)  # raw HTML of the page
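
Since the framework is unstable and some URLs fail, it helps to wrap download and parse in a small retry helper. The sketch below is one possible approach, not something Newspaper provides: download_and_parse is a hypothetical name, it works on any object with download() and parse() methods, and it catches all exceptions on purpose because the exact exception raised can depend on the library version.

```python
import time


def download_and_parse(article, retries=2, delay=1.0):
    """Call article.download() and article.parse(), retrying on failure.

    Returns True on success and False if every attempt failed.
    """
    for attempt in range(retries + 1):
        try:
            article.download()
            article.parse()
            return True
        except Exception as exc:  # broad on purpose: this is only a sketch
            print("attempt %d failed: %s" % (attempt + 1, exc))
            if attempt < retries:
                time.sleep(delay)  # wait a little before trying again
    return False
```

A crawler loop can then skip an article when download_and_parse(article) returns False instead of stopping at the first bad URL.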

A simple complete example:

import newspaper
from newspaper import Article

def spider_newspaper_url(url):
    """
    Build a news source from url and process every article found on it.

    By default, newspaper caches previously extracted articles and skips them;
    memoize_articles=False turns this caching off.
    """
    web_paper = newspaper.build(url, language="zh", memoize_articles=False)
    print("Extracting news page URLs...")
    for article in web_paper.articles:
        # Print the news page URL
        print("News page url:", article.url)
        # Call spider_newspaper_information to retrieve the news page data
        spider_newspaper_information(article.url)

    print("Total articles: %s" % web_paper.size())  # number of articles


def spider_newspaper_information(url):
    """Download one article and print its information."""
    # Create the Article and download the page
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()
    article.nlp()  # needed for article.summary; without it the summary is empty

    # Print information about the article
    print("title=", article.title)  # article title
    print("author=", article.authors)  # list of authors
    print("publish_date=", article.publish_date)  # publication date
    # print("top_image=", article.top_image)  # URL of the article's top image
    # print("movies=", article.movies)  # video links in the article
    print("text=", article.text, "\n")  # body text
    print("summary=", article.summary)  # summary of the article


if __name__ == "__main__":
    web_lists = ["http://www.sxdi.gov.cn/gzdt/jlsc/","http://www.people.com.cn/GB/59476/"]
    for web_list in web_lists:
        spider_newspaper_url(web_list)
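
For building a corpus, printing is usually not enough; the fields need to be saved. Below is a minimal sketch that turns a parsed article into a plain dict and appends it to a JSON Lines file. The helper names article_to_record and append_jsonl, and the choice of JSON Lines, are assumptions for illustration; only the attribute names (url, title, authors, publish_date, text) come from the examples above.

```python
import json


def article_to_record(article):
    """Copy the main fields of a parsed article into a JSON-friendly dict."""
    date = article.publish_date
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        # publish_date may be missing; store it as text when present
        "publish_date": str(date) if date else None,
        "text": article.text,
    }


def append_jsonl(records, path):
    """Append each record to path as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps Chinese text readable in the file
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Inside spider_newspaper_information, a call such as append_jsonl([article_to_record(article)], "corpus.jsonl") after article.parse() would store each article as it is processed.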

Posted by Ryanmcgrim on Mon, 14 Oct 2019 19:48:44 -0700
