Using the Newspaper3k framework to quickly capture article information

Keywords: Python github Google

Introduction to the Framework

Newspaper is a Python 3 library. The framework is not robust enough for production news crawling: it is unstable, and various bugs surface during a crawl, such as unreachable URLs or missing article data. However, for anyone who just wants to collect a news corpus, it is easy to try, easy to use, and requires very little knowledge of web crawlers.

This is Newspaper's GitHub link:

https://github.com/codelucas/newspaper

This is a link to the Newspaper documentation:

https://newspaper.readthedocs.io/en/latest/

This is the link to Newspaper Quick Start:

https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html

Installation method:
pip3 install newspaper3k
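
The project README also recommends downloading the NLTK corpora that the keyword and summary features rely on. A one-off setup sketch (nltk is pulled in as a newspaper3k dependency; punkt is one of the corpora the project's download script fetches):

import nltk
nltk.download("punkt")  # sentence tokenizer data used by article.nlp()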

Features

The main functions are as follows:

  • Multithreaded article download framework (see the sketch after this list)
  • News URL identification
  • Text extraction from HTML
  • Top-image extraction from HTML
  • Extraction of all images from HTML
  • Keyword extraction from text
  • Summary extraction from text
  • Author extraction from text
  • Google Trends term extraction (also in the sketch below)
  • Support for 10+ languages (English, Chinese, German, Arabic, ...)
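
A minimal sketch of the multithreaded download and Google Trends features above, following the official quick start (the source URLs here are placeholders):

import newspaper
from newspaper import news_pool

# build a few sources (placeholder URLs)
papers = [newspaper.build(u, memoize_articles=False)
          for u in ["https://edition.cnn.com", "https://slate.com"]]

# download every article of every source concurrently, 2 threads per source
news_pool.set(papers, threads_per_source=2)
news_pool.join()
print(papers[0].articles[0].html[:100])  # the HTML is now downloaded

print(newspaper.hot())           # terms currently trending on Google
print(newspaper.popular_urls())  # popular news source URLs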

Usage:

1. Building a news source
import newspaper
web_paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh", memoize_articles=False)

Note on article caching: by default, newspaper caches all previously extracted articles and skips any article it has already seen. This prevents duplicate articles and speeds up repeated runs. You can opt out of this behavior with the memoize_articles parameter, as the call above does; the sketch below shows the difference.
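
A small sketch of the difference (article counts depend on what the site publishes between runs):

import newspaper

# first build with caching on: every discovered article counts as fresh
paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh")
print(paper.size())  # N fresh articles

# second build with caching on: already-seen articles are skipped,
# so the count can drop to 0 if nothing new was published
paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh")
print(paper.size())

# memoize_articles=False disables the cache and returns everything again
paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/",
                        language="zh", memoize_articles=False)
print(paper.size())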
2. Extracting article URLs
for article in web_paper.articles:
    print(article.url)
output:
http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019101119998.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100919989.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100819980.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919940.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919933.html
.... 
3. Extracting Source Categories
for category in web_paper.category_urls():
    print(category)
output:
http://www.sxdi.gov.cn/gzdt/jlsc/....
4. Extracting source feeds (RSS)
for feed_url in web_paper.feed_urls():
    print(feed_url)

5. Extracting Source Brand and Description

print(web_paper.brand)  # brand (the site's short name)
print(web_paper.description)  # description of the source
print("Fetched %s articles in total" % web_paper.size())  # number of articles
6. Download articles
from newspaper import Article
article = Article("http://www.sol.com.cn/", language='zh')  # Chinese
article.download()

7. Analyze the article and extract the desired information
article.parse()  # parse the downloaded page
print("title=", article.title)  # title of the article
print("author=", article.authors)  # authors of the article
print("publish_date=", article.publish_date)  # publication date
print("top_image=", article.top_image)  # URL of the article's top image
print("movies=", article.movies)  # video links found in the article
print("text=", article.text, "\n")  # full text of the article
article.nlp()  # run the NLP step; required before keywords/summary
print("keywords=", article.keywords)  # keywords extracted from the text
print("summary=", article.summary)  # summary of the article
print("images=", article.images)  # all image URLs extracted from the HTML
print("imgs=", article.imgs)  # alias of article.images
print("html=", article.html)  # raw HTML of the page
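
To keep the extracted fields, one option is to collect them into a dict and serialize to JSON. A minimal sketch using only the attributes shown above (the helper name article_to_dict is ours, not part of the library):

import json
from newspaper import Article

def article_to_dict(url):
    """Fetch one article and return its key fields as a plain dict."""
    article = Article(url, language="zh")
    article.download()
    article.parse()
    article.nlp()  # must run after parse(); fills keywords and summary
    return {
        "url": url,
        "title": article.title,
        "authors": article.authors,
        "publish_date": str(article.publish_date),
        "keywords": article.keywords,
        "summary": article.summary,
        "text": article.text,
    }

# print(json.dumps(article_to_dict("http://www.sol.com.cn/"), ensure_ascii=False))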
A complete example:
import newspaper
from newspaper import Article

def spider_newspaper_url(url):
    """
    By default, newspaper caches all previously extracted articles and skips
    any article it has already seen; pass memoize_articles=False to disable
    this behavior.
    """
    web_paper = newspaper.build(url, language="zh", memoize_articles=False)
    print("Extracting news page urls!!!")
    for article in web_paper.articles:
        # print each news page url
        print("News page url:", article.url)
        # call spider_newspaper_information to fetch the page's data
        spider_newspaper_information(article.url)

    print("Fetched %s articles in total" % web_paper.size())  # number of articles

# fetch information about a single article
def spider_newspaper_information(url):
    # create the article object and download it
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()
    article.nlp()  # required before reading article.summary

    # print the extracted fields
    print("title=", article.title)  # title of the article
    print("author=", article.authors)  # authors of the article
    print("publish_date=", article.publish_date)  # publication date
    # print("top_image=", article.top_image)  # URL of the article's top image
    # print("movies=", article.movies)  # video links found in the article
    print("text=", article.text, "\n")  # full text of the article
    print("summary=", article.summary)  # summary of the article


if __name__ == "__main__":
    web_lists = ["http://www.sxdi.gov.cn/gzdt/jlsc/","http://www.people.com.cn/GB/59476/"]
    for web_list in web_lists:
        spider_newspaper_url(web_list)
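
Since, as noted in the introduction, individual URLs can fail to download or parse, a more defensive loop would catch newspaper's own exception type. A minimal sketch (the error handling is our addition, not part of the original example):

from newspaper import Article
from newspaper.article import ArticleException

def fetch_article_safely(url):
    """Download and parse one article, returning None instead of raising."""
    article = Article(url, language="zh")
    try:
        article.download()
        article.parse()
    except ArticleException as exc:  # raised on unreachable URLs, timeouts, etc.
        print("skipping %s: %s" % (url, exc))
        return None
    return article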
