Introduction to the Framework
Newspaper is a Python 3 library, but it is not robust enough for production-grade news crawling: the framework is unstable, and various bugs surface during crawling, such as inaccessible URLs or missing news information. Still, for anyone who just wants to collect a news corpus, it is easy to try, easy to use, and requires little prior knowledge of crawlers.
This is Newspaper's github link:
https://github.com/codelucas/newspaper
This is a link to the Newspaper documentation:
https://newspaper.readthedocs.io/en/latest/
This is the link to Newspaper Quick Start:
https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html
Installation method:
pip3 install newspaper3k
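The keyword and summary extraction shown in step 7 below relies on NLTK's punkt tokenizer; the official quickstart recommends downloading the NLTK corpora once after installation. A minimal one-time setup sketch:

# one-time setup: article.nlp() (keywords/summary) needs NLTK's punkt tokenizer
import nltk
nltk.download('punkt')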
Features
The main features are as follows:

- Multi-threaded article download framework (see the sketch after this list)
- News URL identification
- Text extraction from HTML
- Top-image extraction from HTML
- All-image extraction from HTML
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google Trends terminology extraction
- Support for more than 10 languages (English, Chinese, German, Arabic, ...)
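The multi-threaded download framework in the first item is exposed through the news_pool object; the set/join pattern below follows the official quickstart (the source URLs are just the example sites used later in this article):

import newspaper
from newspaper import news_pool

# build a few sources (same example sites as below)
papers = [newspaper.build(url, language="zh", memoize_articles=False)
          for url in ["http://www.sxdi.gov.cn/gzdt/jlsc/",
                      "http://www.people.com.cn/GB/59476/"]]

# download all articles of all sources concurrently,
# with 2 download threads per source (2*2 = 4 threads total)
news_pool.set(papers, threads_per_source=2)
news_pool.join()

# after join() returns, every article's html is already downloaded
print(papers[0].articles[0].html)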
Usage:
1. Establishing News Sources
import newspaper

web_paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh", memoize_articles=False)
Note on article caching: by default, newspaper caches all previously extracted articles and, on later runs against the same source, filters out any article it has already seen. This prevents duplicate articles and speeds up extraction. You can opt out of this behavior with the memoize_articles parameter.
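A minimal sketch of what the cache does in practice (assuming the source publishes nothing new between the calls): with the default memoize_articles=True, a second build only returns articles it has not seen before, so it may come back empty.

import newspaper

url = "http://www.sxdi.gov.cn/gzdt/jlsc/"

first = newspaper.build(url, language="zh")   # memoize_articles defaults to True
print(first.size())                           # some number of articles

second = newspaper.build(url, language="zh")  # already-seen articles are filtered out
print(second.size())                          # likely 0 if nothing new appeared

fresh = newspaper.build(url, language="zh", memoize_articles=False)
print(fresh.size())                           # the full list again; caching disabled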
2. Extracting Article URLs
for article in web_paper.articles:
    print(article.url)

output:
http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019101119998.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100919989.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100819980.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919940.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919933.html
....
3. Extracting Source Categories
for category in web_paper.category_urls():
    print(category)

output:
http://www.sxdi.gov.cn/gzdt/jlsc/
....
4. Extracting Source Feeds
for feed_url in web_paper.feed_urls():
    print(feed_url)
5. Extracting Source Brand and Description
print(web_paper.brand)        # brand
print(web_paper.description)  # description
print("Fetched %s articles in total" % web_paper.size())  # number of articles
6. Downloading Articles
from newspaper import Article

article = Article("http://www.sol.com.cn/", language='zh')  # Chinese
article.download()
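Because downloads fail on some sites (as noted in the introduction), it can help to tune the request settings. A minimal sketch using newspaper's Config object; browser_user_agent and request_timeout are documented configuration fields, but the values here are only examples:

from newspaper import Article, Config

config = Config()
config.browser_user_agent = 'Mozilla/5.0'  # example value; some sites block the default agent
config.request_timeout = 10                # seconds; example value

article = Article("http://www.sol.com.cn/", language='zh', config=config)
article.download()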
7. Parsing the Article and Extracting Information
article.parse()  # parse the web page
print("title=", article.title)                # title of the article
print("author=", article.authors)             # authors of the article
print("publish_date=", article.publish_date)  # publication date
print("top_image=", article.top_image)        # URL of the image at the top of the article
print("movies=", article.movies)              # video links in the article
print("text=", article.text, "\n")            # body text of the article

article.nlp()  # must be called after download() and parse()
print('keywords=', article.keywords)  # keywords extracted from the text
print("summary=", article.summary)    # summary of the article
print("images=", article.images)      # all images extracted from the html
print("imgs=", article.imgs)
print("html=", article.html)          # raw html
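Since download() or parse() raises on inaccessible URLs (one of the instabilities mentioned in the introduction), it is worth wrapping the calls. A minimal defensive sketch; ArticleException is the exception class defined in newspaper's article module:

from newspaper import Article
from newspaper.article import ArticleException

def fetch_article(url):
    """Download and parse one article, returning None on failure."""
    article = Article(url, language='zh')
    try:
        article.download()
        article.parse()
    except ArticleException as e:
        print("skipping %s: %s" % (url, e))
        return None
    return article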
A complete example:
import newspaper
from newspaper import Article


def spider_newspaper_url(url):
    """
    By default, newspaper caches all previously extracted articles and
    filters out any article it has already seen.
    Use the memoize_articles parameter to opt out of this behavior.
    """
    web_paper = newspaper.build(url, language="zh", memoize_articles=False)
    print("Extracting news page urls!!!")
    for article in web_paper.articles:
        # get the url of each news page
        print("News page url:", article.url)
        # call spider_newspaper_information to fetch the news page data
        spider_newspaper_information(article.url)
    print("Fetched %s articles in total" % web_paper.size())  # number of articles


# fetch information about an article
def spider_newspaper_information(url):
    # create the link and download the article
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()

    # get information about the article
    print("title=", article.title)                # title of the article
    print("author=", article.authors)             # authors of the article
    print("publish_date=", article.publish_date)  # publication date
    # print("top_image=", article.top_image)      # URL of the image at the top of the article
    # print("movies=", article.movies)            # video links in the article
    print("text=", article.text, "\n")            # body text of the article
    article.nlp()  # required before reading the summary
    print("summary=", article.summary)            # summary of the article


if __name__ == "__main__":
    web_lists = ["http://www.sxdi.gov.cn/gzdt/jlsc/", "http://www.people.com.cn/GB/59476/"]
    for web_list in web_lists:
        spider_newspaper_url(web_list)