Coin Station log 1 -- crawling blockchain news with a Python 3 crawler
Blockchain has been very popular lately, so I want to build a media site that crawls and analyzes blockchain news. Building the site itself is easy enough, but a media site always needs a data source. Where does the data come from? I'll write about that part later; first, the crawling. In any case, this is all public information, not personal privacy.
To start with, I picked several blockchain news websites to crawl:
- ChainNews
- 8btc
- Regional media
- Golden Finance
- Chainfor
The crawling logic is basically the same for all of them, so I'll focus on the code for just one here.
The following code is for Golden Finance
```python
import urllib.request
import json
import _thread
import threading
import time
import mysql.connector
from pyquery import PyQuery as pq
import news_base

def url_open(url):
    #print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    # Retry up to 10 times; return the decoded body on the first success.
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except:
            print("chainnewscrawl except:")

def get_news(page_count, cb):
    time_utc = int(time.time())
    error_count = 0
    index = 0
    for i in range(1, page_count + 1):
        # The Jinse list API pages downwards from the given information_id.
        response = url_open("https://api.jinse.com/v6/information/list?catelogue_key=www&limit=23&information_id=%d&flag=down&version=9.9.9&_source=www" % (index))
        json_data = json.loads(response)
        for item in json_data['list']:
            if item["type"] != 1 and item["type"] != 2:
                continue
            article_item = news_base.article_info(
                item["extra"]['author'],             # author
                int(item["extra"]["published_at"]),  # publish time (UTC seconds)
                item['title'],                       # title
                item["extra"]['summary'],            # summary
                'content',                           # placeholder, filled in below
                item["extra"]['topic_url'],          # source article URL
                "Golden money")                      # source media name
            source_response = url_open(article_item.source_addr)
            source_doc = pq(source_response)
            article_item.content = source_doc(".js-article-detail").html() if source_doc(".js-article-detail").html() else source_doc(".js-article").html()
            index = item['id']
            # The callback returns True on success; five failures in a row abort the crawl.
            if not cb(article_item):
                error_count += 1
            else:
                error_count = 0
            if error_count >= 5:
                break
        if error_count >= 5:
            break
```
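For a quick test, get_news can be driven with a callback that simply prints each article and reports success so the error counter stays at zero. This is only an illustrative sketch, assuming the module above is saved as news_jinse.py (which matches the import used in the scheduling code further down):

```python
import news_jinse  # the Jinse crawler module shown above (file name assumed)

def print_article(article):
    # Dump the crawled article and signal success so error_count resets.
    print(article)
    return True

if __name__ == "__main__":
    # Crawl up to 10 pages of the Jinse feed and print each article.
    news_jinse.get_news(10, print_article)
```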
Let me briefly go over the libraries involved.
urllib.request is the standard-library module used to fetch pages over HTTP or HTTPS.
Since a single HTTP request has a fairly high chance of failing to open, here is a helper function that I find very useful:
```python
def url_open(url):
    #print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    # Retry up to 10 times; return the decoded body on the first success.
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except:
            print("chainnewscrawl except:")
```
It simply retries the same request up to 10 times, which in practice means nearly every page I try to crawl does get opened. It is very convenient to use.
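One thing to keep in mind: if all 10 attempts fail, the function falls through and returns None, so callers may want to guard against that. A small illustrative sketch (the URL here is just an example, not one of the crawled sites):

```python
# Fetch a page and handle the case where all 10 retries failed.
html = url_open("https://www.example.com/")
if html is None:
    print("giving up on this page")
else:
    print("fetched %d characters" % len(html))
```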
PyQuery is a library for parsing web pages with jQuery-style selectors.
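For example, extracting an article body works roughly like this (a minimal sketch; the .js-article-detail selector is the one used in the Jinse code above):

```python
from pyquery import PyQuery as pq

# Parse the downloaded HTML and pull out the article body with a CSS selector,
# much like $(".js-article-detail").html() would in jQuery.
doc = pq("<div class='js-article-detail'><p>hello blockchain</p></div>")
body_html = doc(".js-article-detail").html()
print(body_html)  # -> <p>hello blockchain</p>
```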
mysql.connector is the driver used to persist the crawled articles into a MySQL database.
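The db_base module itself is not shown in this post. As a rough idea, init_db and insert_article might look something like the sketch below; the table name, columns, and connection parameters are assumptions for illustration, not the actual schema:

```python
import mysql.connector

conn = None

def init_db():
    # Connect to MySQL; host, user, password and database are placeholders.
    global conn
    conn = mysql.connector.connect(host="127.0.0.1", user="root",
                                   password="password", database="coin_news")

def insert_article(article):
    # Insert one article_info row; return True on success so the crawler's
    # error counter resets, False on failure.
    try:
        cursor = conn.cursor()
        cursor.execute(
            "INSERT INTO articles (author, time_utc, title, `desc`, content, "
            "source_addr, source_media) VALUES (%s, %s, %s, %s, %s, %s, %s)",
            (article.author, article.time_utc, article.title, article.desc,
             article.content, article.source_addr, article.source_media))
        conn.commit()
        return True
    except mysql.connector.Error:
        return False
```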
As for news_base, it provides a common data structure, since several websites are being crawled into the same format at once.
As shown below
```python
class article_info:
    def __init__(self, author, time_utc, title, desc, content, source_addr, source_media):
        self.author = author
        self.time_utc = time_utc
        self.title = title
        self.desc = desc
        self.content = content
        self.source_addr = source_addr
        self.source_media = source_media

    def __str__(self):
        return ("""==========================
author:%s
time_utc:%d
title:%s
desc:%s
content:%s
source_addr:%s
source_media:%s""" % (self.author, self.time_utc, self.title, self.desc,
                      'self.content', self.source_addr, self.source_media))
```
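A quick way to see what the structure holds is to build one by hand and print it; this is purely an illustration with made-up values:

```python
import time
import news_base

# Build a dummy article and rely on __str__ for a readable dump.
item = news_base.article_info("satoshi", int(time.time()), "Hello blockchain",
                              "a short summary", "<p>body</p>",
                              "https://www.example.com/a/1", "Golden Finance")
print(item)
```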
Each article is fetched over its own HTTP request, and waiting for the responses one by one is very slow, so multiple threads have to run together; the throughput then grows almost in proportion to the number of threads. Here I open one thread per website. The specific code is as follows:
```python
import db_base
import news_chainfor
import news_jinse
import news_8btc
import news_55coin
import news_chainnews
import threading

class myThread(threading.Thread):
    def __init__(self, func, arg1, arg2):
        threading.Thread.__init__(self)
        self.func = func
        self.arg1 = arg1
        self.arg2 = arg2

    def run(self):
        print("Start thread:" + self.name)
        self.func(self.arg1, self.arg2)
        print("Exit thread:" + self.name)

def run():
    db_base.init_db()
    thread_list = [
        myThread(news_55coin.get_news, 10, db_base.insert_article),
        myThread(news_8btc.get_news, 10, db_base.insert_article),
        myThread(news_jinse.get_news, 10, db_base.insert_article),
        myThread(news_chainfor.get_news, 10, db_base.insert_article),
        myThread(news_chainnews.get_news, 10, db_base.insert_article)
    ]
    for i in range(len(thread_list)):
        thread_list[i].start()
    for i in range(len(thread_list)):
        thread_list[i].join()
```
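As a side note, the same one-thread-per-site fan-out can also be written with the standard library's concurrent.futures, which takes care of starting and joining the threads. This is just an alternative sketch, not the code the site actually runs:

```python
from concurrent.futures import ThreadPoolExecutor

import db_base
import news_55coin, news_8btc, news_jinse, news_chainfor, news_chainnews

def run():
    db_base.init_db()
    crawlers = [news_55coin, news_8btc, news_jinse, news_chainfor, news_chainnews]
    # One worker per site; submit each get_news and wait for all of them.
    with ThreadPoolExecutor(max_workers=len(crawlers)) as pool:
        futures = [pool.submit(mod.get_news, 10, db_base.insert_article)
                   for mod in crawlers]
        for f in futures:
            f.result()  # re-raise any exception from the worker threads
```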
I hadn't used Python much before this and learned it as I went, so the code may be ugly, but I'm putting it out there anyway, haha.
Coin station is now online at www.bxiaozhan.com
All of the site's code (front end and back end) is open source at https://github.com/lihn1987/CoinCollector
Any advice is welcome.