Coin Station Log 1 -- A Python 3 Crawler for Blockchain News

Keywords: Python Blockchain JSON Windows

Blockchain is very hot right now, so I want to build a media site that crawls and analyzes blockchain news. Building the site itself is easy enough, but a media site always needs a data source. Where does the data come from? I'll write about that part later; first, the crawling. In any case this is public information, not personal privacy.
Here I first settled on several blockchain news websites, namely:

  • ChainNews
  • 8btc
  • 55coin
  • Golden Finance (Jinse)
  • Chainfor

The crawling logic is much the same for all of them, so let's focus on one crawler as an example. The following code is for Golden Finance.
import urllib.request
import json
import time
import mysql.connector
from pyquery import PyQuery as pq
import news_base

def url_open(url):
    # Pretend to be a normal browser; some sites reject the default Python user agent.
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    # A single request often fails, so retry up to 10 times.
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)

def get_news(page_count, cb):
    time_utc = int(time.time())  # crawl start time (not used further in this excerpt)
    error_count = 0
    index = 0  # id of the last article seen; the API pages backwards from it
    for i in range(1, page_count+1):
        response = url_open("https://api.jinse.com/v6/information/list?catelogue_key=www&limit=23&information_id=%d&flag=down&version=9.9.9&_source=www"%(index))
        if response is None:
            break  # every retry failed; give up
        json_data = json.loads(response)
        for item in json_data['list']:
            # Only types 1 and 2 are ordinary articles; skip everything else.
            if item["type"] != 1 and item["type"] != 2:
                continue
            article_item = news_base.article_info(
                item["extra"]['author'],             # author
                int(item["extra"]["published_at"]),  # publication time (unix timestamp)
                item['title'],                       # title
                item["extra"]['summary'],            # summary
                'content',                           # placeholder, filled in below
                item["extra"]['topic_url'],          # URL of the article page
                "Golden Finance")                    # source media name
            # Fetch the article page itself and extract the body HTML.
            source_responce = url_open(article_item.source_addr)
            source_doc = pq(source_responce)
            article_item.content = source_doc(".js-article-detail").html() if source_doc(".js-article-detail").html() else source_doc(".js-article").html()
            index = item['id']
            # cb stores the article and returns False when it could not be stored.
            if not cb(article_item):
                error_count += 1
            else:
                error_count = 0
            # Five consecutive failures: stop (likely we've reached already-stored news).
            if error_count >= 5:
                break
        if error_count >= 5:
            break
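To try this crawler on its own, pass get_news a page count and a callback that takes an article_info and returns True on success. A minimal sketch where the callback just prints instead of writing to the database:

def print_cb(article_item):
    print(article_item)
    return True  # report success so the error counter stays at zero

get_news(1, print_cb)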

Let me say a few words about the libraries used here.
urllib.request is the tool for fetching pages over http or https.
Because a single http request has a fairly good chance of failing to open, here is a helper function that I find very useful:

def url_open(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for i in range(10):
        try:
            response = urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
            return response
        except Exception as e:
            print("chainnewscrawl except:", e)

It simply tries to open the URL up to 10 times in a row, which in practice means almost every page we crawl does get fetched; very handy. Note that if all 10 attempts fail it returns None, so the caller should be ready for that.
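One refinement worth considering (my own suggestion, not in the original code) is to pause briefly between retries so an overloaded server gets a moment to recover. A sketch of the same helper with a capped, growing delay; the name url_open_backoff is hypothetical:

import time
import urllib.request

def url_open_backoff(url, retries=10):
    # Same idea as url_open above, plus an exponential delay between attempts.
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url=req, timeout=5).read().decode('utf-8')
        except Exception as e:
            print("chainnewscrawl except:", e)
            time.sleep(min(2 ** attempt, 30))  # 1s, 2s, 4s, ... capped at 30s
    return None  # every attempt failed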
PyQuery is a tool for analyzing pages with jQuery-style selectors, as the small example below shows.
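This tiny self-contained sketch (not from the crawler itself) extracts a node's inner HTML the same way the Golden Finance crawler pulls the article body:

from pyquery import PyQuery as pq

html = '<div class="js-article-detail"><p>hello</p></div>'
doc = pq(html)                            # parse the document
print(doc(".js-article-detail").html())  # jQuery-style selector -> <p>hello</p>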
mysql.connector is the tool for persisting articles to the database; a sketch of how it is used appears after the thread runner below.
As for news_base, it provides the common data structure: since several websites are crawled at the same time, their articles all need the same shape.
It looks like this:

class article_info:
    def __init__(self, author, time_utc, title, desc, content, source_addr, source_media):
        self.author = author              # article author
        self.time_utc = time_utc          # publication time as a unix timestamp
        self.title = title                # headline
        self.desc = desc                  # short summary
        self.content = content            # full article body (HTML)
        self.source_addr = source_addr    # URL of the original article
        self.source_media = source_media  # name of the source site
    def __str__(self):
        # Note: the quoted 'self.content' below prints a placeholder string,
        # so the (potentially huge) HTML body is not dumped to the console.
        return("""==========================
author:%s
time_utc:%d
title:%s
desc:%s
content:%s
source_addr:%s
source_media:%s"""%(self.author, self.time_utc, self.title, self.desc, 'self.content', self.source_addr, self.source_media))
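Constructing and printing one by hand looks like this (all values are made up for illustration):

item = news_base.article_info(
    "some author", 1575700000, "A headline", "A one-line summary",
    "<p>body html</p>", "https://www.example.com/article/1", "Golden Finance")
print(item)  # prints every field except the content body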

The crawl is one http request per article, and waiting for responses one at a time is very slow, so multiple threads must run together; throughput grows roughly in proportion to the number of threads. Here I open one thread per website. The code is as follows:

import db_base
import news_chainfor
import news_jinse
import news_8btc
import news_55coin
import news_chainnews
import threading

class myThread(threading.Thread):
    def __init__(self, func, arg1, arg2):
        threading.Thread.__init__(self)
        self.func = func  # the site's get_news function
        self.arg1 = arg1  # number of pages to crawl
        self.arg2 = arg2  # callback that stores each article
    def run(self):
        print("Start thread:" + self.name)
        self.func(self.arg1, self.arg2)
        print("Exit thread:" + self.name)

def run():
    db_base.init_db()

    # One thread per site; each crawls 10 pages and stores articles
    # through db_base.insert_article.
    thread_list = [
        myThread(news_55coin.get_news, 10, db_base.insert_article),
        myThread(news_8btc.get_news, 10, db_base.insert_article),
        myThread(news_jinse.get_news, 10, db_base.insert_article),
        myThread(news_chainfor.get_news, 10, db_base.insert_article),
        myThread(news_chainnews.get_news, 10, db_base.insert_article)
        ]
    for t in thread_list:
        t.start()

    # Wait for every crawler to finish.
    for t in thread_list:
        t.join()
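db_base itself is not shown in this post. As a rough idea, here is a minimal sketch of what it could look like with mysql.connector; the connection parameters, table layout, and locking are my assumptions, not the actual CoinCollector code:

import threading
import mysql.connector

conn = None
db_lock = threading.Lock()  # a single shared connection is not thread-safe by itself

def init_db():
    global conn
    conn = mysql.connector.connect(  # assumed credentials; replace with your own
        host="localhost", user="root", password="secret", database="coin")

def insert_article(article_item):
    # Returns True on success and False on failure, matching how
    # get_news counts consecutive errors.
    with db_lock:
        cursor = conn.cursor()
        try:
            cursor.execute(
                "INSERT INTO article "
                "(author, time_utc, title, `desc`, content, source_addr, source_media) "
                "VALUES (%s, %s, %s, %s, %s, %s, %s)",
                (article_item.author, article_item.time_utc, article_item.title,
                 article_item.desc, article_item.content,
                 article_item.source_addr, article_item.source_media))
            conn.commit()
            return True
        except mysql.connector.Error as e:
            print("insert_article except:", e)
            return False
        finally:
            cursor.close()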

I haven't used Python much before and only really learned it while writing this, so the code may be ugly. Still, I'll risk making a fool of myself, hahaha.

Coin Station is now online at www.bxiaozhan.com
All the code for the whole site (front end and back end) is open source at https://github.com/lihn1987/CoinCollector
I hope you'll give me plenty of advice.
