Part Four -- using Selenium to capture and analyze stock data

Keywords: Python, JSON encoding, Selenium

This article is the fourth in the series "From Introduction to Persuasion". It can also be read as a follow-up to the previous article, an application of Puppeteer.

Intended readers: beginner Python users, students who want to learn about crawlers or data scraping, and anyone who wants to get to know Selenium and BeautifulSoup.

Background:

Python is good at data processing and has excellent libraries such as numpy and pandas, so let's do an example experiment. I'm interested in the economy, so I'll take stock market data analysis as the subject and, through statistical analysis of historical data, see whether I can find out which indicators of a listed company are the most influential factors in its stock trend.

So where does the data come from? I'll grab it from the Internet. After comparing how convenient data acquisition is on Tencent's stock channel and on East Money (Dongfang Fortune), I chose Tonghuashun (10jqka) as the data source: pages are fetched through Selenium, and the desired fields are extracted by parsing the page DOM with BeautifulSoup. Let's start happily.

Data acquisition process

Step 1: get the paged list of all stocks, and extract two basic pieces of information from each row: the stock code and the stock's Chinese name.
Step 2: some indicators are not in the list, so request the company detail page of each stock to extract the main business, region, total market value, circulating market value, P/E ratio and P/B ratio. This information is stored to form a basic company information table.
Step 3: get the quarterly report information of each stock and store it as a quarterly report table.
Step 4: get the weekly data of each stock and store it as a weekly rise-and-fall table. Daily data was not stored because daily fluctuations are more random, and monthly data was not stored because each record spans too long a period; weekly data is the compromise. (A sketch of the resulting records follows this list.)
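
Based on the fields extracted later in this article, the records handed to the DAO classes (Company, SeasonReport, WeekLine) look roughly as follows. This is only a sketch inferred from the code; the real table definitions live in the lwy.stock.dao package, which is not shown here.

#step 1 -- one row of the stock list page
astock = {"stock_code": "002774", "company_name": "..."}

#step 2 -- supplement from the company detail page (keyed by "code")
comp = {"code": "002774", "main_yewu": "...", "location": "...",
        "total_value": "...", "flut_value": "...",
        "clean_value": "...", "profit_value": "..."}

#step 3 -- one quarter of one stock
report = {"stock_code": "002774", "season": "...",
          "total_profit": "...", "profit_ratio": "...",
          "total_income": "...", "income_ratio": "...",
          "clean_ratio": "...", "debt_ratio": "..."}

#step 4 -- one week of one stock: (stock_code, week, open, close, wave in percent)
wl = ("002774", "...", 10.0, 10.5, 5.0)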

Code analysis

Corresponding to the four steps of the data acquisition process above, the code below is divided into four blocks.

List data acquisition and analysis

What is parsed here is the paged list data of listed companies.

import time
import re
from selenium import webdriver
from bs4 import BeautifulSoup
from lwy.stock.dao.company import Company

#Shanghai A shares, Shenzhen A shares and the Shenzhen SME board
SHA = "http://q.10jqka.com.cn/index/index/board/hs/field/zdf/order/desc/page/{0}/ajax/1/"
SZA = "http://q.10jqka.com.cn/index/index/board/ss/field/zdf/order/desc/page/{0}/ajax/1/"
SZZX = "http://q.10jqka.com.cn/index/index/board/zxb/field/zdf/order/desc/page/{0}/ajax/1/"
#Combined acquisition: fetch all stock data
def getAllStock():
    #pageOne(SZA, 71)
    #pageOne(SZA, 24)
    pageOne(SZZX, 1)

#Loop through the pages to fetch data
def pageOne(url,pagenum):
    driver = webdriver.Chrome("./lib/chromedriver.exe")

    detail_links = []
    for page in range(5,pagenum):
        print("now pagenum is : ",page)
        driver.get(url.format(page))
        detail_links = anaList(driver.page_source)
        time.sleep(15)
        #break  #only fetch one page first
        #Loop over the detail links to fetch and update each company's details
        #for link in detail_links:
        #    _snatchDetail(driver,link)

    driver.quit()

#Parse the HTML string with BeautifulSoup
def anaList(htmlstr):
    bf = BeautifulSoup(htmlstr,"html.parser")
    trs = bf.select("tbody tr")
    #Company details link
    comp_links = []
    #trs = bf.find("tbody").children
    for tr in trs:
        #14 elements in total
        astock = {}
        ind = 1
        #print("tr:",tr)
        tds = tr.find_all("td")
        for td in tds:
            if ind == 2: #stock code
                astock["stock_code"] = td.text
                comp_links.append("http://stockpage.10jqka.com.cn/{0}/company/".format(td.text))
            elif ind == 3: #Chinese name
                astock["company_name"] = td.text
                break
            ind += 1
    
        #print(astock)
        Company().add(astock)

    return comp_links

The basic usage of Selenium and BeautifulSoup shown above is relatively simple. The whole process is not fully automatic: fetch the data, then observe and analyze it; if the data is incorrect or abnormal, stop the program immediately, adjust the parameters and continue.

Company details data acquisition

#Query all companies whose details have not been filled in yet and continue filling them
def fillExtend():
    stocks = Company().GetUnFill()
    driver = webdriver.Chrome("./lib/chromedriver.exe")
    url = "http://stockpage.10jqka.com.cn/{0}/company/"
    for code in stocks:
        _snatchDetail(driver,url.format(code))

#Grab supplementary information from details page
def _snatchDetail(driver,link):
    m = re.search(r"\d{6}",link)
    comp = {"code":m.group()}
    driver.get(link)
    try:
        driver.switch_to.frame("dataifm")
    except Exception as ex:
        print("cannot found frame:",comp)
        return
    htmlb = driver.find_element_by_css_selector(".m_tab_content2").get_attribute("innerHTML")
    bf = BeautifulSoup(htmlb,"html.parser")
    strongs = bf.select("tr>td>span")
    comp["main_yewu"] = strongs[0].text
    comp["location"] = strongs[-1].text

    driver.switch_to.parent_frame()
    driver.switch_to.frame("ifm")
    time.sleep(3)
    htmla = driver.find_element_by_css_selector("ul.new_trading").get_attribute("innerHTML")
    bf = BeautifulSoup(htmla,"html.parser")
    _getvalues(bf,comp)
    #print("list.py comp:",comp)
    Company().update(comp)

    time.sleep(10)

def _getvalues(bf,comp):
    strongs = bf.select("li span strong")
    comp["total_value"] = strongs[7].text
    comp["flut_value"] = strongs[10].text 
    comp["clean_value"] = strongs[8].text 
    profit = strongs[11].text
    if profit == "loss":
        profit = -1.0
    comp["profit_value"] = profit
    

These two lines deserve attention:
driver.switch_to.parent_frame()
driver.switch_to.frame("ifm")
When locating elements, check whether the page contains an iframe. If it does, first switch the driver into the corresponding frame; the idea is the same as switching the document context on the front end.
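
If a frame is loaded asynchronously, it is safer to wait for it explicitly instead of assuming it is already there. A minimal sketch using Selenium's expected conditions, as an alternative to the plain switch_to.frame calls above (my own variant, not part of the original code):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def switch_into(driver, frame_name, timeout=10):
    #Wait until the iframe is available, then switch the driver into it
    WebDriverWait(driver, timeout).until(
        EC.frame_to_be_available_and_switch_to_it(frame_name))

#usage inside _snatchDetail, e.g. switch_into(driver, "dataifm")
#when done, driver.switch_to.default_content() returns to the top-level document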

Weekly data acquisition

#Weekly data acquisition
import urllib.request
import time
import re
import os
import json
from lwy.stock.dao.company import Company
from lwy.stock.dao.weekline import WeekLine

def GetWeekLine():
    codes = Company().PageCode("600501",1000)

    url = "http://d.10jqka.com.cn/v6/line/hs_{0}/11/all.js"
    header = [("Referer", "http://stockpage.10jqka.com.cn/HQ_v4.html"),
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36")]
    for code in codes:
        print("code:",url.format(code))
        opener = urllib.request.build_opener()
        opener.addheaders = header
        
        with opener.open(url.format(code)) as resp:
            content = resp.read().decode()
        m = re.search(r"{.*}",content)
        if m is None:
            print("not found: ",code)
        else:
            with open("./weeks/{0}.json".format(code),"w",encoding="utf-8") as wfile:
                wfile.write(m.group())

        time.sleep(10)


#Parse weekly line data from the saved json files
def ana_weekline():
    #Traverse file directory
    files = os.listdir("./weeks")
    for file in files:
        fname = "./weeks/"+file
        if os.path.isfile(fname):
            bsname = file[0:6]
            with open(fname,encoding="utf-8") as rfile:
                content = rfile.read()
                _withJSON(bsname,json.loads(content))
                #After success the json file should be moved to another directory
                #os.rename(file,file+"_old")
            #break  #parse only one file and stop
    pass

def WeekTest():
    with open("./weeks/002774.json",encoding="utf-8") as rfile:
        content = rfile.read()
        _withJSON("002774",json.loads(content))

def _withJSON(scode,jdata):
    dates = jdata["dates"].split(',')
    prices = jdata["price"].split(",")
    myears = jdata["sortYear"]
    #sortYear holds [year, weeks_in_that_year] pairs covering up to 4 years, e.g. [[2017,40],[2018,51]]
    if len(myears)>4: #keep at most the last four years
        myears = myears[-4:]
    preyear = [] #year header: holds the year for every week of the last 4 years
    for item in myears:
        y = item[0]
        num = item[1]
        preyear.extend( [y for i in range(num)])
    #Both the price data and the date data are traversed from the end
    #print("preyear:",preyear)
    week = len(preyear)
    while week >0:
        ind_week = -1*week
        #Each week occupies four price values: a base (the low) plus offsets for the open, high and close
        ind_price = -4*week
        #From them we derive the open, the close, the fluctuation and the full week name
        kai = float(prices[ind_price])+float(prices[ind_price+1])
        shou = float(prices[ind_price]) +float(prices[ind_price+3])
        wave = (shou-kai)*100/kai   #Fluctuation in percentage
        wfull = str(preyear[ind_week]) + dates[ind_week]
        week -= 1
        #Note: wave is the intra-week fluctuation; a real rise/fall should be measured against the previous period's close, so wave alone is of limited use
        #print("{0}: open -- {1}, close -- {2}, fluctuate -- {3:.2f}".format(wfull,kai,shou,wave))
        #Order: stock_code, week, start_value, end_value, wave_value
        wl = (scode,wfull,kai,shou,wave)

        WeekLine().AddOne(wl)

Weekly data is actually JSON returned by requesting a js file; the JSON part of the response is extracted and saved, and the saved files are then read and parsed one by one.
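
For reference, _withJSON assumes the saved JSON has roughly the shape sketched below. This is reconstructed from the fields the parser reads; the concrete values and the exact format of dates are illustrative guesses, not documentation of the real response.

#Shape assumed by _withJSON (illustrative values only)
week_json = {
    #one comma separated label per week; the year from sortYear is prepended to it
    "dates": "0104,0111,0118",
    #four comma separated numbers per week; the parser takes the 1st as a base
    #and adds the 2nd / 4th to it to obtain that week's open / close
    "price": "980,12,55,31,1011,5,40,22,1020,3,30,15",
    #[year, number_of_weeks_in_that_year] pairs, oldest first, trimmed to 4 years
    "sortYear": [[2019, 3]],
}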

Quarterly report data acquisition

import time
from selenium import webdriver
from bs4 import BeautifulSoup
from lwy.stock.dao.company import Company
from lwy.stock.dao.reports import SeasonReport

driver = webdriver.Chrome("./lib/chromedriver.exe") 

#Public entry point: crawl the quarterly reports
def spideSeason():
    #Get stock codes in batches and loop through them
    codes = Company().PageCode("002114",1000)
    for code in codes:
        print("now get code is :",code)
        content = _fromHttp(code)
        if content == "":
            continue
        _anaReport(content,code)
        time.sleep(10)
    

def _anaReport(content, code):
    bf = BeautifulSoup(content,"html.parser")
    divs = bf.find("div",id="data-info").find_next_sibling().select("div.td_w")
    seasons = []
    #At most 16 quarters; if there are fewer, use however many the data table actually has
    sealen = 0
    for div in divs:
        if sealen >=16:
            break
        seasons.append(div.text)
        sealen+=1
    
    keymap = {"3":"total_profit","4":"profit_ratio","5":"total_income","6":"income_ratio","9":"clean_ratio","10":"debt_ratio"}
    trs = bf.select("table.tbody > tbody > tr")
    reports = [ {"season":x} for x in seasons ]
    #print("reports:",reports)
    for ind,keyname in keymap.items():
        #Index meaning -- 3: net profit excluding non-recurring items, 4: its growth rate, 5: total operating revenue, 6: revenue growth rate, 9: return on equity, 10: debt ratio
        tds = trs[int(ind)].find_all("td")
        for tdindex in range(0,sealen):
            text = tds[tdindex].text
            if "%" in text:
                text = text.replace("%","")
            elif "Billion" in text:
                text = text.replace("Billion","")
            elif "ten thousand" in text:
                f = float(text.replace("ten thousand",""))
                text = "{0:.4f}".format(f/10000.0)
            reports[tdindex][keyname] = text
    for r in reports:
        r["stock_code"] = code
        #Skip the record when the net profit or the total operating revenue is blank
        if r["total_profit"] == "" or r["total_income"] == "":
            continue
        #print(r)
        SeasonReport().add(r)

def _fromHttp(scode):
    global driver
    driver.get("http://stockpage.10jqka.com.cn/{0}/finance/#finance".format(scode))
    time.sleep(3)
    try:
        driver.switch_to.frame("dataifm")
    except:
        return ""
    #Locate the li for the quarterly report and click
    tab3 = driver.find_element_by_css_selector("ul.tabDataTab").find_element_by_link_text("Quarterly basis")
    tab3.click()
    time.sleep(1)
    content = driver.find_element_by_css_selector("div.data_tbody").get_attribute("innerHTML")
    with open("./reports/{0}.html".format(scode),"w",encoding="utf-8") as wfile:
        wfile.write(content)

    return content

The start-up function for the quarterly report data takes a stock code obtained through a database query, because acquisition basically proceeds in increasing order of stock code.
The financial report page is divided into several tabs by reporting period: by report date, by single quarter and by year. The single-quarter tab is selected with
tab3 = driver.find_element_by_css_selector("ul.tabDataTab").find_element_by_link_text("Quarterly basis")
tab3.click()
time.sleep(1)
Sleeping for one second after the click is a personal habit; I always like to wait a moment after an operation, and I have not checked whether it is strictly necessary.
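
A fixed sleep works, but an explicit wait states the intent more clearly and fails loudly if the tab never appears. A sketch of the variant I would use (not the original code; the link text and selector are the ones already used above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_quarter_tab(driver, timeout=10):
    #Wait until the single-quarter tab is clickable, then click it
    tab3 = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.LINK_TEXT, "Quarterly basis")))
    tab3.click()
    #Wait until the data table body exists before reading its HTML
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.data_tbody")))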

Several ideas

1: Control the request frequency. Requests to the Tonghuashun (10jqka) pages must be rate limited; if they come too fast, the site redirects you to http://stockpage.10jqka.com.cn/. The frequency used in this article is close to the critical value; anything faster triggers the automatic redirect. A small throttling helper is sketched below.
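
A simple way to stay below that critical frequency is a randomized pause between requests, so they do not form a perfectly regular rhythm; a minimal sketch (the fixed time.sleep(10) / time.sleep(15) calls in the code above do the same job more crudely):

import random
import time

def polite_sleep(base=10, jitter=5):
    #Sleep base seconds plus a random extra so requests are spread out irregularly
    time.sleep(base + random.uniform(0, jitter))

#e.g. call polite_sleep() after every driver.get(...) or opener.open(...)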

2: Proceed step by step, layer by layer. The acquisition code itself is refined gradually. The data source looks as if it follows a unified format, but it does not. For example, the net profit in the quarterly report: you design the column as a floating-point number, and then individual pages show '-'. Any of these surprises can cause data loss, exceptions or bad inserts. Expecting a fully automatic one-shot acquisition is unrealistic, and if an error forces you to re-crawl everything you waste a lot of requests and may get blocked. The better way is to acquire one layer of data, check and confirm it, and only then continue to the next layer. Working like this, log how far you have gotten, and when something goes wrong, resume from the point where the anomaly occurred (see the sketch below).
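
One cheap way to record how far you have gotten is a checkpoint file holding the last successfully processed stock code, which fits the Company().PageCode(start_code, count) pattern used above. A sketch with a hypothetical progress.txt file (not part of the original code):

import os

CHECKPOINT = "./progress.txt"   #hypothetical file holding the last finished stock code

def load_checkpoint(default="000001"):
    if os.path.isfile(CHECKPOINT):
        with open(CHECKPOINT, encoding="utf-8") as f:
            return f.read().strip() or default
    return default

def save_checkpoint(code):
    with open(CHECKPOINT, "w", encoding="utf-8") as f:
        f.write(code)

#usage: codes = Company().PageCode(load_checkpoint(), 1000)
#       and call save_checkpoint(code) after each stock succeeds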

3: When you hit a pit, walk around it. This may not be the most positive attitude, but it is often useful: filling every hole takes too much time. When learning something new, it is impossible to understand it thoroughly from every angle; blind spots are inevitable, and sometimes you simply cannot find a solution right away. At that point, pause for a while and ask whether the thing must be done at all, or whether there is another way around it.

Data cleaning

The professional term for this is probably data cleaning.
A small part of the captured data has no reference value. To reduce its negative impact it needs to be filtered out or supplemented, for example (a pandas sketch of these filters follows the list):
Stocks listed for only about a year or less
Quarterly data that is incomplete, or quarterly reports whose income and profit fields are "-"
Stocks suspended for a long time
Only keep companies headquartered in big cities, in particular excluding companies whose headquarters are in third or fourth tier small cities (such companies are more affected by non-operating factors such as management ability, interest disputes and insider trading)
......
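
Once the data is in tables, most of these filters are a few lines of pandas. A sketch with assumed DataFrame and column names (only stock_code, season, total_income and total_profit come from the code above; everything else is hypothetical):

import pandas as pd

def clean_reports(reports: pd.DataFrame) -> pd.DataFrame:
    #Drop quarters whose income or profit is missing or recorded as "-"
    bad = reports["total_income"].isin(["", "-"]) | reports["total_profit"].isin(["", "-"])
    reports = reports[~bad]
    #Keep only stocks with more than four quarters of history (listed for over a year)
    counts = reports.groupby("stock_code")["season"].transform("count")
    return reports[counts > 4]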

In the end I obtained about 420,000 weekly fluctuation records, 40,000 quarterly report records, and basic data for more than 2,000 listed companies.

Data analysis -- where I bow out

Perhaps all of the above data could be obtained from a dedicated financial data API; I have not looked into that carefully, but as an example of using Selenium and BeautifulSoup this exercise is still somewhat interesting. The purpose of acquiring data, however, is to analyze it, and my original intention was to predict a stock's future rises and falls from the listed company's past quarterly operating data and weekly movements.

However, with my advanced mathematics dating from more than ten years ago and steadily fading since, it is hard for me to find a computational model that fits the past and predicts the future. If any reader has relevant experience, please point me in a direction; a concrete example (a blog address is fine) would be even better.
Thank you for replying in the comments!

Posted by haku on Tue, 26 Nov 2019 20:18:47 -0800