This article is the fourth in the "from getting started to giving up" series. It can also be read as a follow-up to the previous article on Puppeteer.
Intended readers: beginner Python users, students who want to learn about crawlers or data scraping, and anyone who wants to get to know Selenium and BeautifulSoup.
Background:
Python is good at data processing, with excellent libraries such as NumPy and pandas, so let's do an example project. I'm interested in the economy, so I'll take stock market data analysis as the topic: through statistical analysis of historical data, can I find out which indicators of a listed company are the most influential factors in its stock's trend?
Then, where does the data come from? I'll scrape it from the Internet. After comparing how convenient data acquisition is on Tencent's stock channel and on Eastmoney (Dongfang Fortune), I chose Tonghuashun (10jqka) as the data source: requests are made through Selenium, and the desired fields are extracted by parsing the page DOM with BeautifulSoup. Let's start happily.
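Before the full scripts, here is a minimal sketch of the basic pattern used throughout this article: Selenium drives a real browser to load the page, and BeautifulSoup parses the resulting HTML. The chromedriver path and the CSS selector below are placeholders, not part of the project code:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome("./lib/chromedriver.exe")   # path to your local chromedriver
driver.get("http://q.10jqka.com.cn/")                 # any page from the chosen data source
soup = BeautifulSoup(driver.page_source, "html.parser")
for td in soup.select("tbody tr td"):                 # selector is illustrative
    print(td.text)
driver.quit()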
Data acquisition process
Step 1: Get the paginated list of all stocks and extract two basic fields from each row: the stock code and the stock's Chinese name.
Step 2: Some indicators are not in the list, so we also request each stock's company details page to extract the main business, region, total market value, circulating market value, P/E ratio and P/B ratio. All of this goes into storage to form a basic company information table.
Step 3: Get each stock's quarterly report information and store it in a quarterly report table.
Step 4: Get each stock's weekly price data and store it in a weekly rise-and-fall table. Daily data was not stored because daily fluctuations are more incidental and the number of rows would be large; monthly data was not stored because its time span is too long.
Code analysis
Corresponding to the four steps above, the code is divided into four blocks.
List data acquisition and analysis
This part fetches and parses the paginated list of listed companies.
import time
import re
from selenium import webdriver
from bs4 import BeautifulSoup
from lwy.stock.dao.company import Company

# Shanghai A, Shenzhen A, and Shenzhen SME board
SHA = "http://q.10jqka.com.cn/index/index/board/hs/field/zdf/order/desc/page/{0}/ajax/1/"
SZA = "http://q.10jqka.com.cn/index/index/board/ss/field/zdf/order/desc/page/{0}/ajax/1/"
SZZX = "http://q.10jqka.com.cn/index/index/board/zxb/field/zdf/order/desc/page/{0}/ajax/1/"

# Combined acquisition: fetch all stock data
def getAllStock():
    #pageOne(SZA, 71)
    #pageOne(SZA, 24)
    pageOne(SZZX, 1)

# Fetch data page by page
def pageOne(url, pagenum):
    driver = webdriver.Chrome("./lib/chromedriver.exe")
    detail_links = []
    for page in range(5, pagenum):
        print("now pagenum is : ", page)
        driver.get(url.format(page))
        detail_links = anaList(driver.page_source)
        time.sleep(15)
        #break  # one page first
    # Loop over the detail links to fetch and update all company details
    #for link in detail_links:
    #    _snatchDetail(driver, link)
    driver.quit()

# Parse the HTML string with BeautifulSoup
def anaList(htmlstr):
    bf = BeautifulSoup(htmlstr, "html.parser")
    trs = bf.select("tbody tr")
    # Company detail links
    comp_links = []
    #trs = bf.find("tbody").children
    for tr in trs:
        # 14 elements in total
        astock = {}
        ind = 1
        #print("tr:", tr)
        tds = tr.find_all("td")
        for td in tds:
            if ind == 2:    # stock code
                astock["stock_code"] = td.text
                comp_links.append("http://stockpage.10jqka.com.cn/{0}/company/".format(td.text))
            elif ind == 3:  # Chinese name
                astock["company_name"] = td.text
                break
            ind += 1
        #print(astock)
        Company().add(astock)
    return comp_links
The basic use of Selenium and BeautifulSoup appears above, and it is relatively simple. The whole process is not fully automatic: fetch some data, inspect it, and if the data is wrong or abnormal, stop the program immediately, adjust the parameters, and continue.
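The lwy.stock.dao layer (Company, WeekLine, SeasonReport) never appears in the article, so how Company().add() actually persists the rows is left open. For readers who want something runnable, here is a hedged sketch of a stand-in using sqlite3; the table and column names are my assumptions, not the author's schema:

import sqlite3

class Company:
    """Hypothetical stand-in for lwy.stock.dao.company.Company (not the author's code)."""
    def __init__(self, db="stock.db"):
        self.conn = sqlite3.connect(db)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS company ("
            "stock_code TEXT PRIMARY KEY, company_name TEXT, "
            "main_yewu TEXT, location TEXT, total_value TEXT, "
            "flut_value TEXT, clean_value TEXT, profit_value TEXT)")

    def add(self, astock):
        # Store the code/name pair collected from the list page
        self.conn.execute(
            "INSERT OR IGNORE INTO company(stock_code, company_name) VALUES(?, ?)",
            (astock["stock_code"], astock["company_name"]))
        self.conn.commit()

    def GetUnFill(self):
        # Codes whose detail fields have not been filled in yet
        rows = self.conn.execute(
            "SELECT stock_code FROM company WHERE main_yewu IS NULL")
        return [r[0] for r in rows]

    def update(self, comp):
        # Fill in the fields scraped from the company details page
        self.conn.execute(
            "UPDATE company SET main_yewu=?, location=?, total_value=?, "
            "flut_value=?, clean_value=?, profit_value=? WHERE stock_code=?",
            (comp["main_yewu"], comp["location"], comp["total_value"],
             comp["flut_value"], comp["clean_value"], comp["profit_value"], comp["code"]))
        self.conn.commit()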
Company details data acquisition
# Query all companies whose details have not been filled in yet, and fill them in
def fillExtend():
    stocks = Company().GetUnFill()
    driver = webdriver.Chrome("./lib/chromedriver.exe")
    url = "http://stockpage.10jqka.com.cn/{0}/company/"
    for code in stocks:
        _snatchDetail(driver, url.format(code))

# Grab supplementary information from the details page
def _snatchDetail(driver, link):
    m = re.search(r"\d{6}", link)
    comp = {"code": m.group()}
    driver.get(link)
    try:
        driver.switch_to.frame("dataifm")
    except Exception as ex:
        print("cannot found frame:", comp)
        return
    htmlb = driver.find_element_by_css_selector(".m_tab_content2").get_attribute("innerHTML")
    bf = BeautifulSoup(htmlb, "html.parser")
    strongs = bf.select("tr>td>span")
    comp["main_yewu"] = strongs[0].text
    comp["location"] = strongs[-1].text
    driver.switch_to.parent_frame()
    driver.switch_to.frame("ifm")
    time.sleep(3)
    htmla = driver.find_element_by_css_selector("ul.new_trading").get_attribute("innerHTML")
    bf = BeautifulSoup(htmla, "html.parser")
    _getvalues(bf, comp)
    #print("list.py comp:", comp)
    Company().update(comp)
    time.sleep(10)

def _getvalues(bf, comp):
    strongs = bf.select("li span strong")
    comp["total_value"] = strongs[7].text
    comp["flut_value"] = strongs[10].text
    comp["clean_value"] = strongs[8].text
    profit = strongs[11].text
    if profit == "亏损":   # the Chinese word for "loss" shown on the page
        profit = -1.0
    comp["profit_value"] = profit
These two lines need to be noted.
driver.switch_to.parent_frame()
driver.switch_to.frame("ifm")
When locating elements, pay attention to whether the page contains an iframe. If it does, you must first switch the driver into the corresponding frame; the idea is the same as working with an iframe's document on the front end.
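A minimal sketch of that frame-switching pattern, using the frame names that appear on the 10jqka pages above:

def read_both_frames(driver):
    # Enter the first iframe before locating elements that live inside it
    driver.switch_to.frame("dataifm")
    inner = driver.find_element_by_css_selector(".m_tab_content2").text
    # Step back up to the parent document before entering the sibling iframe
    driver.switch_to.parent_frame()
    driver.switch_to.frame("ifm")
    trading = driver.find_element_by_css_selector("ul.new_trading").text
    # Return to the top-level document when done
    driver.switch_to.default_content()
    return inner, trading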
Weekly data acquisition
# Weekly data acquisition
import urllib.request
import time
import re
import os
import json
from lwy.stock.dao.company import Company
from lwy.stock.dao.weekline import WeekLine

def GetWeekLine():
    codes = Company().PageCode("600501", 1000)
    url = "http://d.10jqka.com.cn/v6/line/hs_{0}/11/all.js"
    header = [("Referer", "http://stockpage.10jqka.com.cn/HQ_v4.html"),
              ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36")]
    for code in codes:
        print("code:", url.format(code))
        opener = urllib.request.build_opener()
        opener.addheaders = header
        with opener.open(url.format(code)) as resp:
            content = resp.read().decode()
            m = re.search(r"{.*}", content)
            if m is None:
                print("not found: ", code)
            else:
                with open("./weeks/{0}.json".format(code), "w", encoding="utf-8") as wfile:
                    wfile.write(m.group())
        time.sleep(10)

# Parse weekly lines from the saved json files
def ana_weekline():
    # Traverse the file directory
    files = os.listdir("./weeks")
    for file in files:
        fname = "./weeks/" + file
        if os.path.isfile(fname):
            bsname = file[0:6]
            with open(fname, encoding="utf-8") as rfile:
                content = rfile.read()
                _withJSON(bsname, json.loads(content))
            # After success, the json file should be moved to another directory
            #os.rename(file, file + "_old")
        #break  # parse one file and stop
    pass

def WeekTest():
    with open("./weeks/002774.json", encoding="utf-8") as rfile:
        content = rfile.read()
        _withJSON("002774", json.loads(content))

def _withJSON(scode, jdata):
    dates = jdata["dates"].split(',')
    prices = jdata["price"].split(",")
    myears = jdata["sortYear"]
    # Keep at most 4 years of data; the year/week-count pairs look like [[2017, 40], [2018, 51]]
    if len(myears) > 4:
        myears = myears[-4:]
    preyear = []   # year header for every weekly line of the last 4 years
    for item in myears:
        y = item[0]
        num = item[1]
        preyear.extend([y for i in range(num)])
    # Both the price data and the date data are indexed from the end
    #print("preyear:", preyear)
    week = len(preyear)
    while week > 0:
        ind_week = -1 * week
        # Every four price values form one week of data: low, open, high, close
        ind_price = -4 * week
        # Compute the open, close, fluctuation and the full week label
        kai = float(prices[ind_price]) + float(prices[ind_price + 1])
        shou = float(prices[ind_price]) + float(prices[ind_price + 3])
        wave = (shou - kai) * 100 / kai   # fluctuation in percent
        wfull = str(preyear[ind_week]) + dates[ind_week]
        week -= 1
        # Note: wave is the intra-week fluctuation; a real rise/fall should be compared
        # with the previous period's close, so wave alone may not be very meaningful
        #print("{0}: open -- {1}, close -- {2}, fluctuate -- {3:.2f}".format(wfull, kai, shou, wave))
        # Order: stock_code, week, start_value, end_value, wave_value
        wl = (scode, wfull, kai, shou, wave)
        WeekLine().AddOne(wl)
The weekly data is actually JSON returned by requesting a .js endpoint; it is saved to disk first, and then the files are read and parsed one by one.
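For reference, here is a minimal sketch of what _withJSON assumes about that JSON. The field names match the code above; the sample values and the interpretation of the four price values per week (a low followed by offsets for open, high and close) are my reading of the code, not documented by the data source:

# Illustrative sample only; the real JSON comes from the all.js endpoint above
sample = {
    "dates": "0104,0111",                          # week-ending dates (MMDD), assumed format
    "price": "10.0,0.2,0.8,0.5,10.5,0.1,0.6,0.3",  # 4 values per week: low, then offsets for open/high/close
    "sortYear": [[2019, 2]],                       # [year, number of weeks in that year]
}
prices = sample["price"].split(",")
low = float(prices[0])
kai = low + float(prices[1])    # open  = 10.2
shou = low + float(prices[3])   # close = 10.5
print(kai, shou, (shou - kai) * 100 / kai)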
Quarterly report data acquisition
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from lwy.stock.dao.company import Company
from lwy.stock.dao.reports import SeasonReport

driver = webdriver.Chrome("./lib/chromedriver.exe")

# Public entry point: crawl the quarterly reports
def spideSeason():
    # Get stock codes in batches and loop over them
    codes = Company().PageCode("002114", 1000)
    for code in codes:
        print("now get code is :", code)
        content = _fromHttp(code)
        if content == "":
            continue
        _anaReport(content, code)
        time.sleep(10)

def _anaReport(content, code):
    bf = BeautifulSoup(content, "html.parser")
    divs = bf.find("div", id="data-info").find_next_sibling().select("div.td_w")
    seasons = []
    # Up to 16 quarters; if there are fewer, the number in the data table prevails
    sealen = 0
    for div in divs:
        if sealen >= 16:
            break
        seasons.append(div.text)
        sealen += 1
    keymap = {"3": "total_profit", "4": "profit_ratio", "5": "total_income",
              "6": "income_ratio", "9": "clean_ratio", "10": "debt_ratio"}
    trs = bf.select("table.tbody > tbody > tr")
    reports = [{"season": x} for x in seasons]
    #print("reports:", reports)
    for ind, keyname in keymap.items():
        # Row index meaning -- 3: net profit excl. non-recurring items, 4: its growth rate,
        # 5: total revenue, 6: revenue growth rate, 9: return on net assets, 10: debt ratio
        tds = trs[int(ind)].find_all("td")
        for tdindex in range(0, sealen):
            text = tds[tdindex].text
            if "%" in text:
                text = text.replace("%", "")
            elif "亿" in text:   # unit on the page: 亿 (hundred million)
                text = text.replace("亿", "")
            elif "万" in text:   # unit on the page: 万 (ten thousand); convert to 亿
                f = float(text.replace("万", ""))
                text = "{0:.4f}".format(f / 10000.0)
            reports[tdindex][keyname] = text
    for r in reports:
        r["stock_code"] = code
        # Skip the record when net profit or total revenue is blank
        if r["total_profit"] == "" or r["total_income"] == "":
            continue
        #print(r)
        SeasonReport().add(r)

def _fromHttp(scode):
    global driver
    driver.get("http://stockpage.10jqka.com.cn/{0}/finance/#finance".format(scode))
    time.sleep(3)
    try:
        driver.switch_to.frame("dataifm")
    except:
        return ""
    # Locate the li for the quarterly tab ("按单季度") and click it
    tab3 = driver.find_element_by_css_selector("ul.tabDataTab").find_element_by_link_text("按单季度")
    tab3.click()
    time.sleep(1)
    content = driver.find_element_by_css_selector("div.data_tbody").get_attribute("innerHTML")
    with open("./reports/{0}.html".format(scode), "w", encoding="utf-8") as wfile:
        wfile.write(content)
    return content
The entry function for the quarterly report data takes a starting stock code, obtained by querying the database, because acquisition basically proceeds in ascending order of stock code.
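PageCode itself belongs to the DAO layer that the article does not show. Presumably it returns the next batch of codes after the given starting code; here is a hedged sketch, written as a method that could be added to the hypothetical Company class sketched earlier:

    def PageCode(self, start_code, limit):
        # Return up to `limit` stock codes strictly greater than `start_code`,
        # so that repeated calls walk through all codes in ascending order.
        rows = self.conn.execute(
            "SELECT stock_code FROM company WHERE stock_code > ? "
            "ORDER BY stock_code LIMIT ?", (start_code, limit))
        return [r[0] for r in rows]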
The financial report data is split into several tabs by reporting period (per report, per quarter, per year). Here we switch to the quarterly tab with:

tab3 = driver.find_element_by_css_selector("ul.tabDataTab").find_element_by_link_text("按单季度")
tab3.click()
time.sleep(1)

(The link text is the tab's Chinese label, "按单季度", i.e. "by single quarter".) Sleeping for one second after the click is a personal habit; I always like to wait a moment after an operation, and I haven't checked whether it is actually necessary.
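If the fixed sleep ever proves flaky, Selenium's explicit waits are a more deterministic alternative. A sketch, not what the article's code does; the tab label is again assumed to be "按单季度":

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_quarter_tab(driver, timeout=10):
    wait = WebDriverWait(driver, timeout)
    # Wait until the tab is clickable instead of sleeping a fixed amount of time
    tab = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "按单季度")))
    tab.click()
    # Wait until the data table body is present before reading its HTML
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.data_tbody")))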
Several ideas
1: Control the request frequency. Requests to the Tonghuashun (10jqka) pages must be rate limited. If requests come too fast, the site automatically redirects you to http://stockpage.10jqka.com.cn/. The sleep intervals used in this article are close to that critical value; anything faster triggers the redirect.
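One simple way to stay on the safe side of that limit is to centralize the delay in a small helper and add some random jitter. A minimal sketch (the helper name and the 10 to 15 second range are my own choices, not measured limits):

import random
import time

def polite_sleep(base=10, jitter=5):
    # Sleep a randomized interval so requests do not arrive at a fixed rhythm
    time.sleep(base + random.uniform(0, jitter))

# usage inside any of the crawl loops above:
# for code in codes:
#     fetch(code)
#     polite_sleep()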
2: Proceed layer by layer and verify as you go. Data acquisition is something you refine gradually. The data source looks like it follows a uniform format, but it does not. For example, the net profit field in the quarterly report: you design the column as a floating-point number, and then a few rows turn out to contain only '-'. Any of these surprises can cause lost data, exceptions, or bad records. Expecting a fully automatic, one-shot acquisition to succeed is unrealistic, and re-crawling everything after each error wastes a lot of requests and may get you blocked. The best approach is to acquire one layer of data, check and confirm it, and only then continue to the next layer. Step by step, log how far you have gotten, and resume from the place where the exception occurred.
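A cheap way to implement "log how far you have gotten and resume from there" is a tiny checkpoint file. A sketch under that assumption (the file name and helper functions are made up for illustration and are not part of the project code):

import os

CHECKPOINT = "./last_code.txt"

def load_checkpoint(default="000001"):
    # Read the last successfully processed stock code, falling back to a default
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, encoding="utf-8") as f:
            return f.read().strip() or default
    return default

def save_checkpoint(code):
    with open(CHECKPOINT, "w", encoding="utf-8") as f:
        f.write(code)

# usage: start each crawl batch from the last successfully processed code
# for code in Company().PageCode(load_checkpoint(), 1000):
#     process(code)
#     save_checkpoint(code)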
3: When you hit a pit, walk around it. This may not be the most positive attitude, but it is sometimes very useful, because filling every hole costs too much time. When learning something new, you cannot cover it exhaustively; there will be blind spots, or problems you cannot solve right away. At that point, pause for a moment and ask whether it really has to be done this way, or whether there is another route.
Data cleaning
The more professional term for this is probably data cleaning.
A small portion of the captured data has no reference value. To reduce its negative impact, it needs to be filtered out or supplemented, for example (a sketch of such filters follows the list):
Companies listed for only about a year or less
Incomplete quarterly data, or quarterly reports where revenue or profit is just "-"
Stocks that have been suspended from trading for a long time
Only companies headquartered in big cities are kept, in particular excluding companies headquartered in third- and fourth-tier small cities (for such companies, non-operating factors such as management quality, interest disputes and insider trading have a greater impact)
......
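As referenced above, here is a hedged sketch of what such filters could look like as a Python predicate over the stored records. The field names follow the tables built earlier; the thresholds and the city list are purely illustrative, and the long-suspension check is left out because it needs extra data:

BIG_CITIES = {"北京", "上海", "深圳", "广州"}   # illustrative list, not the author's actual criterion

def keep_company(comp, weeks, reports):
    # comp: company info dict; weeks: list of weekly rows; reports: list of quarterly rows
    if len(weeks) < 52:
        return False    # listed (or with data) for less than about one year
    if any(r.get("total_income") in ("", "-") or r.get("total_profit") in ("", "-")
           for r in reports):
        return False    # incomplete revenue/profit figures in the quarterly reports
    if comp.get("location") not in BIG_CITIES:
        return False    # keep only companies headquartered in big cities
    return True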
In the end, about 420,000 rows of weekly price data, 40,000 rows of quarterly report data, and basic information on more than 2,000 listed companies were obtained.
Data analysis, and giving up
I haven't looked into it carefully, but the data above could probably be obtained from a dedicated financial data API. Still, as an example of using Selenium and BeautifulSoup, this exercise is somewhat interesting. The purpose of acquiring the data, however, is to analyze it: my original intention was to use listed companies' past quarterly operating data and weekly rises and falls to predict how their stocks will move in the future.
However, with advanced mathematics knowledge from more than ten years ago that has kept degrading and fading, I find it hard to come up with a model that fits the past and predicts the future. If any reader has relevant experience, please point me in a direction; a concrete example (a blog link is fine) would be even better.
Thanks in advance for your replies in the comments!