[Python crawler] Crawling a disease database

Keywords: Big Data Database encoding

Database address: http://web.tfrd.org.tw/genehelp/diseaseDatabase.html?selectedIndex=0

The database looks like this:

This time we mainly crawl the disease names. The difficulty is that the data does not appear in the page source, but the F12 developer tools show which URL the page requests its data from.

 

There you can see the request URL. Opening it shows:

 

Most of the text consists of disease names, so it is easy to crawl.

To start with, the Chinese strings in the response are assumed to be the disease names.

import requests
from bs4 import BeautifulSoup

url= 'http://web.tfrd.org.tw/genehelpDB/GeneHelp/DiseaseDBIndex/'
path= r'C:\Users\Xie Yingchao\Desktop\download\disease.txt'
urls={'0': 'A', '1': 'B', '2': 'C', '3': 'D', '4': 'E', '5': 'F', '6': 'G', '7': 'H',
      '8': 'I', '9':'J', '10': 'K', '11': 'L', '12': 'M', '13': 'N', '14': 'O',
      '15': 'P', '16': 'Q', '17': 'R', '18': 'S', '19': 'T', '20': 'U', '21': 'V', '22': 'W',
      '23': 'X', '24': 'Y', '25': 'Z', '26': 'CD'
}  # Maps an index to the URL suffix of each page; a dictionary is a bit clumsy for this.
def GetText(url):  # Fetch the response body as text
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # network or HTTP error: fall back to an empty string
        return ""
def main(url):
    r= GetText(url)
    soup= BeautifulSoup(r, 'html.parser')
    pstr= list(soup)[0]
    # list(soup)[0] is the first top-level node of the parsed result, which here turns out to be the whole response text. Going through BeautifulSoup for this is a bit clumsy.
    plist=pstr[16:-1].split('}')
    # The slice keeps what sits between '[' and ']' in the response. Each record is wrapped in '{}', so splitting on '}' gives one record per element.
    # Each element of plist therefore holds one disease name (a Chinese string).
    # Turning the parsed result back into a string and then splitting it into a list again is clumsy.
    k=0
    for i in range(len(plist)-1):
        pplist= plist[i].split('"')
        # Each value is wrapped in double quotes, so split each element of plist on '"'.
        # pplist[7] is then the disease name (a Chinese string).
        p= pplist[7].find('[')
        # A few of these strings are not disease names. They contain '[' and can be filtered out that way.
        if p==-1:
            k= k+1  # count only the names that pass the filter
            print(pplist[7])
            Write2txt(pplist[7])
    print('\n Number of diseases:%d'%k)

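def Write2txt(text):  # Append one line to the output file
    # UTF-8 is set explicitly so Chinese names are written regardless of the system default encoding
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')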

for j in range(26):  # This is where the dictionary is used. range(26) covers 'A' to 'Z'; the 'CD' entry (key '26') is not visited.
    url1= url+urls[str(j)]
    print('%s:'%urls[str(j)])
    main(url1)
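
The index-to-letter dictionary can probably be avoided. Below is a minimal sketch, not part of the original script, that builds the per-letter URLs from string.ascii_uppercase. The helper name build_urls and its extra parameter are hypothetical. Whether a 'CD' page should be crawled as well is left open, since the loop above does not visit it.

```python
import string

BASE_URL = 'http://web.tfrd.org.tw/genehelpDB/GeneHelp/DiseaseDBIndex/'

def build_urls(base=BASE_URL, extra=()):
    """Return (suffix, url) pairs for 'A' to 'Z', plus any extra suffixes.

    build_urls and `extra` are hypothetical names used for illustration.
    """
    suffixes = list(string.ascii_uppercase) + list(extra)
    return [(suffix, base + suffix) for suffix in suffixes]

for suffix, page_url in build_urls():
    # In the script above, main(page_url) would be called here.
    print('%s: %s' % (suffix, page_url))
```

Calling build_urls(extra=('CD',)) would append the 'CD' page as a 27th entry.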

That is all of the code. It was written casually and the approach is a bit clumsy, as you can see.
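
If the response is JSON, as the '[' ... ']' and '{}' structure suggests, the json module may be less brittle than splitting on '}' and '"'. The sketch below runs on a made-up sample. The key names "id" and "name", the sample values, and the helper extract_names are all assumptions, and any wrapper around the array in the real response would have to be stripped first.

```python
import json

# Made-up sample shaped like a JSON array of records. The real response
# layout and the key names "id" and "name" are assumptions.
SAMPLE = (
    '[{"id": "0", "name": "[A]"},'
    ' {"id": "1", "name": "Disease one"},'
    ' {"id": "2", "name": "Disease two"}]'
)

def extract_names(payload, key="name"):
    """Return the values stored under `key`, skipping those that contain '['."""
    records = json.loads(payload)
    return [record[key] for record in records if '[' not in record[key]]

names = extract_names(SAMPLE)
for name in names:
    print(name)
print('Number of diseases:%d' % len(names))
```

The '[' filter mirrors the one in the script above, and the count comes from the filtered list, so it covers only the names that are kept.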

Posted by Patrick on Thu, 31 Jan 2019 15:39:15 -0800