Database address: http://web.tfrd.org.tw/genehelp/diseaseDatabase.html?selectedIndex=0
The database looks like this:
This time we mainly crawl the disease names. The difficulty is that the data does not appear in the page's source code, but the F12 developer tools show the URL that the page requests its data from.
Find the request URL there and open it to see the raw response:
Most of the text in the response is disease names, so crawling it is easy.
As a first pass, every Chinese string in the response is treated as a disease name!
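To illustrate the idea of treating Chinese strings as disease names, here is a minimal sketch that pulls runs of Chinese characters out of a piece of text with a regular expression. The sample string and its field names are made up for illustration, since the real response is not shown here, and this is not the method the script itself uses (it splits the text on '}' and '"' instead).

```python
import re

# Matches runs of characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def chinese_strings(text):
    """Return every run of consecutive Chinese characters found in text."""
    return CJK_RUN.findall(text)


# Hypothetical response fragment; the field names are assumptions.
sample = '[{"id":"1","name":"軟骨發育不全症"},{"id":"2","name":"[A]"}]'
print(chinese_strings(sample))
```

A name that mixes Chinese with Latin letters or digits would be cut into pieces by this pattern, so it is only a rough first filter.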
The full script is as follows:

import requests
from bs4 import BeautifulSoup

url = 'http://web.tfrd.org.tw/genehelpDB/GeneHelp/DiseaseDBIndex/'
path = r'C:\Users\Xie Yingchao\Desktop\download\disease.txt'
# Map an index to the URL suffix of each index page.
# A dictionary is a bit silly for this, but it works.
urls = {'0': 'A', '1': 'B', '2': 'C', '3': 'D', '4': 'E', '5': 'F', '6': 'G',
        '7': 'H', '8': 'I', '9': 'J', '10': 'K', '11': 'L', '12': 'M', '13': 'N',
        '14': 'O', '15': 'P', '16': 'Q', '17': 'R', '18': 'S', '19': 'T', '20': 'U',
        '21': 'V', '22': 'W', '23': 'X', '24': 'Y', '25': 'Z', '26': 'CD'}


def GetText(url):
    # Fetch the response text; return an empty string if the request fails.
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def main(url):
    r = GetText(url)
    if not r:
        # Nothing was fetched, so there is nothing to parse.
        return
    soup = BeautifulSoup(r, 'html.parser')
    # list(soup)[0] is the first node of the parsed document, which turns out
    # to be the whole response text. A bit silly, but it works.
    pstr = list(soup)[0]
    # Keep the part between '[' and ']'. Every disease is wrapped in '{}',
    # so splitting on '}' gives one element per disease.
    # (Turning a list into a string and then splitting it back into a list is clumsy.)
    plist = pstr[16:-1].split('}')
    k = 0
    for i in range(len(plist) - 1):
        # The disease name is surrounded by double quotes, so split each
        # element on '"'; pplist[7] is the disease name (a Chinese string).
        pplist = plist[i].split('"')
        # A few Chinese strings are not disease names; they contain '['
        # and can be filtered out that way.
        p = pplist[7].find('[')
        if p == -1:
            k = k + 1  # count only the names that are actually written
            print(pplist[7])
            Write2txt(pplist[7])
    print('\n Number of diseases:%d' % k)


def Write2txt(text):
    # Append one name per line to the output file.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')


# This is where the dictionary is used.
# range(26) covers 'A' to 'Z'; the 'CD' entry (key '26') is not visited.
for j in range(26):
    url1 = url + str(urls[str(j)])
    print('%s:' % urls[str(j)])
    main(url1)
That's all the code. It was written casually and the approach is a bit clumsy, as you can see.
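For comparison, here is a rough sketch of a less clumsy variant. It builds the A to Z URLs from string.ascii_uppercase instead of a dictionary, and parses the bracketed part of the response with the json module instead of splitting strings. It assumes that the text between the first '[' and the last ']' is valid JSON, which has not been verified against the real site; the sample payload, its wrapper, and its field names are made up, and letter_urls and extract_records are names introduced just for this sketch.

```python
import json
import string

BASE_URL = 'http://web.tfrd.org.tw/genehelpDB/GeneHelp/DiseaseDBIndex/'


def letter_urls(base=BASE_URL):
    """Build one index URL per letter A to Z, without a hand-written dictionary."""
    return [base + letter for letter in string.ascii_uppercase]


def extract_records(payload):
    """Parse the [...] array embedded in payload into a list of dicts.

    Assumption: the text from the first '[' to the last ']' is valid JSON.
    json.loads raises json.JSONDecodeError if that does not hold.
    """
    start = payload.find('[')
    end = payload.rfind(']')
    if start == -1 or end <= start:
        return []
    return json.loads(payload[start:end + 1])


# Hypothetical payload; the real wrapper and field names may differ.
sample = 'callback({"data":[{"no":"1","name":"Sample disease"}]})'
for record in extract_records(sample):
    print(record['name'])
```

If the assumption holds, each record is an ordinary dict, so the name can be read by key rather than by position in a split string.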