Python Crawler--Disease Information Crawling

I. Crawling Contents and Websites

1. The target is the disease information on the 99 Health Network (jbk.99.com.cn): the name of each disease, the department to consult, and the physical signs of the disease.
2. The site offers two ways to find a specific disease: searching by department and searching by body part.
(1) By department: department - sub-department - disease
(2) By body part: body part - disease
The body-part path is shorter, so this crawler follows it.
3. Overall approach: crawl the list of body parts -- for each body part, crawl the disease names and disease links -- crawl the details of each disease -- write everything to a spreadsheet and save it.
4. Website characteristics
(1) The body-part index page is jbk.99.com.cn/buwei/
(2) Crawling it yields the list of all body parts: ['head',... ]
(3) The diseases of one body part are listed at jbk.99.com.cn/buwei/toubu-n.html, where toubu is the Pinyin of the body part and n is the page number. Only the Pinyin, the page number and the fixed .html suffix need to be added. All diseases of the part are crawled, and the link prefix that each disease has in the page source is saved as well.
(4) The detail page of a disease is https: + the disease link crawled in the previous step + zhengzhuang.html
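
The two URL rules above can be sketched as small helper functions. This is only an illustration, not the author's code: the names listing_url and symptom_url are made up here, and the sample link //jbk.99.com.cn/example/ is a hypothetical stand-in for a link taken from the page source. It assumes the crawled link starts with //, as the 'https:' prefix used in the source code suggests.

```python
BASE_URL = "https://jbk.99.com.cn/"


def listing_url(part_pinyin, page):
    """URL of page `page` of the disease list for one body part."""
    return BASE_URL + "buwei/" + part_pinyin + "-" + str(page) + ".html"


def symptom_url(disease_link):
    """URL of the symptom page, given the link crawled from a listing page."""
    return "https:" + disease_link + "zhengzhuang.html"


print(listing_url("toubu", 1))
# https://jbk.99.com.cn/buwei/toubu-1.html
print(symptom_url("//jbk.99.com.cn/example/"))  # hypothetical link
# https://jbk.99.com.cn/example/zhengzhuang.html
```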

II. Crawling Process (Source Code)

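import re
import time
import urllib.request

import xlwt  # third-party package that writes .xls workbooks

# hp() is called further down but is not defined in this post. It has to turn
# a Chinese body-part name into the Pinyin used in the URL (for example "toubu").

# Base URL and the fixed page suffix
buwei_url1 = "https://jbk.99.com.cn/"
buwei_url2 = ".html"

# Fetch the body-part index page: https://jbk.99.com.cn/buwei/
buwei_url = buwei_url1 + 'buwei/'
try:
    buwei_urlop = urllib.request.urlopen(buwei_url, timeout=100)
except Exception:
    print("overtime")
    raise  # nothing can be parsed without the index page
data_buwei = buwei_urlop.read().decode('GBK')  # the page is encoded in GBK, not UTF-8

# Extract the body-part names into the list ls_buwei
link_buwei = "<div class=\"part-txt\">\\s*<h3>\\s*<span>\\s*(.*?)</span>"
patten = re.compile(link_buwei)
ls_buwei = patten.findall(data_buwei)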
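
To see what the body-part pattern captures without touching the network, it can be run against a small fragment. The sketch below uses the same regular expression as above. The HTML is hand-written to imitate the structure the pattern expects, so it is an assumption about the page layout and not a copy of the real source.

```python
import re

# Hand-written fragment (hypothetical) with the nesting the pattern expects.
sample_html = '''
<div class="part-txt">
    <h3>
        <span>头部</span>
    </h3>
</div>
<div class="part-txt">
    <h3>
        <span>颈部</span>
    </h3>
</div>
'''

pattern = re.compile(r'<div class="part-txt">\s*<h3>\s*<span>\s*(.*?)</span>')
parts = pattern.findall(sample_html)
print(parts)  # ['头部', '颈部']
```

Because the pattern has a single capture group, findall returns a flat list of strings, which is why ls_buwei can be iterated directly.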
2. For each body part, crawl the disease names and links, then open the page of each disease to crawl its details.
# Create the workbook and define the column headers
book = xlwt.Workbook(encoding='utf-8')
heads = ['Name of disease', 'link', 'Location of onset', 'Consulting Departments', 'Somatotype Characteristics', 'Symptom']

# Crawl the diseases of each body part
for i, bw in enumerate(ls_buwei):
    catch = 1  # paging flag: set to 0 to move on to the next body part
    ii = 0     # current page number
    # All diseases of one body part go into one sheet
    sheet = book.add_sheet(bw, cell_overwrite_ok=True)
    for j, head in enumerate(heads):
        sheet.write(0, j, head)
        
    while catch == 1:
        ii += 1
        # Listing page ii of this body part
        buwei_url_s = buwei_url + hp(bw) + '-' + str(ii) + buwei_url2
        print(buwei_url_s)
        try:
            # Fetch the listing page
            buwei_urlop_s = urllib.request.urlopen(buwei_url_s, timeout=100)
        except Exception:
            print('overtime')
            continue
        try:
            # Decode the page source
            data_buwei = buwei_urlop_s.read().decode('GBK')
        except Exception:
            continue
        # Check whether the pages have run out: the disease list container is then empty
        link_is_msg = "<div class=\"part-cont3\">\\s*(.*?)<div class=\"digg\">"
        patten_is_msg = re.compile(link_is_msg)
        is_msg = patten_is_msg.findall(data_buwei)
        # No more diseases here, so switch to the next body part
        if is_msg == ['']:
            catch = 0
            continue

        # Collect every (link, name) pair on this page into the list ls_des
        link_buwei = "<dd>\\s*<h3>\\s*<span>\\s*</span>\\s*<a href=\"(.*?)\"\\s*title=\".+\" target=\"_blank\">(.*?)</a></h3>"
        patten_buwei = re.compile(link_buwei)
        ls_des = patten_buwei.findall(data_buwei)
        # Crawl the details of each disease in ls_des
        iii = 0
        for x in ls_des:
            t1 = time.perf_counter()
            iii += 1
            row = (ii - 1) * 10 + iii  # assumes 10 diseases per listing page
            print(row)
            # Write the disease name and link to the sheet
            sheet.write(row, 0, x[1])
            sheet.write(row, 1, x[0])
            # Symptom page of the disease
            jibing_url = 'https:' + x[0] + 'zhengzhuang.html'
            print(jibing_url)
            try:
                jibing_urlop = urllib.request.urlopen(jibing_url, timeout=100)
            except Exception:
                print('overtime')
                continue
            try:
                # Decode the page source
                data_jibing = jibing_urlop.read().decode('GBK')
            except Exception:
                continue
            # The label text inside the patterns below has to match the page source literally.
            # Body part affected by the disease
            link_fbbw = "<li><font>Location of onset:(.*?)</a></li>"
            patten_fbbw = re.compile(link_fbbw)
            st_abbw = patten_fbbw.findall(data_jibing)
            if st_abbw == []:
                continue
            hz_fbbw = re.findall('[\u4e00-\u9fa5]+', st_abbw[0])  # keep only the Chinese text
            fbbw = ','.join(hz_fbbw)
            sheet.write(row, 2, fbbw)
            # Department to consult
            link_jzks = "<li><font>Visiting Departments:(.*?)</a></li>"
            patten_jzks = re.compile(link_jzks)
            st_jzks = patten_jzks.findall(data_jibing)
            jzks = ','.join(re.findall('[\u4e00-\u9fa5]+', st_jzks[0])) if st_jzks != [] else 0
            sheet.write(row, 3, jzks)
            # Symptoms and signs
            link_tztz = "<dt><a>Symptoms and signs:</a></dt>\\s*<dd>\\s*<ul>(.*?)</a></li></ul>"
            patten_tztz = re.compile(link_tztz)
            st_tztz = patten_tztz.findall(data_jibing)
            tztz = ','.join(re.findall('[\u4e00-\u9fa5]+', st_tztz[0])) if st_tztz != [] else 0
            sheet.write(row, 4, tztz)
            t2 = time.perf_counter()
            print(t2 - t1)
book.save("jibing.xls")  # Save the workbook
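
The script repeats the same fetch-then-decode block three times. One possible way to factor it out is sketched below. The helper fetch_html and its opener parameter are hypothetical names introduced for this example, not part of the original script. The opener parameter exists only so the helper can be tried without network access.

```python
import http.client
import io
import urllib.request


def fetch_html(url, encoding="GBK", timeout=100, opener=urllib.request.urlopen):
    """Return the decoded page, or None if the request or the decoding fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except (OSError, http.client.HTTPException, UnicodeDecodeError) as exc:
        # URLError and socket timeouts are subclasses of OSError
        print("failed:", url, exc)
        return None


# Offline demonstration with a stand-in opener that returns GBK bytes.
def fake_opener(url, timeout=None):
    return io.BytesIO("头部".encode("gbk"))


print(fetch_html("https://jbk.99.com.cn/buwei/", opener=fake_opener))
```

With such a helper, each call site reduces to one call followed by a check for None before parsing.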

Result presentation

Matters needing attention

1. Encoding conversion
Check the encoding of the original page. If it is UTF-8, the response bytes can be decoded as UTF-8 directly. If not, pass the page's own encoding to decode(), as in decode('GBK') above.
2. Chinese extraction
Chinese characters are matched with the Unicode range \u4e00-\u9fa5.
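
Both notes can be tried on a small sample. This is a minimal sketch: the byte string is built by hand to stand in for a GBK response, and extract_chinese is a name chosen for this example.

```python
import re


def extract_chinese(text):
    """Return every run of characters in the range \\u4e00-\\u9fa5."""
    return re.findall("[\u4e00-\u9fa5]+", text)


# Bytes as a GBK-encoded page would deliver them (hand-made sample).
raw = '<a href="/x/">头部</a>, <a href="/y/">颈部</a>'.encode("gbk")

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("these bytes are not valid UTF-8")

text = raw.decode("gbk")
print(",".join(extract_chinese(text)))  # 头部,颈部
```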

Posted by imstupid on Fri, 16 Aug 2019 02:39:08 -0700
