Recently, for personal needs, I taught myself web crawling from books and online materials. The target website is http://mzj.beijing.gov.cn; the scraped content is sorted, filtered, and stored in Excel format.
First, the output table is set up. The encoding is defined as utf-8 and a sheet is added to the workbook; head holds the column titles. After these definitions, sheet.write is used to write the header row.
book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('ke_qq')
head = ['Organization name','Registration Certificate No.','Unified social credit code','Business competent unit','Registration authority','Types of social organizations','Start-up funds','Scope of business','Legal representative','Telephone','address','Zip code','Registration status','date of establishment','Industry classification']  # Header
for h in range(len(head)):
    sheet.write(0, h, head[h])  # Write the header row
The web pages are fetched with requests and parsed with BeautifulSoup.
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser', from_encoding='utf-8')
After that, the useful fields are extracted from the page content; soup.stripped_strings is used to drop whitespace and blank lines.
str1 = []
nice = []
for wz in soup.stripped_strings:
    str1.append(repr(wz))
k = len(str1)
Finally, the data is cleaned up according to each person's own needs. Here, insert, pop and append are used to adjust the extracted list so that every record ends up with the same layout before it is written to the sheet.
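As a minimal sketch of this adjustment step (the labels, values and indices below are illustrative assumptions, not the exact strings produced by the real page), a missing value can be padded with a placeholder so the label/value pairs stay aligned:

# Minimal sketch of the adjustment step; labels and positions are
# illustrative placeholders, not the real page output.
fields = ["'Organization name:'", "'Example Association'",
          "'Competent business unit:'",          # value missing for this label
          "'Telephone:'", "'010-12345678'"]

# If a label is immediately followed by another label, the value is missing:
# insert a placeholder so labels stay at even indices and values at odd ones.
i = 0
while i < len(fields) - 1:
    if fields[i].endswith(":'") and fields[i + 1].endswith(":'"):
        fields.insert(i + 1, 'nothing')
    i += 2

print(fields)
# ["'Organization name:'", "'Example Association'",
#  "'Competent business unit:'", 'nothing',
#  "'Telephone:'", "'010-12345678'"]

The real script below does the same thing with hard-coded index checks, because the page layout for each organization is fixed.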
The complete code is as follows:
# coding:utf-8
import requests
from bs4 import BeautifulSoup
import operator as op
import re
import xlwt

user_agent = 'Mozilla/4.0 (compatible;MSIE5.5;windows NT)'
headers = {'User-Agent': user_agent}
num = 1

book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('ke_qq')
head = ['Organization name','Registration Certificate No.','Unified social credit code','Business competent unit','Registration authority','Types of social organizations','Start-up funds','Scope of business','Legal representative','Telephone','address','Zip code','Registration status','date of establishment','Industry classification']  # Header
for h in range(len(head)):
    sheet.write(0, h, head[h])  # Write the header row

for one in range(10001, 17000):
    keyword = 10000000001
    keywords = keyword + one
    url = 'http://mzj.beijing.gov.cn/wssbweb/wssb/dc/orgInfo.do?action=seeParticular&orgId=0000' + str(keywords) + '&websitId=&netTypeId='
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser', from_encoding='utf-8')
    str1 = []
    nice = []
    for wz in soup.stripped_strings:
        str1.append(repr(wz))
    k = len(str1)
    if k > 5:
        i = 1
        for content in str1:
            if i > 3:  # skip the first three extracted strings
                nice.append(content)
            i = i + 1
        try:
            # Normalize the field list: pad missing values with 'nothing'
            # and drop the fields that break the fixed layout
            if op.eq(nice[4], '\'Competent business unit:\''):
                nice.insert(4, 'nothing')
            if op.eq(nice[14], '\'Legal representative/Person in charge:\''):
                nice.insert(14, 'nothing')
            if op.eq(nice[13], '\'Activity area:\''):
                nice.pop(13)
                nice.pop(13)
            if op.eq(nice[16], '\'Telephone:\''):
                nice.insert(16, 'nothing')
            if op.eq(nice[18], '\'Address:\''):
                nice.insert(18, 'nothing')
            if op.eq(nice[20], '\'Zip code:\''):
                nice.insert(20, 'nothing')
            if len(nice) > 22:
                if op.eq(nice[22], '\'Registration status:\''):
                    nice.insert(22, 'nothing')
            if len(nice) > 27:
                if op.eq(nice[27], '\'Industry Classification:\'') and len(nice) == 28:
                    nice.append('nothing')
            if op.eq(nice[12], '\'element\''):  # a bare currency unit with no amount is recorded as '0'
                nice[12] = '0'
            # Write every other entry of nice (even positions) into consecutive columns of the current row
            j = 0
            d = 0
            s = 0
            for data in nice:
                if j & 1 == 0:
                    s = j - d
                    sheet.write(num, s, data)
                    d += 1
                j += 1
            print(num)
            num += 1
        except:
            print('error', num)
book.save(r'E:\WU\pyfile\shuju\save2\shuju2.xls')
Because the pages being crawled differ from site to site, the keyword embedded in the page address may need to be constructed in a different way for other targets.
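For this site the orgId is just a fixed base number plus a running offset, so the detail-page URLs can be generated in a loop. Below is a minimal sketch of that construction together with a list-based variant one might use when the IDs cannot be enumerated (example.com and the id_list values are made-up placeholders, and only the first few offsets are shown):

# URL pattern used in this script: a numeric orgId built from a base plus an offset
base = 10000000001
for one in range(10001, 10004):  # first few offsets only, for illustration
    org_id = base + one
    url = ('http://mzj.beijing.gov.cn/wssbweb/wssb/dc/orgInfo.do?'
           'action=seeParticular&orgId=0000' + str(org_id) + '&websitId=&netTypeId=')
    print(url)
    # response = requests.get(url, headers=headers) ...

# For a different site, the keyword might instead come from a prepared list of IDs
# (these values are made-up placeholders, not real IDs)
id_list = ['0000123', '0000456']
urls = ['http://example.com/detail?id=' + i for i in id_list]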