Crawler - crawling web page data into a table

Keywords: encoding, Excel, Windows

Recently, for personal needs, I taught myself web crawling from books and online materials. The target website is http://mzj.beijing.gov.cn; its content is extracted, filtered, and stored in Excel format.

First, the table is set up. The workbook's encoding is defined as utf-8 and a sheet is added; head holds the header fields. Once they are defined, sheet.write writes each header cell into row 0.

import xlwt

book = xlwt.Workbook(encoding='utf-8')    # workbook with UTF-8 encoding
sheet = book.add_sheet('ke_qq')           # add one worksheet
head = ['Organization name', 'Registration Certificate No.', 'Unified social credit code', 'Business competent unit', 'Registration authority', 'Types of social organizations', 'Start-up funds', 'Scope of business', 'Legal representative', 'Telephone', 'Address', 'Zip code', 'Registration status', 'Date of establishment', 'Industry classification']    # header fields
for h in range(len(head)):
    sheet.write(0, h, head[h])            # write the header into row 0
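Note that xlwt only writes to disk when book.save() is called (done at the end of the full script below), and that it produces the legacy .xls format, which caps a worksheet at 65,536 rows.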

Pages are fetched with requests and parsed with BeautifulSoup.

response = requests.get(url, headers=headers)
# Pass raw bytes (response.content) so that from_encoding is honored
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
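One caveat: from_encoding only takes effect when BeautifulSoup is given raw bytes, which is why response.content is passed rather than response.text; with an already-decoded string, the argument is ignored and bs4 emits a warning.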

After that, we extract the useful fields from the page content; soup.stripped_strings yields each text fragment with surrounding whitespace stripped, skipping fragments that are whitespace only. Each fragment is stored via repr(), which wraps it in quotes.

str1 = []
nice = []
for wz in soup.stripped_strings:
    str1.append(repr(wz))    # repr() wraps each fragment in quotes
k = len(str1)
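To make the repr() step concrete, here is a tiny standalone illustration; the HTML snippet is made up:

from bs4 import BeautifulSoup

html = '<div><p> Telephone: </p>\n<p>  </p><p> 010-12345678 </p></div>'   # made-up markup
demo = BeautifulSoup(html, 'html.parser')
print(list(demo.stripped_strings))               # ['Telephone:', '010-12345678']
print([repr(s) for s in demo.stripped_strings])  # ["'Telephone:'", "'010-12345678'"]

This is why the comparisons in the full script below match quoted literals such as "'Telephone:'".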

Finally, the data is tidied up to suit one's own needs; here insert, pop and append are used to realign fields with missing values and to drop unwanted ones, as the toy example below shows.
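As a toy illustration of these list adjustments (the field values here are made up):

# Expected pattern: label, value, label, value, ...
row = ['Telephone:', '010-12345678', 'Address:', 'Zip code:', '100000']

# The address value is missing, so the next label slid into its slot;
# inserting a placeholder realigns every later index.
if row[3] == 'Zip code:':
    row.insert(3, 'nothing')

# Drop a field and its value entirely (as the script does for 'Activity area:')
if 'Activity area:' in row:
    i = row.index('Activity area:')
    row.pop(i)
    row.pop(i)

# Fill a missing trailing value (as the script does for 'Industry Classification:')
if row[-1] == 'Industry Classification:':
    row.append('nothing')

The full script applies the same three moves, only against the repr-wrapped strings and fixed indices of this particular site.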

The complete code is as follows:

# coding:utf-8
import requests
from bs4 import BeautifulSoup
import operator as op
import re
import xlwt

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
num = 1                                   # next row to write in the sheet
book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('ke_qq')
head = ['Organization name', 'Registration Certificate No.', 'Unified social credit code', 'Business competent unit', 'Registration authority', 'Types of social organizations', 'Start-up funds', 'Scope of business', 'Legal representative', 'Telephone', 'Address', 'Zip code', 'Registration status', 'Date of establishment', 'Industry classification']    # header fields
for h in range(len(head)):
    sheet.write(0, h, head[h])            # write the header into row 0
for one in range(10001, 17000):
    keywords = 10000000001 + one          # organization IDs form a consecutive numeric range
    url = ('http://mzj.beijing.gov.cn/wssbweb/wssb/dc/orgInfo.do'
           '?action=seeParticular&orgId=0000' + str(keywords) + '&websitId=&netTypeId=')
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
    str1 = []
    nice = []
    for wz in soup.stripped_strings:
        str1.append(repr(wz))             # repr() wraps each fragment in quotes
    k = len(str1)
    if k > 5:                             # skip pages that returned too little content
        i = 1
        for content in str1:
            if i > 3:                     # keep everything after the first three strings
                nice.append(content)
            i = i + 1
        try:
            # When a value is missing, the following label slides into its
            # slot; inserting a placeholder realigns all later indices.
            if op.eq(nice[4], "'Competent business unit:'"):
                nice.insert(4, 'nothing')
            if op.eq(nice[14], "'Legal representative/Person in charge:'"):
                nice.insert(14, 'nothing')
            if op.eq(nice[13], "'Activity area:'"):
                nice.pop(13)              # drop this field's label...
                nice.pop(13)              # ...and its value
            if op.eq(nice[16], "'Telephone:'"):
                nice.insert(16, 'nothing')
            if op.eq(nice[18], "'Address:'"):
                nice.insert(18, 'nothing')
            if op.eq(nice[20], "'Zip code:'"):
                nice.insert(20, 'nothing')
            if len(nice) > 22:
                if op.eq(nice[22], "'Registration status:'"):
                    nice.insert(22, 'nothing')
            if len(nice) > 27:
                if op.eq(nice[27], "'Industry Classification:'") and len(nice) == 28:
                    nice.append('nothing')    # missing trailing value
            if op.eq(nice[12], "'element'"):
                nice[12] = '0'            # bare currency unit in 'Start-up funds'; store 0
            # nice now alternates value/label; write the values (even indices)
            # into consecutive columns of row num.
            j = 0
            d = 0
            for data in nice:
                if (j & 1) == 0:
                    s = j - d             # column = j minus the labels skipped so far
                    sheet.write(num, s, data)
                    d += 1
                j += 1
            print(num)
            num += 1
        except Exception:
            print('error ' + str(num))

book.save(r'E:\WU\pyfile\shuju\save2\shuju2.xls')    # raw string so backslashes are kept literal

Since every site is different, the keyword portion of the page address may need to be built differently for other targets; here the organization IDs happen to form a consecutive numeric range.
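If the IDs did not form such a range, one alternative (a sketch only; the listing URL and link format here are assumptions, not taken from this site) would be to harvest orgId values from an index page first:

import re
import requests

list_url = 'http://mzj.beijing.gov.cn/wssbweb/wssb/dc/orgList.do'    # hypothetical listing page
resp = requests.get(list_url, headers=headers, timeout=10)           # headers as defined above
org_ids = re.findall(r'orgId=(\d+)', resp.text)                      # pull IDs out of the links
for org_id in org_ids:
    url = ('http://mzj.beijing.gov.cn/wssbweb/wssb/dc/orgInfo.do'
           '?action=seeParticular&orgId=' + org_id + '&websitId=&netTypeId=')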
