Crawler: scraping web page data into a spreadsheet

Keywords: encoding Excel Windows

Recently, for a personal project, I taught myself how to write crawlers from books and online material. The content of the target website is crawled, filtered, sorted, and stored in Excel format.

First, set up the table. The workbook encoding is set to utf-8, a sheet is added with add_sheet, and head holds the column titles. Once these are defined, sheet.write writes each title into the first row.

book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('ke_qq')
head = ['Organization name','Registration Certificate No.','Unified social credit code','Business competent unit','Registration authority','Types of social organizations','Start-up funds','Scope of business','Legal representative','Telephone','address','Zip code','Registration status','date of establishment','Industry classification']#Header
for h in range(len(head)):
    sheet.write(0,h,head[h])    #Writing table

Pages are fetched with requests and parsed with BeautifulSoup.

response = requests.get(url, headers=headers)
# from_encoding only applies to bytes, so pass response.content rather than response.text
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')

After that, extract the useful fields from the page content, using soup.stripped_strings to drop surrounding whitespace and blank lines.

str1 = []
nice = []
for wz in soup.stripped_strings:
    str1.append(repr(wz))    # assumed: the loop body is missing in the source
k = len(str1)
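
To see what this step produces, here is a minimal sketch on a made-up HTML fragment. The markup and the `collect_strings` helper are hypothetical, not taken from the target site. It assumes each string is stored via repr(), which would explain why the complete code compares against labels wrapped in an extra pair of quotes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one detail page.
SAMPLE_HTML = """
<table>
  <tr><td> Telephone: </td><td>
      12345678
  </td></tr>
  <tr><td> Zip code: </td><td>   </td></tr>
</table>
"""

def collect_strings(html):
    """Return the repr() of every non-blank text node, with whitespace stripped."""
    soup = BeautifulSoup(html, 'html.parser')
    return [repr(text) for text in soup.stripped_strings]

str1 = collect_strings(SAMPLE_HTML)
print(str1)
```

The empty Zip code cell yields no string at all, so its label ends up with no value after it. Gaps like this are what the insert calls in the complete code fill in.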

Finally, the data is rearranged to suit your own needs. Here the list methods insert, pop, and append are used to adjust it.
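
As a rough illustration of that kind of adjustment, the sketch below works on a flat list that should alternate label, value, label, value. The `fill_missing` and `drop_field` helpers, the `LABELS` set, and the sample data are all hypothetical; the complete code checks fixed positions one by one instead.

```python
LABELS = {'Telephone:', 'Address:', 'Zip code:'}  # hypothetical label set

def fill_missing(items, labels=LABELS, placeholder='nothing'):
    """Return a copy in which every label is followed by exactly one value."""
    fixed = list(items)
    pos = 0
    while pos < len(fixed):
        if fixed[pos] in labels:
            has_value = pos + 1 < len(fixed) and fixed[pos + 1] not in labels
            if not has_value:
                # insert at the end of the list behaves like append
                fixed.insert(pos + 1, placeholder)
            pos += 2  # skip the label and its value
        else:
            pos += 1
    return fixed

def drop_field(items, label):
    """Return a copy without the given label and the value that follows it."""
    fixed = list(items)
    if label in fixed:
        idx = fixed.index(label)
        fixed.pop(idx)      # the label
        if idx < len(fixed):
            fixed.pop(idx)  # its value, which has shifted into the same slot
    return fixed

raw = ['Telephone:', 'Address:', '1 Example Road', 'Zip code:']
padded = fill_missing(raw)
print(padded)
print(drop_field(padded, 'Address:'))
```

Padding first and dropping afterwards keeps the label/value pairing intact, so a single pair can be removed with two pop calls at the same index.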

The complete code is as follows:

# coding:utf-8
import requests
from bs4 import BeautifulSoup
import operator as op
import re
import xlwt

user_agent = 'Mozilla/4.0 (compatible;MSIE5.5;windows NT)'
headers = {'User-Agent': user_agent}
book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('ke_qq')
head = ['Organization name','Registration Certificate No.','Unified social credit code','Business competent unit','Registration authority','Types of social organizations','Start-up funds','Scope of business','Legal representative','Telephone','address','Zip code','Registration status','date of establishment','Industry classification']#Header
for h in range(len(head)):
    sheet.write(0,h,head[h])    #Writing table
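num = 1    # next row to write; row 0 holds the header
for one in range(10001,17000):
    # As written, the same ID is requested on every pass; adapt this to the target site
    keyword = 10000000001
    url = '' + str(keyword) + '&websitId=&netTypeId='
    response = requests.get(url, headers=headers)
    # from_encoding only applies to bytes, so pass response.content rather than response.text
    soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')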
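    str1 = []
    nice = []
    for wz in soup.stripped_strings:
        str1.append(repr(wz))    # assumed: the loop body is missing in the source
    k = len(str1)
    if k>5:
        i = 1
        for content in str1:
            if i > 3:
                nice.append(content)    # assumed: body is missing in the source; skips the first three strings
            i = i + 1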
        # Insert a placeholder wherever a label sits in a slot where a value is expected
        if  op.eq(nice[4], '\'Competent business unit:\''):
            nice.insert(4, 'nothing')
        if op.eq(nice[14], '\'Legal representative/Person in charge:\''):
            nice.insert(14, 'nothing')
        if op.eq(nice[13], '\'Activity area:\''):
            # assumed body (missing in the source), taken from the commented-out code that followed
            nice.pop(13)
            nice.pop(13)
        if op.eq(nice[16], '\'Telephone:\''):
            nice.insert(16, 'nothing')
        if op.eq(nice[18], '\'Address:\''):
            nice.insert(18, 'nothing')
        if op.eq(nice[20], '\'Zip code:\''):
            nice.insert(20, 'nothing')
        if len(nice)>22:
            if op.eq(nice[22], '\'Registration status:\''):
                nice.insert(22, 'nothing')
        if len(nice) > 27:
            if op.eq(nice[27], '\'Industry Classification:\'') and len(nice) == 28:
                nice.append('nothing')    # assumed: body is missing in the source
        if op.eq(nice[12], '\'element\''):
            nice[12] = '0'
        # print(nice)
        # Write every second entry (the values) into consecutive columns of row num
        j = 0
        d = 0
        s = 0
        for data in nice:
            if j & 1 == 0:
                s = j - d
                sheet.write(num, s, data)
                d += 1
            j += 1
        num += 1
book.save('ke_qq.xls')    # assumed file name; the source never saves the workbook
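
The write loop at the end is easier to follow once you notice that s always equals j // 2: only the entries at even positions are written, each into the next free column. The sketch below shows the same mapping without a spreadsheet. Both helper names and the sample row are hypothetical.

```python
def columns_for_row(nice):
    """Pair each even-indexed entry of nice with the column it is written to."""
    return list(enumerate(nice[::2]))

def columns_original(nice):
    """The same mapping, computed with the j, d and s counters from the loop above."""
    pairs = []
    j = 0
    d = 0
    for data in nice:
        if j & 1 == 0:
            s = j - d
            pairs.append((s, data))
            d += 1
        j += 1
    return pairs

row = ['ACME', 'Telephone:', '123', 'Zip code:', '000']
print(columns_for_row(row))
print(columns_for_row(row) == columns_original(row))
```

With a helper like this, the body of the write loop could shrink to one sheet.write(num, col, data) call per pair.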

Every site builds its page addresses differently, so the way the keyword is inserted into the URL has to be adapted to the site being crawled.
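
One way to keep that per-site difference in one place is to build the address from a base URL and a dictionary of query parameters. In the sketch below, `BASE_URL`, the `keyword` parameter name, and the `build_url` helper are all hypothetical; only the websitId and netTypeId parameters appear in the address used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://example.com/search'  # hypothetical; the real address is not given

def build_url(keyword, base_url=BASE_URL):
    """Return base_url with the keyword and the two empty parameters appended."""
    params = {'keyword': keyword, 'websitId': '', 'netTypeId': ''}
    return base_url + '?' + urlencode(params)

print(build_url(10000000001))
```

Switching to another site then means changing the base URL and the parameter names, while the crawling loop itself stays the same.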

Posted by rrhody on Sun, 05 Jan 2020 16:38:22 -0800