Using Python crawler techniques to crawl Baidu Post Bar (Tieba) posts

Keywords: Python, encoding, Windows, Programming

After crawling the jokes section of the Qiushibaike site, I came across an example of crawling Baidu Post Bar posts. To consolidate and improve my knowledge of crawlers, I decided to build one myself.

Goals: 1. Crawl the posts made by the thread's original poster (OP)

2. Display the floor number of each crawled reply and the title of the thread

3. Write the crawled content to a file and display the crawling progress dynamically

Implementation tools: Python's requests library, regular expressions, and the bs4 library

First of all, the thread we crawl is https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1. The see_lz=1 parameter tells Tieba to show only the original poster's posts, so the page source contains only the OP's content, which makes crawling easier; pn is the page number. The thread has five pages, so we can crawl every page with a for loop over pn.
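
For example, the five page URLs can be generated like this (a small sketch; the thread id and page count are the ones used in this post):

base_url = 'https://tieba.baidu.com/p/3138733512'
# see_lz=1 -> only the original poster's posts; pn -> page number
urls = [base_url + '?see_lz=1&pn=' + str(pn) for pn in range(1, 6)]
for u in urls:
    print(u)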

Next, let's lay out the overall crawling plan:

1. Crawl the source code of the page

2. Extract the required content with regular expressions

3. Clean up the extracted content with further regular-expression substitutions until it looks the way we want

4. Write content to a file and display the progress of writing

Below is the implementation of each step.

The first step is to get the page source, which is relatively simple. Most pages can be fetched with code like this:

def getHTMLText(url):
    try:
        # Pretend to be a browser so the page is served normally
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()                 # raise an exception for 4xx/5xx responses
        r.encoding = r.apparent_encoding     # let requests guess the real encoding
        return r.text
    except:
        return ""

The user_agent string can be copied from the request headers your own browser sends (visible in the browser's developer tools). Setting it disguises the crawler as an ordinary browser so the site serves pages normally.
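
By default requests identifies itself with a python-requests/x.y User-Agent, which some sites treat differently, so overriding it is worthwhile. A minimal sketch, assuming any current browser UA string will do (the exact value below is only an illustration):

import requests

url = 'https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1'
headers = {
    # Any real browser User-Agent string works; this one is just an example
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/75.0 Safari/537.36'
}
r = requests.get(url, headers=headers)
print(r.request.headers['User-Agent'])   # confirm the header that was actually sent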

Next, we use regular expressions to extract the "title", "post content", and "floor" information we need.

By analyzing the source code, we find that the title lives in

<title>......</title>

the main content of each post lives in

<div id="post_content_\d*" class="d_post_content j_d_post_content ">......</div>

and the "floor" information lives in

<span class="tail-info">......</span><span class="tail-info">

where "......" marks the content to be extracted. We use two functions to pull it out:

def printTitle(html):
    try:
        # The <title> tag holds the thread title
        soup = BeautifulSoup(html, "html.parser")
        titleTag = soup.find_all('title')
        patten = re.compile(r'<title>(.*?)</title>', re.S)
        title = re.findall(patten, str(titleTag))
        return title
    except:
        return ""

def fillUnivlist(lis, li, html):
    # lis collects the floor labels, li collects the post bodies
    try:
        patten = re.compile(r'<div id="post_content_\d*" class="d_post_content j_d_post_content ">(.*?)</div>', re.S)
        nbaInfo = re.findall(patten, str(html))
        pattenFloor = re.compile(r'<span class="tail-info">(\d*floor)</span><span class="tail-info">', re.S)
        floorText = re.findall(pattenFloor, str(html))
        number = len(nbaInfo)
        for i in range(number):
            Info = textTools.remove(nbaInfo[i])      # clean the post body
            Info1 = textTools.remove(floorText[i])   # clean the floor label
            lis.append(Info1)
            li.append(Info)
    except:
        return ""
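
Incidentally, since bs4 is already imported, the same extraction could also be done without hand-written regular expressions. This is only an alternative sketch, assuming the class names shown above (d_post_content for the post body, tail-info for the floor span) and a hypothetical function name:

from bs4 import BeautifulSoup

def fillUnivlistSoup(lis, li, html):
    # Alternative to fillUnivlist: let BeautifulSoup locate the nodes instead of regex-matching raw HTML
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all('div', class_='d_post_content'):
        li.append(div.get_text('\n', strip=True))    # post body, tags stripped automatically
    for span in soup.find_all('span', class_='tail-info'):
        lis.append(span.get_text(strip=True))        # tail info; would still need filtering to keep only floor labels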

We wrap each function in try/except to keep it robust.
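
A bare except is rather broad, though. A minimal sketch of catching only the errors that requests can actually raise (just a suggestion, not part of the original script):

import requests

def getHTMLText(url):
    try:
        headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts and bad status codes; anything else still surfaces
        return ""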

But the extracted post content contains a lot of superfluous elements:

<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=cb6ab1f8708b4710ce2ffdc4f3ccc3b2/06381f30e924b899d8ca30e16c061d950b7bf671.jpg" pic_ext="jpeg"  pic_type="0" width="339" height="510"><br><br><br><br>50 Surprise New King <a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivS12AkS11WcjnMQsTddE2yXZInIi4k8KEu5449mWp1SxBADVCHPuUFSTGH+WZuV+ecUBG6CY6mAz/Zq1mzxbFxzAG+4Cm4FSU0="  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'Michael Carter-Williams\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">Michael Carter-Williams</a><br>Last season data<br>Rebound 6.2  Assist 6.3  Snatch 1.9 Cap 0.6 Mistake 3.5 Foul 3 scored 16.7<br><br><br>       The 50th place in the new season. I gave the rookie King last season.<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivS12AkS11WcjnMQsTddE2yXZInIi4k8KEu5449mWp1SxBADVCHPuUFSTGH+WZuV+ecUBG6CY6mAz/Zq1mzxbFxzAG+4Cm4FSU0="  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'Michael Carter-Williams\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">Michael Carter-Williams</a>.  Last season McCarvey was rebuilding completely.<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivTbCBRGuF91e6cwvXwi+nOsUCFQWyjKvntqT9uy6c+e1s3eo9XM+kBUaJGaqtq7WOznXcLnooXruQBvuApuBUlN"  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'76 people\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">76 people</a>China quickly mastered the team and won thousands of eyeballs from the start. Later, many experienced performances, the rookie season won three pairs of players are not many, Micahway can now be said to have a firm foothold in 76 people.<br>       As the head of last season's weak team,<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivS12AkS11WcjnMQsTddE2yXZInIi4k8KEu5449mWp1SxBADVCHPuUFSTGH+WZuV+ecUBG6CY6mAz/Zq1mzxbFxzAG+4Cm4FSU0="  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'Michael Carter-Williams\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">Michael Carter-Williams</a>It's a good data, but when we calm down and look at him, we still find that he has a lot of problems. First of all, shooting is just 40.%The hit rate and the bleak 26%The three-point shooting rate is definitely not qualified! Additionally, the body is thin and the speed of each word is generally high and large. The defensive side does not perform as well as the data. As a guard, there are too many mistakes, and there is still a certain gap from the superstar. Whether you fly into the sky or fall quickly depends on your efforts!<br>       After discussing the shortcomings and advantages, as a guard rebound is very prominent, tall body shape can better affect the opponent's starting, but also find their own empty players. Breakthrough although the speed is average, but the rhythm is good, the overall outlook is above the average level. Reminds thin and tall, can not shoot, breakthrough rhythm is good, the overall situation is good! Who did that say a few years ago? 
Right before the broken leg.<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivT5ggWFC92MLwFHpDNBmn4rETPyFf5XUHwripOOA15C4U+GRIwDgEI46b99l0XyUM/jR49NyMTc/6qmUGNB+hoByExmB9N/65I="  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'Livingston\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">Livingston</a>! <br>       As far as team status is concerned,<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivS12AkS11WcjnMQsTddE2yXZInIi4k8KEu5449mWp1SxBADVCHPuUFSTGH+WZuV+ecUBG6CY6mAz/Zq1mzxbFxzAG+4Cm4FSU0="  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'Michael Carter-Williams\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">Michael Carter-Williams</a>Now is the absolute boss, the ball you want to play as you want, the data you want to brush as you want! Last year's Potential Newcomers<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivTKm3O5uii9sKBrDcAE8/xDK4qTjgNeQuFPhkSMA4BCOOm/fZdF8lDP40ePTcjE3P+qplBjQfoaAchMZgfTf+uS"  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'Knoll\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">Knoll</a>yes<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivQdeSiO+EjvouPd1sAEaAOyK4qTjgNeQuFPhkSMA4BCOOm/fZdF8lDP40ePTcjE3P+qplBjQfoaAchMZgfTf+uS"  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'Blue collar\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">Blue collar</a>,Everyone else can retire. Embed is still injured and can't play.<a href="http://jump2.bdimg.com/safecheck/index?url=x+Z5mMbGPAsY/M/Q/im9DR3tEqEFWbC4Yzg89xsWivTbCBRGuF91e6cwvXwi+nOsUCFQWyjKvntqT9uy6c+e1s3eo9XM+kBUaJGaqtq7WOznXcLnooXruQBvuApuBUlN"  class="ps_cb" target="_blank" onclick="$.stats.track(0, \'nlp_ps_word\',{obj_name:\'76 people\'});$.stats.track(\'Pb_content_wordner\',\'ps_callback_statics\')">76 people</a>The team's record depends on you! But by the time Noel matures (if it's not aquatic), Embed recovers (he's technically impossible to get water, he's just sick) and you have a good team of insiders! What results can you bring them to? It's time to test your McCarvey's ability to brush data.'

There is a lot of redundant markup in the extracted text, so we need to clean it up. The basic idea is to match the redundant parts with regular expressions and then replace them with empty strings or newlines using regex substitution.

Here, let's build a class that handles this cleanup.

class Tools:
    removeImg = re.compile('<img.*?>')                     # image tags
    removBr = re.compile('<br>')                           # line breaks
    removeHef = re.compile('<a href.*?>')                  # opening link tags
    removeA = re.compile('</a>')                           # closing link tags
    removeClass = re.compile('<a class.*?>|<aclass.*?>')   # other anchor tags
    removeNull = re.compile(' ')                           # plain spaces

    def remove(self, te):
        te = re.sub(self.removeImg, '', te)
        te = re.sub(self.removBr, '\n', te)                # turn <br> into newlines
        te = re.sub(self.removeHef, '', te)
        te = re.sub(self.removeA, '', te)
        te = re.sub(self.removeClass, '', te)
        te = re.sub(self.removeNull, '', te)
        return te

After the messy text has been processed by this class, we get output like the following:

50 surprise newcomer Wang Mikawi
 Last season data
 Rebound 6.2 assists 6.3 steals 1.9 blocks 0.6 errors 3.5 fouls 3 scores 16.7


In the 50th place of the new season, I gave last season's rookie Wang Mikawi. Last season, McCarvey quickly mastered the team among the 76 completely rebuilt players, winning thousands of eyeballs from the start with three pairs. Later, many experienced performances, the rookie season won three pairs of players are not many, Micahway can now be said to have a firm foothold in 76 people.
As the head of last season's weak team, McCarvey has produced good data, but when we take a look at him, we still find that he has a lot of problems. First of all, shooting just 40% of the weak hit rate and the bleak 26% of the three-point shooting rate is certainly not qualified! Additionally, the body is thin and the speed of each word is generally high and large. The defensive side does not perform as well as the data. As a guard, there are too many mistakes, and there is still a certain gap from the superstar. Whether you fly into the sky or fall quickly depends on your efforts!
After discussing the shortcomings and advantages, as a guard rebound is very prominent, tall body shape can better affect the opponent's starting, but also find their own empty players. Breakthrough although the speed is average, but the rhythm is good, the overall outlook is above the average level. Reminds thin and tall, can not shoot, breakthrough rhythm is good, the overall situation is good! Who did that say a few years ago? Yes, Livingston before the broken leg!
As far as the status of the team is concerned, McCarvey is the absolute elder now. You can play the ball as you want, and you can brush the data as you want. Last year's potential rookie Noel was a blue-collar. Everyone else could retire. Embed was still injured and could not play. How well the 76ers did depends on you! But by the time Noel matures (if it's not aquatic), Embed recovers (he's technically impossible to get water, he's just sick) and you have a good team of insiders! What results can you bring them to? It's time to test your McCarvey's ability to brush data.

This output lets me read the extracted information clearly, so the class does its job. Next we just need to write the extracted information out.

Let's first write functions that save the title and the body content. Since the title only needs to be taken from the first page, it gets its own method.

def writeText(titleText, fpath):
    # Write the thread title at the top of the output file
    try:
        with open(fpath, 'a', encoding='utf-8') as f:
            f.write(str(titleText) + '\n')
            f.write('\n')
    except:
        return ""

def writeUnivlist(lis, li, fpath, num):
    # Write each floor label followed by the corresponding post body
    with open(fpath, 'a', encoding='utf-8') as f:
        for i in range(num):
            f.write(str(lis[i]) + '\n')
            f.write('*' * 50 + '\n')
            f.write(str(li[i]) + '\n')
            f.write('*' * 50 + '\n')

Next, we just need a main function to run everything. We define the output file path and write the thread title first.

    count = 0
    url = 'https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1'
    output_file = 'D:/StockInfo.txt'
    html = getHTMLText(url)
    titleText = printTitle(html)
    writeText(titleText, output_file)

Then a for loop fetches each page, writes its content to the file, and prints the writing progress.

    for i in range(1, 6):
        lis = []
        li = []
        url = 'https://tieba.baidu.com/p/3138733512?see_lz=1&pn=' + str(i)
        html = getHTMLText(url)
        fillUnivlist(lis, li, html)
        writeUnivlist(lis, li, output_file, len(lis))
        count = count + 1
        print("\r Current progress: {:.2f}%".format(count * 100 / 5), end="")

That's all for crawling the Baidu Post Bar thread. Finally, I think it would be even better to encapsulate these functions into a class; a rough sketch of that idea follows.
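
A minimal sketch of such a class, reusing the functions above (the class and method names here are made up for illustration only):

class TiebaCrawler:
    def __init__(self, base_url, output_file, pages):
        self.base_url = base_url          # thread URL with see_lz=1 but without the pn parameter
        self.output_file = output_file
        self.pages = pages

    def get_page(self, pn):
        return getHTMLText(self.base_url + '&pn=' + str(pn))

    def run(self):
        writeText(printTitle(self.get_page(1)), self.output_file)
        for pn in range(1, self.pages + 1):
            lis, li = [], []
            fillUnivlist(lis, li, self.get_page(pn))
            writeUnivlist(lis, li, self.output_file, len(lis))
            print("\r Current progress: {:.2f}%".format(pn * 100 / self.pages), end="")

# crawler = TiebaCrawler('https://tieba.baidu.com/p/3138733512?see_lz=1', 'D:/StockInfo.txt', 5)
# crawler.run()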

Here is the complete code:

import requests
from bs4 import BeautifulSoup
import re


class Tools:
    removeImg = re.compile('<img.*?>')
    removBr = re.compile('<br>')
    removeHef = re.compile('<a href.*?>')
    removeA = re.compile('</a>')
    removeClass = re.compile('<a class.*?>|<aclass.*?>')
    removeNull = re.compile(' ')


    def remove(self,te):
        te = re.sub(self.removeImg,'',te)
        te = re.sub(self.removBr,'\n',te)
        te = re.sub(self.removeHef,'',te)
        te = re.sub(self.removeA,'',te)
        te = re.sub(self.removeClass,'',te)
        te = re.sub(self.removeNull, '', te)
        return te

textTools = Tools()

def getHTMLText(url):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def printTitle(html):
    try:
        soup = BeautifulSoup(html, "html.parser")
        titleTag = soup.find_all('title')
        patten = re.compile(r'<title>(.*?)</title>', re.S)
        title = re.findall(patten, str(titleTag))
        return title
    except:
        return ""

def fillUnivlist(lis,li,html):
    try:
        patten = re.compile(r'<div id="post_content_\d*" class="d_post_content j_d_post_content ">(.*?)</div>', re.S)
        nbaInfo = re.findall(patten, str(html))
        pattenFloor = re.compile(r'<span class="tail-info">(\d*floor)</span><span class="tail-info">', re.S)
        floorText = re.findall(pattenFloor, str(html))
        number = len(nbaInfo)
        for i in range(number):
            Info = textTools.remove(nbaInfo[i])
            Info1 = textTools.remove(floorText[i])
            lis.append(Info1)
            li.append(Info)
    except:
        return ""

def writeText(titleText,fpath):
    try:
        with open(fpath, 'a', encoding='utf-8') as f:
            f.write(str(titleText) + '\n')
            f.write('\n')
    except:
        return ""

def writeUnivlist(lis,li,fpath,num):
    with open(fpath, 'a', encoding='utf-8') as f:
        for i in range(num):
            f.write(str(lis[i])+'\n')
            f.write('*'*50 + '\n')
            f.write(str(li[i]) + '\n')
            f.write('*' * 50 + '\n')

def main():
    count = 0
    url = 'https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1'
    output_file = 'D:/StockInfo.txt'
    html = getHTMLText(url)
    titleText = printTitle(html)
    writeText(titleText, output_file)
    for i in range(1, 6):
        lis = []
        li = []
        url = 'https://tieba.baidu.com/p/3138733512?see_lz=1&pn=' + str(i)
        html = getHTMLText(url)
        fillUnivlist(lis, li, html)
        writeUnivlist(lis, li, output_file, len(lis))
        count = count + 1
        print("\r Current progress: {:.2f}%".format(count * 100 / 5), end="")

if __name__ == '__main__':
    main()

There is still plenty of room for improvement here. I hope you can offer more advice.

From a newbie who loves teaching himself programming

Posted by bmarinho on Tue, 16 Jul 2019 11:10:00 -0700