Preface
The text and pictures in this article come from the Internet and are for learning and communication only, not for any commercial purpose. The copyright belongs to the original author. If you have any concerns, please contact us promptly so we can handle them.
When we browse the web, the browser renders HTML, JS, CSS, and other resources; through these elements we see the news, pictures, movies, comments, products, and so on that we want to view. For small amounts of content this is fine: a picture, for instance, can simply be downloaded and saved. But when we face a large volume of text and images, we cannot handle them manually. For example, Baidu needs to fetch huge numbers of the latest pages and record them on a regular schedule every day; such massive, recurring work cannot be done by hand. This is where crawlers come in.
Content introduction:
Without further ado, let's start our forum crawler journey.
1. Module import
# encoding:utf8
import requests
from bs4 import BeautifulSoup
Import the requests module for making network requests, and the BeautifulSoup module for parsing and processing web page data.
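To illustrate what BeautifulSoup's find_all does on the forum's post list, here is a minimal stand-in sketch using only the standard library's html.parser (no third-party install needed). The HTML snippet is a made-up fragment shaped like the forum's list page, not taken from the real site.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags inside <td class="td-title faceblue">."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td' and attrs.get('class') == 'td-title faceblue':
            self.in_td = True  # entered a post-title cell
        elif tag == 'a' and self.in_td and 'href' in attrs:
            self.hrefs.append(attrs['href'])  # collect the post link

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False  # left the post-title cell

# Invented example fragment resembling one row of the forum list page
html = '<td class="td-title faceblue"><a href="/post-1.shtml">Post 1</a></td>'
parser = LinkExtractor()
parser.feed(html)
print(parser.hrefs)  # ['/post-1.shtml']
```

With BeautifulSoup, the same extraction is the one-liner `soup.find_all('td', attrs={'class': 'td-title faceblue'})` used later in the article.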
2. Get url resource
def getHtmlText(url):
    # Fetch the page with requests.get() and parse it with BeautifulSoup.
    # Note: this body is reconstructed from how getHtmlText is called in the
    # functions below: it returns parsed soup on success, or "" on failure.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return BeautifulSoup(r.text, 'html.parser')
    except:
        return ""
This function takes a url, fetches the page's content through the requests.get() method, and returns the parsed page. It is the basic routine the other functions use to obtain url resources.
3. Get the list of child posts
def getHtmlList(list, url, main_url):
    try:
        soup = getHtmlText(url)
        managesInfo = soup.find_all('td', attrs={'class': 'td-title faceblue'})
        for m in range(len(managesInfo)):
            a = managesInfo[m].find_all('a')  # Get the location of the post
            for i in a:
                try:
                    href = i.attrs['href']
                    list.append(main_url + href)  # Store the post's url in the list
                except:
                    continue
        print(list)
    except:
        print("Failed to get web page")
This function takes a url, calls the first function to parse the forum's list page, extracts the urls of the sub-posts, and stores them in the list. It collects the links of all sub-posts under the given page, preparing for the data crawling in the next step. The resulting list of sub-posts looks like this:
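The function above builds absolute links with plain string concatenation, `main_url + href`. A slightly more robust alternative (a sketch, not part of the original article) is `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute; the example hrefs below are invented placeholders in the shape of Tianya post links.

```python
from urllib.parse import urljoin

main_url = 'http://bbs.tianya.cn'
hrefs = [
    '/post-develop-123-1.shtml',                 # relative link, as on the list page
    'http://bbs.tianya.cn/post-other-1.shtml',   # already absolute: urljoin leaves it alone
]

# Resolve every href against the site root
full_urls = [urljoin(main_url, h) for h in hrefs]
print(full_urls)
```

Simple concatenation breaks if an href is already a full url or if the base ends with a path segment; urljoin handles both cases by the standard URL-resolution rules.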
4. Parsing page
def getHtmlInfo(list, fpath):
    for i in list:
        infoDict = {}    # Dictionary holding all the information to be collected for one post
        authorInfo = []  # List of author information for the post's comments
        comment = []     # List of the post's comment messages
        try:
            soup = getHtmlText(i)
            if soup == "":  # Skip if the page does not exist and continue with the next one
                continue
            Info = soup.find('span', attrs={'style': 'font-weight:400;'})
            title = Info.text  # Get the title of the post
            infoDict.update({'Forum topics: ': title})  # Store the post title in the dictionary
            author = soup.find_all('div', attrs={'class': 'atl-info'})
            for m in author:
                authorInfo.append(m.text)  # Store the comment authors' information in the list
            author = soup.find_all('div', attrs={'class': 'bbs-content'})
            for m in author:
                comment.append(m.text)  # Store the post's comment content in the list
            for m in range(len(authorInfo)):
                key = authorInfo[m] + '\n'
                value = comment[m] + '\n'
                infoDict[key] = value  # Store author information and comment content as key-value pairs
            # Write the collected information to the designated location
            with open(fpath, 'a', encoding='utf-8') as f:
                for m in infoDict:
                    f.write(str(m) + '\n')
                    f.write(str(infoDict[m]) + '\n')
        except:
            continue
Loop over the urls in the list with a for loop, parse each page, extract the content we want, and store it at the designated location on disk.
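The storage step above appends each dictionary entry to a text file as alternating key and value lines. The following self-contained sketch reproduces just that step with an invented placeholder dictionary and a temporary file, so it can be run without crawling anything.

```python
import os
import tempfile

# Placeholder data in the same shape getHtmlInfo builds:
# the title entry, then author-info/comment pairs as keys and values.
infoDict = {
    'Forum topics: ': 'Example post title',
    'author_1\n': 'first comment\n',
}

# Write to a temporary file instead of a fixed path like E:\tianya.txt
fpath = os.path.join(tempfile.mkdtemp(), 'tianya.txt')
with open(fpath, 'a', encoding='utf-8') as f:
    for key in infoDict:
        f.write(str(key) + '\n')          # author info (or the title label)
        f.write(str(infoDict[key]) + '\n')  # the matching comment (or title)

# Read it back to confirm what landed on disk
with open(fpath, encoding='utf-8') as f:
    text = f.read()
print('Example post title' in text)  # True
```

Opening with mode 'a' means repeated runs keep appending to the same file, which is why the article's crawler accumulates all posts into one text file.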
5. Incoming parameters
def main():
    main_url = 'http://bbs.tianya.cn'
    develop_url = 'http://bbs.tianya.cn/list-1109-1.shtml'
    # develop_url = 'http://bbs.tianya.cn/list-develop-1.shtml'
    ulist = []
    fpath = r'E:\tianya.txt'
    getHtmlList(ulist, develop_url, main_url)
    getHtmlInfo(ulist, fpath)

main()  # Run the main function
Pass in the url of the page to crawl and the path where the data should be saved. This article does not analyze the crawled data any further. The results are as follows, including the content of the main posts together with their replies.