Preface
The text and pictures in this article come from the Internet and are for learning and communication only, not for any commercial purpose. The copyright belongs to the original author. If you have any concerns, please contact us promptly so we can handle them.
When we browse the web, the browser renders HTML, JS, CSS, and other resources; through these elements we see the news, pictures, movies, comments, products, and so on that we want to view. For small amounts of content this is fine: a picture, for instance, can simply be downloaded and saved. But when we face a large volume of text and images, we cannot handle them manually. For example, Baidu needs to fetch huge numbers of the latest pages and record them on a regular schedule every day; such massive, recurring work cannot be done by hand. This is where crawlers come in.
Content introduction:
Without further ado, let's start our forum crawler journey.
1. Module import
# encoding:utf8
import requests
from bs4 import BeautifulSoup
Import the requests module for making network requests, and the BeautifulSoup module for parsing and processing web page data.
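To illustrate what BeautifulSoup's find_all does on the forum's post list, here is a minimal stand-in sketch using only the standard library's html.parser (no third-party install needed). The HTML snippet is a made-up fragment shaped like the forum's list page, not taken from the real site.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags inside <td class="td-title faceblue">."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td' and attrs.get('class') == 'td-title faceblue':
            self.in_td = True  # entered a post-title cell
        elif tag == 'a' and self.in_td and 'href' in attrs:
            self.hrefs.append(attrs['href'])  # collect the post link

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False  # left the post-title cell

# Invented example fragment resembling one row of the forum list page
html = '<td class="td-title faceblue"><a href="/post-1.shtml">Post 1</a></td>'
parser = LinkExtractor()
parser.feed(html)
print(parser.hrefs)  # ['/post-1.shtml']
```

With BeautifulSoup, the same extraction is the one-liner `soup.find_all('td', attrs={'class': 'td-title faceblue'})` used later in the article.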
2. Get url resource
def getHtmlText(url):
    # Fetch the page with requests.get() and parse it with BeautifulSoup.
    # Note: this body is reconstructed from how getHtmlText is called in the
    # functions below: it returns parsed soup on success, or "" on failure.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return BeautifulSoup(r.text, 'html.parser')
    except:
        return ""
This function takes a url, fetches the page's content through the requests.get() method, and returns the parsed page. It is the basic routine the other functions use to obtain url resources.
3. Get the list of child posts
def getHtmlList(list, url, main_url):
    try:
        soup = getHtmlText(url)
        managesInfo = soup.find_all('td', attrs={'class': 'td-title faceblue'})
        for m in range(len(managesInfo)):
            a = managesInfo[m].find_all('a')  # Get the location of the post
            for i in a:
                try:
                    href = i.attrs['href']
                    list.append(main_url + href)  # Store the post's url in the list
                except:
                    continue
        print(list)
    except:
        print("Failed to get web page")
This function takes a url, calls the first function to parse the forum's list page, extracts the urls of the sub-posts, and stores them in the list. It collects the links of all sub-posts under the given page, preparing for the data crawling in the next step. The resulting list of sub-posts looks like this:
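The function above builds absolute links with plain string concatenation, `main_url + href`. A slightly more robust alternative (a sketch, not part of the original article) is `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute; the example hrefs below are invented placeholders in the shape of Tianya post links.

```python
from urllib.parse import urljoin

main_url = 'http://bbs.tianya.cn'
hrefs = [
    '/post-develop-123-1.shtml',                 # relative link, as on the list page
    'http://bbs.tianya.cn/post-other-1.shtml',   # already absolute: urljoin leaves it alone
]

# Resolve every href against the site root
full_urls = [urljoin(main_url, h) for h in hrefs]
print(full_urls)
```

Simple concatenation breaks if an href is already a full url or if the base ends with a path segment; urljoin handles both cases by the standard URL-resolution rules.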
4. Parsing page
def getHtmlInfo(list, fpath):
    for i in list:
        infoDict = {}    # Dictionary holding all the information to be collected for one post
        authorInfo = []  # List of author information for the post's comments
        comment = []     # List of the post's comment messages
        try:
            soup = getHtmlText(i)
            if soup == "":  # Skip if the page does not exist and continue with the next one
                continue
            Info = soup.find('span', attrs={'style': 'font-weight:400;'})
            title = Info.text  # Get the title of the post
            infoDict.update({'Forum topics: ': title})  # Store the post title in the dictionary
            author = soup.find_all('div', attrs={'class': 'atl-info'})
            for m in author:
                authorInfo.append(m.text)  # Store the comment authors' information in the list
            author = soup.find_all('div', attrs={'class': 'bbs-content'})
            for m in author:
                comment.append(m.text)  # Store the post's comment content in the list
            for m in range(len(authorInfo)):
                key = authorInfo[m] + '\n'
                value = comment[m] + '\n'
                infoDict[key] = value  # Store author information and comment content as key-value pairs
            # Write the collected information to the designated location
            with open(fpath, 'a', encoding='utf-8') as f:
                for m in infoDict:
                    f.write(str(m) + '\n')
                    f.write(str(infoDict[m]) + '\n')
        except:
            continue
Loop over the urls in the list with a for loop, parse each page, extract the content we want, and store it at the designated location on disk.
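The storage step above appends each dictionary entry to a text file as alternating key and value lines. The following self-contained sketch reproduces just that step with an invented placeholder dictionary and a temporary file, so it can be run without crawling anything.

```python
import os
import tempfile

# Placeholder data in the same shape getHtmlInfo builds:
# the title entry, then author-info/comment pairs as keys and values.
infoDict = {
    'Forum topics: ': 'Example post title',
    'author_1\n': 'first comment\n',
}

# Write to a temporary file instead of a fixed path like E:\tianya.txt
fpath = os.path.join(tempfile.mkdtemp(), 'tianya.txt')
with open(fpath, 'a', encoding='utf-8') as f:
    for key in infoDict:
        f.write(str(key) + '\n')          # author info (or the title label)
        f.write(str(infoDict[key]) + '\n')  # the matching comment (or title)

# Read it back to confirm what landed on disk
with open(fpath, encoding='utf-8') as f:
    text = f.read()
print('Example post title' in text)  # True
```

Opening with mode 'a' means repeated runs keep appending to the same file, which is why the article's crawler accumulates all posts into one text file.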
5. Incoming parameters
def main():
    main_url = 'http://bbs.tianya.cn'
    develop_url = 'http://bbs.tianya.cn/list-1109-1.shtml'
    # develop_url = 'http://bbs.tianya.cn/list-develop-1.shtml'
    ulist = []
    fpath = r'E:\tianya.txt'
    getHtmlList(ulist, develop_url, main_url)
    getHtmlInfo(ulist, fpath)

main()  # Run the main function
Pass in the url of the page to crawl and the path where the data should be saved. This article does not analyze the crawled data any further. The results are as follows, including the content of the main posts together with their replies.