Today we are going to crawl the comic information and links from a comic website.
This time we will use bs4, that is, Beautiful Soup.
Let's introduce **Beautiful Soup** first:
Beautiful Soup is a Python library that can extract data from HTML or XML files. Through your favorite parser, it provides the usual ways of navigating, searching, and modifying a document. Beautiful Soup will save you hours or even days of work.
There are four types of objects (a short example follows the list):

- Tag
- NavigableString
- BeautifulSoup (the BeautifulSoup object represents the whole content of a document)
- Comment
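Here is a minimal sketch showing all four types; the HTML snippet is made up just for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup('<b><!--a comment-->bold text</b>', 'html.parser')

print(isinstance(soup, BeautifulSoup))      # True: the whole document
tag = soup.b                                # the <b> element
print(isinstance(tag, Tag))                 # True
comment, string = tag.contents              # the two children of <b>
print(isinstance(comment, Comment))         # True
print(isinstance(string, NavigableString))  # True: the text "bold text"
```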
Here are some ways to get what you need with bs4 (a short sketch follows the list):

- soup.a
  Gets the first Tag of the given name through attribute access
- soup.find_all('a')
  Gets all a Tags
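A minimal sketch of both access styles, again on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<div><a href="/one">one</a><a href="/two">two</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.a)              # <a href="/one">one</a>, only the first a Tag
print(soup.find_all('a'))  # [<a href="/one">one</a>, <a href="/two">two</a>]
```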
There is also a distinction between find() and find_all(), illustrated below:

- find()
  Returns only the first matching node
- find_all()
  Returns all matching nodes in the form of a list
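The difference in return types looks like this (another made-up snippet):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>first</p><p>second</p>', 'html.parser')

print(soup.find('p'))      # <p>first</p>, a single Tag (None if no match)
print(soup.find_all('p'))  # [<p>first</p>, <p>second</p>], always a list
```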
OK, let's go straight to the code:
```python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = 'https://manhua.dmzj.com/'


def get_page(finallyurl):
    # Fetch the page with a browser User-Agent so the site serves us normally
    user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134')
    headers = {'User-Agent': user_agent}
    data = requests.get(finallyurl, headers=headers).content
    return data


def get_manhua(html, page):
    # Write a separator for this page into the output file
    with open('new.txt', 'a', encoding='utf-8') as fo:
        fo.write('**********************Page %s*************************\n' % page)
    soup = BeautifulSoup(html, 'html.parser')
    article = soup.find('div', attrs={'class': 'newpic_content'})  # update-list container (not used below)
    text = []
    # Every comic entry on the update page is a div with class "boxdiv1"
    for paragraph in soup.find_all('div', attrs={'class': 'boxdiv1'}):
        p_content = paragraph.get_text()  # get all the text of this entry
        text.append(p_content)
        print(p_content)
        a = []
        # Collect the links of this entry that open in a new tab
        for link in paragraph.find_all(target='_blank'):
            lianjie = 'https://manhua.dmzj.com/' + link.get('href')
            a.append(lianjie)
            print(lianjie)
        end = a[1] + '\n' + p_content + '\n'  # a[1] is the entry's second link
        # Write to the document
        with open('new.txt', 'a', encoding='utf-8') as fo:
            fo.write(end)
            fo.write('\n')
    # Delete empty lines: copy new.txt to result.txt, skipping blank lines
    with open('new.txt', encoding='utf-8') as f, open('result.txt', 'w', encoding='utf-8') as g:
        for line in f:
            if line.count('\n') == len(line):  # the line is nothing but a newline
                continue
            g.write(line)
    return text


for i in range(1, 4):
    finallyurl = url + 'update_' + str(i) + '.shtml'
    html = get_page(finallyurl)
    text = get_manhua(html, i)
```
Result display: