Crawling comics with bs4 and writing them to a TXT file

Keywords: Big Data, Python, XML, Windows

Today we are going to crawl the comic information and links from a comic website.

This time we will use bs4, that is, Beautiful Soup.
A quick introduction to Beautiful Soup:
Beautiful Soup is a Python library for extracting data from HTML and XML files. Through your favorite parser it provides the usual ways of navigating, searching and modifying the parse tree, and it can save you hours or even days of work.

There are four types of objects:

  • Tag
  • NavigableString
  • BeautifulSoup
    (the BeautifulSoup object represents the whole parsed document)
  • Comment
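
As a minimal sketch, here is how the four object types show up in practice (the HTML snippet below is made up purely for illustration):

from bs4 import BeautifulSoup

html = '<html><body><a href="/comic/1">One Piece</a><!-- a comment --></body></html>'
soup = BeautifulSoup(html, 'html.parser')   # BeautifulSoup: represents the whole document

tag = soup.a                                # Tag: the first <a> element
print(type(tag).__name__, tag['href'])      # Tag /comic/1

text = tag.string                           # NavigableString: the text inside the tag
print(type(text).__name__, text)            # NavigableString One Piece

comment = soup.body.contents[1]             # Comment: the <!-- a comment --> node
print(type(comment).__name__, comment)      # Comment  a comment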

Here are a couple of ways to get what you need with bs4:

  • soup.a
    Accesses the first tag with that name through the dot attribute
  • soup.find_all('a')
    Returns all <a> tags
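
A quick sketch of both (the snippet with two links is an assumption, not taken from the target site):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/a.shtml">A</a><a href="/b.shtml">B</a></div>', 'html.parser')

print(soup.a)                    # dot attribute: only the first <a> tag
for link in soup.find_all('a'):  # find_all: every <a> tag
    print(link.get('href'))      # /a.shtml then /b.shtml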

It is also worth distinguishing find() from find_all():

  • find()
    Returns only the first matching tag
  • find_all()
    Returns all matching tags, as a list
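
The difference in return types, sketched on another made-up snippet:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title">one</p><p class="title">two</p>', 'html.parser')

first = soup.find('p', attrs={'class': 'title'})      # a single Tag, or None if nothing matches
every = soup.find_all('p', attrs={'class': 'title'})  # a list-like ResultSet of all matches

print(first.get_text())                               # one
print([p.get_text() for p in every])                  # ['one', 'two']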

OK, let's go straight to the code

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = 'https://manhua.dmzj.com/'

def get_page(finallyurl):
    # Pretend to be a normal browser so the site does not reject the request
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
    headers = {'User-Agent': user_agent}
    data = requests.get(finallyurl, headers=headers).content
    return data

def get_manhua(html, page):
    # Write a header line marking which page the following entries belong to
    fo = open("new.txt", 'a', encoding='utf-8')
    fo.write('********************** Page %s **********************\n' % page)
    fo.close()
    soup = BeautifulSoup(html, 'html.parser')
    text = []
    # Each comic on the update page sits in a <div class="boxdiv1"> card
    for paragraph in soup.find_all("div", attrs={"class": "boxdiv1"}):
        p_content = paragraph.get_text()     # all of the card's text
        text.append(p_content)
        print(p_content)
        a = []
        # Collect the card's links; they all open in a new window (target="_blank")
        for link in paragraph.find_all(target="_blank"):
            lianjie = 'https://manhua.dmzj.com/' + link.get('href')
            a.append(lianjie)
            print(lianjie)
        # Use the second collected link as this entry's link, followed by the card text
        end = a[1] + '\n' + p_content + '\n'
        # Write the entry to the raw output file
        fo = open("new.txt", 'a', encoding='utf-8')
        fo.write(end)
        fo.write('\n')
        fo.close()
    # Copy new.txt to result.txt, dropping the empty lines left by get_text()
    f = open('new.txt', encoding='utf-8')
    g = open('result.txt', 'w', encoding='utf-8')
    try:
        for line in f:
            if line.strip():
                g.write(line)
    finally:
        f.close()
        g.close()
    return text

for i in range(1, 4):
    # The update list pages are update_1.shtml, update_2.shtml, ...
    finallyurl = url + 'update_' + str(i) + '.shtml'
    html = get_page(finallyurl)
    text = get_manhua(html, i)
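
Running the script crawls the first three update pages, prints each comic's text and links to the console, appends everything to new.txt, and leaves a copy with the blank lines stripped out in result.txt.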

Result display
