A simple application of Beautiful Soup and aiohttp: crawling resources

Keywords: Python Session

Radio Television Hong Kong (RTHK) produces programs of quite good quality. One of them, "Five Thousand Years of China", tells historical stories through dramatized scenes and narration, from the legendary era to the Republic of China. It ran from 1983 to 2000, a very long-lived program. The copies that can be found online have very poor audio quality, but every episode can be streamed from the program's own website, "Five Thousand Years of China Online". Streaming has its drawbacks, though: the site can only be reached through a proxy, and once playback is interrupted you cannot resume from where you stopped. So I decided to write a crawler in Python to download all the episodes and listen to them at my own pace.

Analyzing the HTML page

Open the element inspector in the browser and look for the link tags of the audio files. The links are all in a tags with the class listen-button. So it is enough to locate those tags, take each tag's text as the file name, and use its href as the download URL.
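As a minimal sketch of that extraction step, the snippet below runs the same selector against a small hand-written HTML sample. The markup, the example.com URLs, and the helper name extract_links are assumptions made for illustration, and the real page may be structured differently. It uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the description above; the real page may differ.
SAMPLE_HTML = """
<html><head><title>Five Thousand Years of China - Xia</title></head>
<body>
  <a class="listen-button" href="http://example.com/audio/001.mp3">Episode 1</a>
  <a class="listen-button" href="http://example.com/audio/002.mp3">Episode 2</a>
  <a class="other" href="http://example.com/about.htm">About</a>
</body></html>
"""

def extract_links(html):
    """Return (file name, download URL) pairs for every a.listen-button tag."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.select("a.listen-button"):
        text = a.get_text(strip=True)
        if text:  # skip buttons without a label
            pairs.append((text, a["href"]))
    return pairs

links = extract_links(SAMPLE_HTML)
for name, url in links:
    print(name, url)
```

The link with class other is ignored because the selector only matches the listen-button class.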

Code implementation

The code is very simple. First, the main structure is as follows:

'''
    Download "Five Thousand Years of China"
'''
from bs4 import BeautifulSoup
import aiohttp
import asyncio
import os

async def main():
    start_page = 1
    while True:
        url = 'http://rthk9.rthk.hk/chiculture/fivethousandyears/subpage{0}.htm'.format(start_page)
        soup = await getUrl(url)      # fetch and parse the page
        if not soup.title: return     # stop at the first page without a title
        title = soup.title.text
        title = title[title.rfind(' ')+1:]   # keep the part after the last space
        listenbutton = soup.select(".listen-button")  # all tags with class listen-button
        # create a folder named after the title
        rootPath = './Five thousand years of China/'
        if not os.path.exists(rootPath + title):
            os.makedirs(rootPath + title)

        for l in listenbutton:
            if l.text != "":
                href = l['href']
                filename = str(title) + '_' + str(l.text)
                if filename.find('Ad') > -1:   # only file names containing 'Ad'
                    await download(filename=filename, url=href, title=title)  # download the audio

        start_page += 1  # next page

# In a single script, this call must come after getUrl and download are defined.
asyncio.run(main())

The asynchronous function (coroutine) getUrl is:

async def getUrl(url):
    async with aiohttp.ClientSession() as session:
        # The site is only reachable through a proxy, so a local proxy is used
        async with session.get(url,proxy='http://127.0.0.1:1080') as resp:
            wb_data = await resp.text()
            soup = BeautifulSoup(wb_data,'lxml')
    return soup

The asynchronous function that downloads the audio:

async def download(url,filename,title):
    file_name = './Five thousand years of China/{0}/{1}'.format(title,filename + '.mp3') 
    async with aiohttp.ClientSession() as session:
        async with session.get(url,proxy='http://127.0.0.1:1080') as resp:
            with open(file_name, 'wb') as fd:
                while True:
                    chunk = await resp.content.read(64 * 1024)  # read up to 64 KiB at a time
                    if not chunk:
                        break
                    fd.write(chunk)

Thanks to asynchronous I/O, a whole page of episodes can be downloaded quickly.
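Note that main awaits each download call in turn, so the files of a page are still fetched one after another. One way to let them overlap is asyncio.gather combined with a semaphore that caps the number of simultaneous transfers. The following is only a sketch under assumptions: run_limited and fake_download are hypothetical names that are not part of the original code, and fake_download merely sleeps in place of the real download coroutine so that the example runs without a network connection.

```python
import asyncio

async def run_limited(coro_funcs, limit=4):
    """Run zero-argument coroutine functions concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(func):
        async with sem:          # wait for a free slot before starting
            return await func()

    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*(guarded(f) for f in coro_funcs))

async def fake_download(name):
    # Stand-in for download(); sleeps instead of fetching a file.
    await asyncio.sleep(0.01)
    return name

def demo():
    names = ("a.mp3", "b.mp3", "c.mp3")
    jobs = [lambda n=n: fake_download(n) for n in names]
    return asyncio.run(run_limited(jobs, limit=2))

if __name__ == "__main__":
    print(demo())
```

In the crawler, the jobs could be built from the listen-button links of one page and passed to run_limited in place of the sequential loop. A small limit keeps the load on the site and the proxy modest.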

Posted by coelex on Tue, 12 Nov 2019 11:38:44 -0800
