Python crawler introductory tutorial 51-100: Python 3 crawler downloads ts video through an m3u8 file (Python crawler operation 6)


What is an m3u8 file

An M3U8 file is an M3U file encoded in UTF-8.
An M3U file is a plain-text index file.
When playback software opens it, it does not play the file itself; instead it uses the index to find the network addresses of the audio and video files and plays them online.

The original video data is split into many TS segments, and the address of each segment is recorded in the m3u8 playlist.

For example, I have an m3u8 file here, which reads as follows:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-ALLOW-CACHE:YES
#EXT-X-TARGETDURATION:15
#EXTINF:6.916667,
out000.ts
#EXTINF:10.416667,
out001.ts
#EXTINF:10.416667,
out002.ts
#EXTINF:1.375000,
out003.ts
#EXTINF:1.541667,
out004.ts
#EXTINF:7.666667,
out005.ts
#EXTINF:10.416667,
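
To make the structure of the playlist concrete, here is a minimal sketch that pairs each #EXTINF duration with the segment name on the line after it. The function name parse_segments and the shortened sample text are assumptions made for illustration; they are not part of the original script.

```python
def parse_segments(playlist_text):
    """Return a list of (duration_seconds, uri) pairs from m3u8 text."""
    segments = []
    pending_duration = None
    for raw_line in playlist_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#EXTINF:"):
            # "#EXTINF:6.916667," -> 6.916667 (text after the comma is a title)
            pending_duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif line and not line.startswith("#"):
            # Any non-empty line that is not a tag is a media URI.
            segments.append((pending_duration, line))
            pending_duration = None
    return segments


# Shortened copy of the playlist above, used only as sample input.
sample = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:15
#EXTINF:6.916667,
out000.ts
#EXTINF:10.416667,
out001.ts
"""

segments = parse_segments(sample)
for duration, uri in segments:
    print(uri, duration)
```

A trailing #EXTINF line with no segment name after it, like the last line of the listing above, simply produces no entry.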

How ts files are generally handled

  • Only the m3u8 file is available, so the ts files need to be downloaded
  • The ts files are available but encrypted and cannot be played, so they need to be decrypted
  • The ts files play normally, but there are too many small ones, so they need to be merged

This article deals with cases 1 and 3; the encryption part is skipped.

The m3u8 file shown above is not encrypted, that is, it contains no key tag (#EXT-X-KEY). After the ts files are downloaded, they can be merged directly.
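
Before downloading, it can be worth checking for the key tag in code. The sketch below is one simple way to do that; the function name is_encrypted and the key file name key.bin in the sample are assumptions for illustration.

```python
def is_encrypted(playlist_text):
    """Return True if any #EXT-X-KEY tag declares a method other than NONE."""
    for raw_line in playlist_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#EXT-X-KEY:"):
            attributes = line[len("#EXT-X-KEY:"):]
            if "METHOD=NONE" not in attributes:
                return True
    return False


plain = "#EXTM3U\n#EXTINF:6.916667,\nout000.ts\n"
# Hypothetical encrypted playlist; key.bin is a made-up key file name.
locked = '#EXTM3U\n#EXT-X-KEY:METHOD=AES-128,URI="key.bin"\n#EXTINF:6.9,\nout000.ts\n'

print(is_encrypted(plain))
print(is_encrypted(locked))
```

If this returns True, the merge steps in this article will produce a file that does not play, because decryption is not covered here.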

Getting the ts file paths

Since all ts entries in the m3u8 file above are relative addresses, they have to be resolved against the link obtained in the previous blog post:

{'url': 'https://videos5.jsyunbf.com/2019/02/07/iQX7y3p1dleAhIv7/playlist.m3u8', 'ext': 'dplay', 'msg': 'ok', 'playertype': None}

The part of the url before playlist.m3u8 is the prefix of each ts playback address.
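
As an alternative to joining the prefix and the file name by hand, the standard library's urljoin can resolve a relative segment name against the playlist URL. This is a small sketch of that approach, not the method used in the rest of this article; the example.com address is a placeholder.

```python
from urllib.parse import urljoin

playlist_url = "https://videos5.jsyunbf.com/2019/02/07/iQX7y3p1dleAhIv7/playlist.m3u8"

# A relative name replaces the last path component of the playlist URL.
resolved = urljoin(playlist_url, "out005.ts")
print(resolved)

# An entry that is already absolute is returned unchanged (placeholder host).
absolute = urljoin(playlist_url, "https://example.com/video/out005.ts")
print(absolute)
```

This also covers playlists that mix relative and absolute entries, which plain string concatenation does not.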

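# Example result: https://videos5.jsyunbf.com/2019/02/07/iQX7y3p1dleAhIv7/out005.ts
import datetime
import requests

# m3u8_path is the local path of the downloaded m3u8 file;
# base_url is the prefix address and must end with "/".
def get_ts_urls(m3u8_path, base_url):
    urls = []
    with open(m3u8_path, "r", encoding="utf-8") as file:
        for line in file:
            # Strip whitespace first so that a last line without "\n"
            # and Windows "\r\n" line endings are both handled.
            name = line.strip()
            if name.endswith(".ts"):
                urls.append(base_url + name)

    return urls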

Downloading the ts files

After all the paths have been read, the ts files need to be downloaded. There are many ways to download files; a simple sequential loop is used here.

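import os

def download(ts_urls, download_path):
    os.makedirs(download_path, exist_ok=True)  # create the target folder if needed
    for i, ts_url in enumerate(ts_urls):
        file_name = ts_url.split("/")[-1]
        print("Start downloading %s" % file_name)
        start = datetime.datetime.now().replace(microsecond=0)
        try:
            # verify=False skips TLS certificate checks; drop it if the server's certificate is valid
            response = requests.get(ts_url, stream=True, verify=False, timeout=30)
            response.raise_for_status()  # treat HTTP errors (403, 404, ...) as failures
        except requests.RequestException as e:
            # Stop here: a missing segment would leave a gap in the merged video
            print("Request failed: %s" % e)
            return

        # Zero-padded index so that sorting the file names keeps playlist order
        ts_path = os.path.join(download_path, "{0:05d}.ts".format(i))
        with open(ts_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)

        end = datetime.datetime.now().replace(microsecond=0)
        print("Time elapsed: %s" % (end - start))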

If the output shows each segment being downloaded successfully, the rest is just a matter of network speed.

After downloading, you have a bunch of ts files. Remember, as long as any one of them plays on its own, they can be merged.
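
Because the segments are independent of each other, one way to shorten the total time is to fetch several at once. The following is a sketch using a thread pool from the standard library. The names download_concurrently and fetch_bytes, the fetch parameter and the default of 8 workers are all assumptions for illustration; it uses urllib instead of requests only to stay self-contained.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_bytes(url):
    """Default fetcher: return the response body using the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_concurrently(ts_urls, download_path, fetch=fetch_bytes, workers=8):
    """Download every URL in parallel; return the saved paths in playlist order."""
    os.makedirs(download_path, exist_ok=True)

    def save(indexed_url):
        index, url = indexed_url
        # Zero-padded names keep the files in playlist order when sorted.
        ts_path = os.path.join(download_path, "{0:05d}.ts".format(index))
        with open(ts_path, "wb") as file:
            file.write(fetch(url))
        return ts_path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, whatever order they finish in.
        return list(pool.map(save, enumerate(ts_urls)))
```

It would be called the same way as the sequential version, for example download_concurrently(urls, "./ts_files"). An exception raised for any segment is re-raised when the results are collected, so a failed download does not go unnoticed.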

Merging the ts files

On Windows this can be done with the copy command; look it up if you are not familiar with it.

copy/b D:\newpython\doutu\sao\ts_files\*.ts d:\fnew.ts

Merging in code

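import os

def file_walker(ts_dir):
    file_list = []
    for root, dirs, files in os.walk(ts_dir):  # os.walk is a generator
        for fn in files:
            if fn.endswith(".ts"):
                file_list.append(os.path.join(root, fn))

    # os.walk gives no ordering guarantee, so sort explicitly.
    # Numeric names sort by value (2.ts before 10.ts); others sort by name.
    def sort_key(p):
        stem = os.path.splitext(os.path.basename(p))[0]
        return (0, int(stem), "") if stem.isdigit() else (1, 0, stem)

    file_list.sort(key=sort_key)
    print(file_list)
    return file_list

def combine(ts_path, combine_path, file_name):
    file_list = file_walker(ts_path)
    os.makedirs(combine_path, exist_ok=True)
    # os.path.join supplies the separator that plain "+" left out
    file_path = os.path.join(combine_path, file_name + '.ts')
    with open(file_path, 'wb') as fw:
        for ts_file in file_list:
            with open(ts_file, 'rb') as fr:  # close each segment after reading it
                fw.write(fr.read())

if __name__ == '__main__':
    #urls = get_ts_urls("playlist.m3u8","https://videos5.jsyunbf.com/2019/02/07/iQX7y3p1dleAhIv7/")
    #download(urls,"./ts_files")
    combine("./ts_files","d:/ts","haha")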

After the final merge, you have a single ts file. Of course, you can also use software to convert the video to mp4 format.

FFmpeg can also convert an m3u8 directly to MP4.
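
As a rough sketch of the FFmpeg route, the command can be built and run from Python. This assumes the ffmpeg executable is installed and on the PATH; the function names and the file names playlist.m3u8 and output.mp4 are placeholders. The -c copy option copies the streams without re-encoding.

```python
import subprocess


def build_ffmpeg_command(m3u8_source, output_path):
    """Build an ffmpeg command that copies the streams into an MP4 container."""
    return ["ffmpeg", "-i", m3u8_source, "-c", "copy", output_path]


def convert_to_mp4(m3u8_source, output_path):
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(m3u8_source, output_path), check=True)


# Show the command line without running it.
print(" ".join(build_ffmpeg_command("playlist.m3u8", "output.mp4")))
```

When the source is a local playlist with relative segment names, the ts files have to sit next to it for ffmpeg to find them.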

Enjoy downloading and watching VIP videos.

Remarks

Description of the tags and attributes in an m3u8 file

#EXTM3U
The first line of every M3U file must be this tag. It marks the file as an extended M3U playlist.

#EXT-X-VERSION:3
This tag is optional. It states the protocol version the playlist is compatible with.

#EXT-X-MEDIA-SEQUENCE:140651513
Each media URI in the playlist has a unique sequence number, and the sequence numbers of adjacent media URIs differ by 1. This tag gives the sequence number of the first media URI.
The tag does not have to be included. If it is absent, the first sequence number defaults to 0.

#EXT-X-TARGETDURATION
Specifies the maximum media segment duration in seconds, so the duration given in each #EXTINF must be less than or equal to this maximum. This tag may appear only once in the entire playlist file (with nested playlists, it usually appears only in the m3u8 that contains the real ts urls).
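
The rule above can be checked with a few lines of Python. This is a minimal sketch that compares each #EXTINF value directly with the target, as described here; the function name segments_over_target and the sample playlists are assumptions for illustration.

```python
def segments_over_target(playlist_text):
    """Return the #EXTINF durations that exceed #EXT-X-TARGETDURATION."""
    target = None
    durations = []
    for raw_line in playlist_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#EXT-X-TARGETDURATION:"):
            target = float(line.split(":", 1)[1])
        elif line.startswith("#EXTINF:"):
            durations.append(float(line.split(":", 1)[1].split(",")[0]))
    if target is None:
        raise ValueError("playlist has no #EXT-X-TARGETDURATION tag")
    return [d for d in durations if d > target]


ok = "#EXTM3U\n#EXT-X-TARGETDURATION:15\n#EXTINF:6.916667,\nout000.ts\n#EXTINF:10.416667,\nout001.ts\n"
too_long = "#EXTM3U\n#EXT-X-TARGETDURATION:10\n#EXTINF:6.916667,\nout000.ts\n#EXTINF:10.416667,\nout001.ts\n"

print(segments_over_target(ok))
print(segments_over_target(too_long))
```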

#EXT-X-PLAYLIST-TYPE
Provides information about whether the playlist can change. It applies to the entire playlist file and is optional. The format is as follows: #EXT-X-PLAYLIST-TYPE:<EVENT|VOD>. If it is VOD, the server must not change the playlist file. If it is EVENT, the server must not change or delete any part of the playlist file, but it may append new lines to the file.

#EXTINF
The format is #EXTINF:<duration>,<title>. duration specifies the duration (seconds) of a media segment (ts) and applies only to the URI that follows it. title is an optional description; the url for downloading the resource is on the next line.

#EXT-X-KEY
Describes how to decrypt media segments. It applies to all media URIs until the next #EXT-X-KEY tag appears. The METHOD attribute is NONE or AES-128. With NONE, the URI and IV (Initialization Vector) attributes must not be present. With AES-128 (Advanced Encryption Standard), the URI attribute must be present; IV is optional.
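
To illustrate these rules, the following sketch splits a key tag into its attributes and checks the URI and IV constraints. The function name parse_key_tag, the regular expression and the key file name key.bin are assumptions for illustration; only the NONE and AES-128 methods described above are checked.

```python
import re

# NAME=value pairs; a value is either a quoted string or runs up to the next comma.
ATTRIBUTE_PATTERN = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_key_tag(line):
    """Parse '#EXT-X-KEY:...' into a dict and check the URI/IV rules."""
    attributes = {
        name: value.strip('"')
        for name, value in ATTRIBUTE_PATTERN.findall(line.split(":", 1)[1])
    }
    method = attributes.get("METHOD")
    if method == "NONE" and ("URI" in attributes or "IV" in attributes):
        raise ValueError("METHOD=NONE must not carry URI or IV")
    if method == "AES-128" and "URI" not in attributes:
        raise ValueError("METHOD=AES-128 requires a URI")
    return attributes


# key.bin is a made-up key file name.
key_info = parse_key_tag('#EXT-X-KEY:METHOD=AES-128,URI="key.bin"')
print(key_info)
```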

#EXT-X-PROGRAM-DATE-TIME
Associates an absolute date and time with the first sample of a media segment. It applies only to the next media URI. The format is #EXT-X-PROGRAM-DATE-TIME:<YYYY-MM-DDThh:mm:ssZ>
For example: #EXT-X-PROGRAM-DATE-TIME:2010-02-19T14:54:23.031+08:00
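
A timestamp in this form can be read with the standard library. The sketch below assumes Python 3.7 or later and a value with an explicit offset such as +08:00; a value ending in a bare Z is only accepted by datetime.fromisoformat from Python 3.11 on.

```python
from datetime import datetime

line = "#EXT-X-PROGRAM-DATE-TIME:2010-02-19T14:54:23.031+08:00"

# Everything after the first colon is the ISO 8601 timestamp.
timestamp = datetime.fromisoformat(line.split(":", 1)[1])
print(timestamp.isoformat())
print(timestamp.utcoffset())
```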

#EXT-X-ALLOW-CACHE
Indicates whether caching is allowed. It can appear anywhere in the playlist file, but at most once, and it applies to all media segments. The format is as follows: #EXT-X-ALLOW-CACHE:<YES|NO>

#EXT-X-ENDLIST
Marks the end of the playlist. It can appear anywhere in the playlist, but only once.
The format is as follows: #EXT-X-ENDLIST
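
For a crawler this tag is a useful signal: a playlist without it may still be growing, so the segment list read at one moment may be incomplete. A minimal check could look like this; the function name is_complete is an assumption for illustration.

```python
def is_complete(playlist_text):
    """Return True if the playlist contains the #EXT-X-ENDLIST tag."""
    return any(
        line.strip() == "#EXT-X-ENDLIST" for line in playlist_text.splitlines()
    )


finished = "#EXTM3U\n#EXTINF:6.916667,\nout000.ts\n#EXT-X-ENDLIST\n"
still_open = "#EXTM3U\n#EXTINF:6.916667,\nout000.ts\n"

print(is_complete(finished))
print(is_complete(still_open))
```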

Posted by psunshine on Mon, 18 Mar 2019 00:45:27 -0700