Notes on an unusual streaming site implementation and the pitfalls of crawling it with a Python crawler

I found a movie today and wanted to download it.

I started by analyzing the page with the browser's Network tool:

A first look showed that TS-format files are fetched when the video loads. Presumably they are listed in an m3u8 index, which records the hundreds of TS segments so that the player can load the right one when you fast-forward.

But when I actually looked at the m3u8 file, it turned out not to be a valid index file. It appears to just load a form, and the actual handler is somewhere else:

Analyzing the JS at this level was too cumbersome. After several attempts I found the pattern: each video file is named y8TL59oh4680xxx.ts, where xxx is the serial number, so things became much simpler!
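For reference, this is roughly how the segment list would be read from a playlist that is a valid m3u8 index. It is only a minimal sketch: the playlist text and the example.com URL are made up for illustration, and list_segments is a hypothetical helper, not something from the site.

```python
from urllib.parse import urljoin

def list_segments(playlist_text, playlist_url):
    """Return absolute URLs of the media segments named in an m3u8 playlist."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' are tags or comments, not URIs.
        if not line or line.startswith("#"):
            continue
        # Relative segment names are resolved against the playlist URL.
        segments.append(urljoin(playlist_url, line))
    return segments

# Hypothetical playlist content, invented for illustration only.
SAMPLE = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg000.ts
#EXTINF:10.0,
seg001.ts
#EXT-X-ENDLIST
"""

print(list_segments(SAMPLE, "https://example.com/hls/index.m3u8"))
```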
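The naming rule can be written down as a small helper. This is just a sketch: segment_url is a name I made up, the base URL is the one the crawler in this post requests, and the three-digit zero padding is my reading of the xxx in the pattern.

```python
# Base URL and file-name prefix as used by the crawler in this post.
BASE_URL = "https://www.mmicloud.com/20190406/I1RrJf8s/2000kb/hls/"
PREFIX = "y8TL59oh4680"

def segment_url(serial):
    """Build the URL of one segment; the serial is zero-padded to three digits."""
    return f"{BASE_URL}{PREFIX}{serial:03d}.ts"

print(segment_url(0))
print(segment_url(566))
```

The padding matters for small serial numbers: segment 5 has to be requested as ...4680005.ts, not ...46805.ts.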

I adapted a crawler I had written earlier for music files and ended up with this program:

import requests

# Function and parameter names are left over from the music crawler;
# FileName is not used.
def downloadSong(SongID, FileName):
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"}
    # Segment URL = fixed prefix + full serial number (e.g. 4680000) + ".ts"
    r = requests.get("https://www.mmicloud.com/20190406/I1RrJf8s/2000kb/hls/y8TL59oh" + str(SongID) + ".ts", headers=headers)
    #print("State:")
    #print(r)
    # Save the segment in the current directory as "<SongID>.ts"
    filepath = str(SongID) + ".ts"
    with open(filepath, "wb") as file:
        file.write(r.content)
    print(SongID)

for i in range(4680000, 4680900):
    downloadSong(i, str(i))

This program loops through 900 video files whose filenames range from y8TL59oh4680000.ts to y8TL59oh4680899.ts.

The upper bound of the loop was set to 4680900 because I had found that the movie has a little over 860 segments, so I requested a few extra. If the requests past the last segment fail, it doesn't matter.

I started it running. It seemed to work well and downloaded the files smoothly:

So I put down what I was doing and took a rest. After about half an hour, it had downloaded more than 300 files:

Reassured that the crawler was fine, I went off to write some code in VSCode. When I looked at the taskbar again, the crawler was gone!

I started the crawl again, and after a while the same problem happened again! Was the variable i overflowing? I tried debugging by narrowing the range of i:
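The program above writes whatever the server returns, including any error page for a segment that does not exist. A slightly more careful loop could stop at the first missing segment. The following is only a sketch under assumptions: download_range, fetch and save are hypothetical names, and I am assuming the server answers with a non-200 status past the last segment, which I did not verify.

```python
def download_range(fetch, save, start, stop):
    """Download serials start..stop-1, stopping at the first missing segment.

    fetch(serial) -> (status_code, content) and save(serial, content) are
    hypothetical callables supplied by the caller, so the loop itself can be
    tried without touching the network.
    """
    saved = []
    for serial in range(start, stop):
        status, content = fetch(serial)
        if status != 200:
            # Assumption: a non-200 status means we ran past the last segment.
            break
        save(serial, content)
        saved.append(serial)
    return saved

# Fake "server" with three segments, for illustration only.
fake = {0: b"a", 1: b"b", 2: b"c"}
store = {}
result = download_range(
    lambda n: (200, fake[n]) if n in fake else (404, b""),
    store.__setitem__,
    0, 10)
print(result)
```

With requests, fetch would wrap requests.get and return r.status_code and r.content. Note that a temporary network error would also end the loop early, so this trades robustness for not saving junk files.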

import requests

def downloadSong(SongID, FileName):
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"}
    # The "4680" part is now fixed in the URL; SongID is only the three-digit serial
    r = requests.get("https://www.mmicloud.com/20190406/I1RrJf8s/2000kb/hls/y8TL59oh4680" + str(SongID) + ".ts", headers=headers)
    #print("State:")
    #print(r)
    # The file is now saved under the short serial, e.g. "566.ts"
    filepath = str(SongID) + ".ts"
    with open(filepath, "wb") as file:
        file.write(r.content)
    print(SongID)

# Resume from the segment where the previous run stopped
for i in range(566, 900):
    downloadSong(i, str(i))

After debugging, I concluded that the program itself was fine. Apparently, when the console window was minimized, the crawler was reclaimed from memory, which caused the program to exit.

Half a day wasted on this!

So I switched to the Run Module command that comes with the IDLE editor. With a normal window, the process is less likely to be reclaimed:

After a while, the crawler finally finished crawling the files. Looking at the folder, something had gone wrong again:

Inconsistent file names!

Remember how I narrowed the range of the variable i while debugging? That's why! The first run saved files as 4680000.ts, 4680001.ts, and so on, while the debug run saved them as 566.ts, 567.ts, and so on.

Well then: select all the files with long names, right-click, choose Rename, and rename them to a. The files are then automatically named a (1), a (2), a (3), a (4), a (5), ...

Problem solved?

I took these files named a (1), a (2), a (3), a (4), a (5), ... and transcoded and merged them, which took more than an hour of back and forth. After the merge, I discovered that
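In hindsight, the restart did not need a different naming scheme. A sketch of the idea, assuming the files are saved as "<serial>.ts" with the full serial number as in the first version of the crawler (pending_serials is a hypothetical helper):

```python
import os
import tempfile

def pending_serials(directory, start, stop):
    """Return the serials in range(start, stop) that have no saved file yet.

    Assumes files are saved as '<serial>.ts' with the full serial number,
    as in the first version of the crawler.
    """
    existing = set(os.listdir(directory))
    return [n for n in range(start, stop) if f"{n}.ts" not in existing]

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for n in (4680000, 4680001):
        open(os.path.join(d, f"{n}.ts"), "wb").close()
    todo = pending_serials(d, 4680000, 4680004)
print(todo)
```

Looping over the returned list keeps every file name in the same format across restarts. One caveat: a file that was only partly written when the crawler died would be treated as finished, so the last file of an interrupted run is worth deleting first.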

The file order was all mixed up!!!

Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah Ah

No way around it. I couldn't let this go, so I had to keep writing code.

Fortunately, I had kept a folder that had not been renamed, so I wrote a batch renamer in Python:
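I did not track down exactly why the order broke. One common cause with names like a (1), a (2), ... is that a tool sorts them as plain strings, so a (10) lands before a (2). The sketch below only illustrates that effect and a numeric sort key; it is an assumption about the cause, not a diagnosis, and numeric_key is a made-up helper.

```python
import re

def numeric_key(filename):
    """Sort key: the first run of digits in the name, as an integer."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1

names = ["a (1).ts", "a (10).ts", "a (2).ts", "a (100).ts"]
print(sorted(names))                   # plain string order
print(sorted(names, key=numeric_key))  # numeric order
```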

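import os
# The files to rename live in a "data" folder next to this script
PROJECT_DIR_PATH = os.path.dirname(os.path.abspath(__file__))
DIR_PATH = os.path.join(PROJECT_DIR_PATH, 'data')
files = os.listdir(DIR_PATH)
for filename in files:
    # Split "4680566.ts" into name "4680566" and suffix ".ts"
    name, suffix = os.path.splitext(filename)
    # Keep only the three-digit serial number (characters 4 to 6 of the name)
    new_name = os.path.join(DIR_PATH, name[4:7])
    old_name = os.path.join(DIR_PATH, filename)
    os.rename(old_name, new_name)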

I rearranged the folders so that the program above could find the files:

After happily running the program, I found that the renaming had worked, but the file extension was gone.

A mistake! I wrote another script as a remedy:

import os
PROJECT_DIR_PATH = os.path.dirname(os.path.abspath(__file__))
DIR_PATH = os.path.join(PROJECT_DIR_PATH, 'data')
files = os.listdir(DIR_PATH)
for filename in files:
    # The files have no suffix at this point, so just append ".ts"
    new_name = os.path.join(DIR_PATH, filename + ".ts")
    old_name = os.path.join(DIR_PATH, filename)
    os.rename(old_name, new_name)
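The two renaming passes above could also be done in one step by putting the suffix back when building the new name. A minimal sketch, assuming the long names look like 4680566.ts (short_name and rename_all are hypothetical helpers):

```python
import os

def short_name(filename):
    """Map '4680566.ts' to '566.ts'; leave any other name unchanged."""
    name, suffix = os.path.splitext(filename)
    if len(name) == 7 and name.startswith("4680"):
        # Keep the three-digit serial and re-attach the original suffix.
        return name[4:7] + suffix
    return filename

def rename_all(directory):
    """Rename every long-named file in directory to its short form."""
    for filename in os.listdir(directory):
        target = short_name(filename)
        if target != filename:
            os.rename(os.path.join(directory, filename),
                      os.path.join(directory, target))

print(short_name("4680566.ts"))
print(short_name("566.ts"))
```

If a short-named file with the same serial already exists in the folder, os.rename will either fail or overwrite it depending on the platform, so it is safer to run this on a copy.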

After a nerve-racking run, the directory finally looked right:

Then came transcoding and merging again, and another hour or more. Finally, I got to enjoy the fruits of victory:

How difficult!
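For the merge step, segments of an unencrypted stream can usually be joined by simply concatenating the files in serial order. I used a separate tool for transcoding and merging, so the following is only a sketch of that simpler alternative; merge_segments is a hypothetical helper, and whether plain concatenation is enough depends on the stream.

```python
import os
import re
import shutil
import tempfile

def merge_segments(directory, output_path):
    """Concatenate all numbered .ts files in directory, in numeric order."""
    def serial(filename):
        return int(re.search(r"\d+", filename).group())

    # Only files ending in .ts whose name contains a number are merged.
    names = sorted((f for f in os.listdir(directory)
                    if f.endswith(".ts") and re.search(r"\d+", f)),
                   key=serial)
    with open(output_path, "wb") as out:
        for name in names:
            with open(os.path.join(directory, name), "rb") as seg:
                shutil.copyfileobj(seg, out)
    return names

# Demonstration with tiny dummy files; the output goes to a separate folder.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name, data in (("2.ts", b"B"), ("10.ts", b"C"), ("1.ts", b"A")):
        with open(os.path.join(src, name), "wb") as f:
            f.write(data)
    target = os.path.join(dst, "movie.ts")
    order = merge_segments(src, target)
    with open(target, "rb") as f:
        merged = f.read()
print(order, merged)
```

Sorting by the numeric serial rather than by the raw file name avoids the ordering problem described earlier.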

It took me a whole day to download this movie: finding movie sources in the morning and at noon, writing code and crawling the files in the afternoon, and fretting over renaming and transcoding in the evening. That would have been enough time to watch 6-7 movies. ε=(´ο｀*))) Sigh.

And on top of that, I can only watch the movie tomorrow. Ladies and gentlemen, good night!

Posted by soulmedia on Sun, 05 Apr 2020 01:43:38 -0700
