Crawler series: storing media files

Keywords: Python, crawler

The previous issue covered: using APIs

This issue of the crawler series explains where and how to store the data after the crawler has collected it.

Although printing results to the command line is fun, it stops being practical once the data grows and needs to be analyzed. And to run most web crawlers remotely, you also need to store the collected data somewhere.

The storage methods introduced in this article apply to most applications. If you are building a website back end or your own API, you will probably want to write the data to a database. If you just need a quick and easy way to collect online documents and save them to your hard disk, you will probably want to write them to files instead.

Storing media files

There are two main ways to store media files: store only the URL of the file, or download the source file itself. Referencing a file through the URL where it already lives has the following advantages (a minimal sketch of the URL-only approach follows the list):

  • The crawler runs faster and consumes less bandwidth, because it only stores links and never downloads the files;
  • You can save a lot of storage space, because you only need to store URL links;
  • The code that stores a URL is easier to write, because no download logic is needed;
  • Not downloading files can reduce the load on the target server.
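
If you decide to store only the links, the scraping code stays very short. Below is a minimal sketch of the URL-only approach, assuming the page can be fetched with plain requests and parsed with BeautifulSoup; the class name SaveImageUrl and the output file image_urls.csv are only illustrative and not part of this series' code:

import csv

import requests
from bs4 import BeautifulSoup


class SaveImageUrl(object):
    def __init__(self):
        self._target_url = 'https://www.pdflibr.com'

    def save_image_url(self):
        # Fetch the page and parse it with BeautifulSoup
        response = requests.get(self._target_url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Collect the src or lazy-loaded data-src attribute of every <img> tag
        image_urls = [img.get('data-src') or img.get('src')
                      for img in soup.find_all('img')]
        # Store only the links, one per row, instead of downloading the files
        with open('image_urls.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            for url in image_urls:
                if url:
                    writer.writerow([url])


if __name__ == '__main__':
    SaveImageUrl().save_image_url()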

Disadvantages of storing only URL links:

  • Embedding these external URL links in your own website or application is called hotlinking, and hotlinking can cause you constant trouble, because almost every website implements anti-hotlinking measures;
  • Because the file lives on someone else's server, your application runs at someone else's pace;
  • Hotlinked content can change at any time. If you embed a hotlinked image on your blog and the other server notices it, the image may be swapped for something else. And if you store the URL for later use, by the time you need it the link may have expired or may point to completely unrelated content;
  • Real browsers do not just request the HTML page; they also download all the resources on it. Downloading the files therefore makes your crawler look more like a human browsing the website, which works in your favor.

If you are still hesitating between storing the files and storing only their URLs, ask yourself whether these files will be used repeatedly, or whether they will just gather dust in the database and never be opened again. If the answer is the latter, store only the URLs. If the answer is the former, read on.

import requests

from utils import connection_util


class SaveData(object):
    def __init__(self):
        self._target_url = 'https://www.pdflibr.com'
        self._init_connection = connection_util.ProcessConnection()

    def save_image(self):
        # Connect to the target website and get the parsed page content
        get_content = self._init_connection.init_connection(self._target_url)
        if get_content:
            # Locate the <img> tag by its alt text and read the lazy-loaded data-src attribute
            imageLocation = get_content.find("img", {"alt": "IP to Location"})["data-src"]
            # Build the absolute URL of the image
            real_path = self._target_url + imageLocation
            # Download the image and write the binary content to disk
            r = requests.get(real_path)
            with open("ip_location.png", 'wb') as f:
                f.write(r.content)


if __name__ == "__main__":
    SaveData().save_image()

This program downloads one picture from the IP query - Crawler identification page and saves it in the folder where the program runs.
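
If the image is large, reading r.content loads the whole response into memory at once. A streamed download writes the file to disk in chunks instead. Below is a minimal sketch of that variant, assuming the same real_path URL that the program above builds; the function name save_image_streamed is only illustrative:

import requests


def save_image_streamed(real_path, filename="ip_location.png"):
    # Stream the response so the file is written in chunks
    # instead of being held in memory all at once
    with requests.get(real_path, stream=True, timeout=10) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)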

This works fine if you only need to download a single file and already know its URL and file type. But most crawlers will not download just one file and stop there. The following program downloads every file referenced by a src attribute on the IP query - Crawler identification page:

import os.path
from urllib.request import urlretrieve
from utils import connection_util


class GetAllSrc(object):
    def __init__(self):
        self._init_download_dir = 'downloaded'
        self._baseUrl = 'https://www.pdflibr.com/ip'
        self._init_connection = connection_util.ProcessConnection()

    def get_absolute_url(self, baseUrl, source):
        # Normalize the raw src value into an absolute https:// URL
        if source.startswith("https://image."):
            url = "https://" + source[14:]
        elif source.startswith("https://"):
            url = source
        elif source.startswith("www."):
            url = "https://" + source[4:]
        else:
            url = source
        # Discard external links: keep only URLs that belong to the target site
        if baseUrl not in url:
            return None
        return url

    def get_download_path(self, baseUrl, absoluteUrl, download_dir):
        # Map the remote URL to a local path inside the download directory
        path = absoluteUrl.replace("www.", "")
        path = path.replace(baseUrl, "")
        path = download_dir + path
        directory = os.path.dirname(path)

        # Create the local directory tree if it does not exist yet
        if not os.path.exists(directory):
            os.makedirs(directory)

        return path

    def download_main(self):
        # Connect to the target page and get the parsed content
        get_content = self._init_connection.init_connection(self._baseUrl)
        if get_content:
            # Select every tag on the page that carries a src attribute
            download_list = get_content.findAll(src=True)
            for download in download_list:
                file_url = self.get_absolute_url(self._baseUrl, download["src"])
                if file_url is not None:
                    print(file_url)
                    # Download the file to its mapped local path
                    urlretrieve(file_url, self.get_download_path(self._baseUrl, file_url, self._init_download_dir))


if __name__ == '__main__':
    GetAllSrc().download_main()

Before running the above code, note the following:

This program downloads every file on the page to your hard disk. That may include bash scripts, .exe files, and possibly even malware.

The program first selects every tag on the page that has a src attribute, then cleans and normalizes the URLs to obtain the absolute address of each file (discarding external links along the way). Finally, each file is downloaded into the downloaded directory in the folder where the program runs.
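
To reduce the risk mentioned above, you can filter by file extension before downloading anything. Below is a minimal sketch under the assumption that you only want image files; the allowlist and the helper name is_allowed are illustrative and not part of the original program:

import os.path
from urllib.parse import urlparse

# Hypothetical allowlist: only download common image formats
ALLOWED_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp'}


def is_allowed(file_url):
    # Compare the extension of the URL path against the allowlist
    extension = os.path.splitext(urlparse(file_url).path)[1].lower()
    return extension in ALLOWED_EXTENSIONS

Inside download_main, changing the condition to if file_url is not None and is_allowed(file_url): would then skip scripts and executables.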

Here, Python's os module is used to work out the target folder of each downloaded file and build the complete path. The os module is Python's interface to the operating system: it can manipulate file paths, create directories, read environment variables, query information about the running process, and perform other system-level operations.
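
As a quick illustration of the calls used above (the path here is made up):

import os

path = 'downloaded/ip/static/crawler.png'  # example local path, not a real file
directory = os.path.dirname(path)          # -> 'downloaded/ip/static'

# Create the directory tree only if it does not already exist
if not os.path.exists(directory):
    os.makedirs(directory)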

The source code for this article is hosted on GitHub: Crawler series: storing media files
