How to read large files in PHP


As PHP developers, we don't often need to worry about memory management. The PHP engine does a stellar job of cleaning up after us, and the web server model of short-lived execution contexts means even the sloppiest code has no long-lasting effects.

 

In rare cases, we may need to step outside that comfortable boundary - for example, when trying to run Composer for a large project on the smallest VPS we can create, or when we need to read large files on an equally small server.

 

It's the latter problem we'll look at in this tutorial.

The code for this tutorial can be found on GitHub.

Measure success

The only way to know whether our code improvements are effective is to measure a bad situation, and then compare that measurement after we've applied our fix. In other words, unless we know how much (if at all) a "solution" helps us, we can't know whether it really is a solution.

There are two metrics we can care about. The first is CPU usage: how fast or slow is the process we want to run? The second is memory usage: how much memory does the script take to execute? These are often inversely proportional - meaning we can reduce memory usage at the cost of CPU usage, and vice versa.

In an asynchronous execution model (like multi-process or multi-threaded PHP applications), both CPU and memory usage are important considerations. In traditional PHP architecture, they generally only become a problem when either one reaches the limits of the server.

It's impractical to measure CPU usage inside PHP. If that's the area you want to focus on, consider using something like top on Ubuntu or macOS. On Windows, consider using the Linux Subsystem, so you can use top from Ubuntu.

In this tutorial we're going to measure memory usage. We'll look at how much memory "traditional" scripts use, then implement a couple of optimization strategies and measure those too. By the end, I'd like you to be able to make an educated choice.

Here's how we look at memory usage:

 

// from memory.php

// the formatBytes method is taken from the php.net documentation

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

print formatBytes(memory_get_peak_usage());

  

We'll require this script at the end of each example, so we can see which script uses the most memory at one time.

What are our options?

There are many approaches to reading files efficiently, but two likely scenarios in which we'd use them. We might want to read and process the data all at the same time, outputting the processed data or performing other actions based on what we read. We might also want to transform a stream of data without ever really needing access to the data itself.

Imagine, for the first scenario, that we want to read a file and hand off every 10,000 lines to a separate queue for processing. We'd need to keep at least 10,000 lines in memory, and then pass them to the queue manager, whichever one we use.
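
To make that concrete, here's a rough sketch of the idea - not code from this tutorial's repository. The $queue object and its push method are stand-ins for whichever queue manager you happen to use:

function dispatchInChunks($path, $queue, $chunkSize = 10000) {
    $handle = fopen($path, "r");
    $lines = [];

    while (!feof($handle)) {
        $lines[] = trim(fgets($handle));

        // hand over a full chunk, then forget it so memory stays bounded
        if (count($lines) === $chunkSize) {
            $queue->push($lines);
            $lines = [];
        }
    }

    if (count($lines)) {
        $queue->push($lines); // whatever is left over
    }

    fclose($handle);
}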

For the second scenario, imagine we want to compress the contents of a particularly large API response. We don't care what it says, but we need to make sure it's backed up in a compressed form.

In both scenarios we need to read large files. In the first, we need to know what the data is; in the second, we don't care what the data is. Let's explore both approaches in depth.

Reading files line by line

PHP has many functions for working with files. Let's combine a few of them into a naive file reader:


// from reading-files-line-by-line-1.php
function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}

readTheFile("shakespeare.txt");

require "memory.php";

  

We're reading a text file containing the complete works of Shakespeare. The file is about 5.5 MB in size, and peak memory usage is 12.8 MB. Now let's use a generator to read each line:

// from reading-files-line-by-line-2.php

function readTheFile($path) {
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        yield trim(fgets($handle));
    }

    fclose($handle);
}

readTheFile("shakespeare.txt");

require "memory.php";

  

The text file is the same size, but peak memory usage is 393 KB. This doesn't mean much until we do something with the data we're reading. Let's split the document into chunks whenever we see two blank lines:

// from reading-files-line-by-line-3.php

$iterator = readTheFile("shakespeare.txt");

$buffer = "";

foreach ($iterator as $iteration) {
    // three newlines in a row means we've just passed two blank
    // lines: treat that as the end of a chunk
    preg_match("/\n{3}/", $buffer, $matches);

    if (count($matches)) {
        print ".";
        $buffer = "";
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}

require "memory.php";

  

Want to guess how much memory we use this time? Even though we split the text document into 1,216 chunks, we still only use 459 KB of memory. Given the nature of generators, the most memory we'll ever use is what we need to store the largest text chunk in an iteration. In this case, the largest chunk is 101,985 characters.

Generators have other uses, but this one demonstrates clearly how well they perform when reading large files. If we need to work on the data, generators are probably the best way to do it.

Piping between files

In situations where we don't need to operate on the data, we can transfer file data from one file to another. This is commonly called piping (presumably because we don't see what's inside a pipe except at each end, as long as it's opaque, of course). We can achieve this by using streams. First, let's write a script that transfers one file to another, so that we can measure the memory usage:

// from piping-files-1.php

file_put_contents(
    "piping-files-1.txt", file_get_contents("shakespeare.txt")
);

require "memory.php";

  

The result is unsurprising: this script uses more memory to run than the text file it copies. That's because it has to read (and keep) the entire file's contents in memory until it has written it to the new file. For small files this may be fine; when we start to use bigger files, not so much.

 

Let's try to stream (or pipe) from one file to another:

// from piping-files-2.php

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2.txt", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

  

This code is slightly strange. We open handles to both files, the first in read mode and the second in write mode. Then we copy from the first into the second, and finish by closing both files again. It may surprise you to know that the memory used is 393 KB. That number looks familiar - isn't it what the line-by-line generator code used? That's because the second argument to fgets specifies how many bytes of each line to read (it defaults to -1, or until it reaches a new line), and the third argument to stream_copy_to_stream is a similar kind of parameter (with a similar default). stream_copy_to_stream reads from one stream, one chunk at a time, and writes it to the other. Since we never need to work with the value, it skips the part where the generator would yield one.
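
If it helps to picture what that looks like under the hood, here's a hand-rolled (and rougher) equivalent - our own sketch, with an assumed 8192-byte buffer size:

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2.txt", "w");

// read a small buffer at a time and write it straight back out,
// so only one buffer's worth of data is ever held in memory
while (!feof($handle1)) {
    fwrite($handle2, fread($handle1, 8192));
}

fclose($handle1);
fclose($handle2);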

Piping text alone isn't especially practical, so let's consider another example. Suppose we want to output an image from our CDN. We could start with code like this:

// from piping-files-3.php

file_put_contents(
    "piping-files-3.jpeg", file_get_contents(
        "https://github.com/assertchris/uploads/raw/master/rick.jpg"
    )
);

// ...or write this straight to stdout, if we don't need the memory info

require "memory.php";

  

Imagine an application route brought us to this code. This time, instead of serving a file from the local file system, we want to fetch it from a CDN. We could substitute file_get_contents for something more elegant (like Guzzle), but under the hood the effect is much the same.

Memory usage comes in at around 581 KB. Now, how about we try streaming this instead?

// from piping-files-4.php

$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);

$handle2 = fopen(
    "piping-files-4.jpeg", "w"
);

// ...or write this straight to stdout, if we don't need the memory info

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

  

Memory usage is slightly less (400 KB), but the result is the same. If we don't need memory information, we can also print to standard output. PHP provides a simple way to do this:

$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);

$handle2 = fopen(
    "php://stdout", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

// require "memory.php";

  

Other streams

There are a few other streams we can pipe to and/or read from and/or write to (a short example follows this list):

  • php://stdin (read-only)
  • php://stderr (write-only, like php://stdout)
  • php://input (read-only), which gives us access to the raw request body
  • php://output (write-only), which lets us write to an output buffer
  • php://memory and php://temp (read-write), which are places we can store data temporarily. The difference is that php://temp will store its data in the file system once it grows large enough, while php://memory will keep storing in memory until that runs out.
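
As a quick illustration of that last pair - a sketch of ours, not from the tutorial's repository - php://temp behaves like a file handle, but only spills over to the file system once the buffer grows large (around 2 MB by default):

$buffer = fopen("php://temp", "w+");

// copy a large source into the temporary stream...
$source = fopen("shakespeare.txt", "r");
stream_copy_to_stream($source, $buffer);
fclose($source);

// ...then rewind and pipe it on to standard output
rewind($buffer);
$stdout = fopen("php://stdout", "w");
stream_copy_to_stream($buffer, $stdout);

fclose($stdout);
fclose($buffer);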

Filters

There's another trick we can use with streams, called filters. They're a kind of in-between step, providing a tiny bit of control over the stream data without exposing it to us. Suppose we wanted to compress shakespeare.txt. We might use the Zip extension:

// from filters-1.php

$zip = new ZipArchive();
$filename = "filters-1.zip";

$zip->open($filename, ZipArchive::CREATE);
$zip->addFromString("shakespeare.txt", file_get_contents("shakespeare.txt"));
$zip->close();

require "memory.php";

  

This is a neat bit of code, but it clocks in at around 10.75 MB of memory. We can do better, with filters:

// from filters-2.php

$handle1 = fopen(
    "php://filter/zlib.deflate/resource=shakespeare.txt", "r"
);

$handle2 = fopen(
    "filters-2.deflated", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

  

Here we can see the php://filter/zlib.deflate filter, which reads and compresses the contents of a resource. We can then pipe the compressed data into another file. This only uses 896 KB of memory.

I know this isn't the same format, and there are upsides to making a zip archive. You have to wonder, though: if you could choose a different format and save 12 times the memory, wouldn't you?

To decompress the data again, we can run the deflated data back through another zlib filter:

// from filters-2.php

file_get_contents(
    "php://filter/zlib.inflate/resource=filters-2.deflated"
);
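
That call buffers the whole document back into memory, though. If we'd rather keep memory flat on the way out too, we can pipe the inflated stream straight into a new file - a sketch of ours, with filters-2.inflated.txt as an arbitrary name:

$handle1 = fopen(
    "php://filter/zlib.inflate/resource=filters-2.deflated", "r"
);

$handle2 = fopen(
    "filters-2.inflated.txt", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);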

  

Customizing streams

fopen and file_get_contents have their own sets of default options, but these are fully customizable. To define them, we need to create a new stream context:

// from creating-contexts-1.php

$data = join("&", [
    "twitter=assertchris",
]);

$headers = join("\r\n", [
    "Content-type: application/x-www-form-urlencoded",
    "Content-length: " . strlen($data),
]);

$options = [
    "http" => [
        "method" => "POST",
        "header"=> $headers,
        "content" => $data,
    ],
];

$context = stream_context_create($options);

$handle = fopen("https://example.com/register", "r", false, $context);
$response = stream_get_contents($handle);

fclose($handle);

  

In this example, we're trying to make a POST request to an API. The API endpoint is secure, but we still use the http context property (which is used for both http and https). We set a few headers and open a file handle to the API. We can open the handle as read-only; the context takes care of the writing.

There are many things we can customize, so it's best to check out the documentation on stream contexts if you want to know more.
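
As a small taste of that (our own sketch, using documented http context options), here's how we might set a timeout and a user agent on a plain GET request:

$context = stream_context_create([
    "http" => [
        "method" => "GET",
        "timeout" => 10.0, // give up after ten seconds
        "user_agent" => "my-app/1.0",
    ],
]);

print file_get_contents("https://example.com", false, $context);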

Creating custom protocols and filters

Before we wrap up, let's talk about creating custom protocols. If you look at the stream wrapper documentation, you can find an example class to implement:

Protocol {
    public resource $context;
    public __construct ( void )
    public __destruct ( void )
    public bool dir_closedir ( void )
    public bool dir_opendir ( string $path , int $options )
    public string dir_readdir ( void )
    public bool dir_rewinddir ( void )
    public bool mkdir ( string $path , int $mode , int $options )
    public bool rename ( string $path_from , string $path_to )
    public bool rmdir ( string $path , int $options )
    public resource stream_cast ( int $cast_as )
    public void stream_close ( void )
    public bool stream_eof ( void )
    public bool stream_flush ( void )
    public bool stream_lock ( int $operation )
    public bool stream_metadata ( string $path , int $option , mixed $value )
    public bool stream_open ( string $path , string $mode , int $options ,
        string &$opened_path )
    public string stream_read ( int $count )
    public bool stream_seek ( int $offset , int $whence = SEEK_SET )
    public bool stream_set_option ( int $option , int $arg1 , int $arg2 )
    public array stream_stat ( void )
    public int stream_tell ( void )
    public bool stream_truncate ( int $new_size )
    public int stream_write ( string $data )
    public bool unlink ( string $path )
    public array url_stat ( string $path , int $flags )
}

  

We're not going to implement one of these here, because I think it deserves a tutorial of its own - there's a lot of work to do. But once that work is done, we can register our stream wrapper quite easily:

if (in_array("highlight-names", stream_get_wrappers())) {
    stream_wrapper_unregister("highlight-names");
}

stream_wrapper_register("highlight-names", "HighlightNamesProtocol");

$highlighted = file_get_contents("highlight-names://story.txt");
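
To make that registration concrete, here's a bare-bones sketch of what a HighlightNamesProtocol class might look like - ours, not the article's, covering only the handful of methods needed for reading, with a naive same-length replacement (a length-changing replacement could exceed the number of bytes stream_read was asked for):

class HighlightNamesProtocol {
    public $context;
    private $handle;

    public function stream_open($path, $mode, $options, &$opened_path) {
        // map highlight-names://story.txt onto the real story.txt
        $real = substr($path, strlen("highlight-names://"));
        $this->handle = fopen($real, $mode);
        return $this->handle !== false;
    }

    public function stream_read($count) {
        // naive highlighting: upper-case a hard-coded name; note a name
        // split across two reads won't be matched
        return str_replace("Romeo", "ROMEO", fread($this->handle, $count));
    }

    public function stream_eof() {
        return feof($this->handle);
    }

    public function stream_close() {
        fclose($this->handle);
    }

    public function stream_stat() {
        return fstat($this->handle);
    }
}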

  

Similarly, it's also possible to create custom stream filters. The documentation has an example filter class:

Filter {
    public $filtername;
    public $params;
    public int filter ( resource $in , resource $out , int &$consumed ,
        bool $closing )
    public void onClose ( void )
    public bool onCreate ( void )
}

  

This can be registered just as easily - though note that stream_filter_append only works once the filter name has been registered with stream_filter_register:

// assuming our filter class is called HighlightNamesFilter
stream_filter_register("highlight-names", "HighlightNamesFilter");

$handle = fopen("story.txt", "w+");
stream_filter_append($handle, "highlight-names", STREAM_FILTER_READ);

  

highlight-names needs to match the filtername property of the new filter class. It's also possible to use custom filters in a php://filter/highlight-names/resource=story.txt string. Defining filters is much easier than defining protocols; one reason is that protocols need to handle directory operations, whereas filters only need to handle each chunk of data.
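
Here's an equally minimal sketch of such a filter class - again ours rather than the article's, with the same hard-coded, length-preserving name caveat:

class HighlightNamesFilter extends php_user_filter {
    public function filter($in, $out, &$consumed, $closing): int {
        // operate on each chunk of data as it passes through the stream
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = str_replace("Romeo", "ROMEO", $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}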

If you have the inclination, I strongly recommend experimenting with custom protocols and filters. If you can apply filters to stream_copy_to_stream operations, your applications will use next to no memory, even when working with obscenely large files. Imagine writing a resize-image filter or an encrypt-for-application filter.


Summary

Although this isn't the kind of problem we frequently suffer from, it's easy to mess up when working with large files. In asynchronous applications, it's just as easy to bring the whole server down when we're not careful about memory usage.

This tutorial has hopefully introduced you to a few new ideas (or refreshed your memory about them), so that you can think more about how to read and write large files efficiently. Once we start to become familiar with streams and generators, and stop reaching for functions like file_get_contents for everything, an entire category of errors tends to disappear from our applications.

 
