Three methods of reading large files in batches in Java

Keywords: Java

1. The difficulty of reading large files in Java
The usual way of reading a file in Java is to read all of its data into memory and then operate on that data. For example:

Path path = Paths.get("file path");
byte[] data = Files.readAllBytes(path);

This is fine for small files, but for slightly larger files an exception is thrown:

Exception in thread "main" java.lang.OutOfMemoryError: Required array size too large
at java.nio.file.Files.readAllBytes(Files.java:3156)

From the location of the error we can see that Files.readAllBytes supports files of at most Integer.MAX_VALUE - 8 bytes, i.e. files just under 2GB. Once this limit is exceeded, the built-in one-shot reading methods cannot be used directly.
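
If you still want the convenience of readAllBytes for files that fit, a small guard (a minimal sketch, not from the original article) can check the file size up front and report the problem explicitly instead of running into the OutOfMemoryError:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadAllBytesGuard {
    // Files.readAllBytes() rejects files larger than Integer.MAX_VALUE - 8 bytes
    private static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    public static byte[] readSmallFile(String fileName) throws IOException {
        Path path = Paths.get(fileName);
        long size = Files.size(path);// File length in bytes
        if (size > MAX_ARRAY_SIZE) {
            throw new IOException("File too large to read in one call: " + size + " bytes");
        }
        return Files.readAllBytes(path);
    }
}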

2. Reading large files in batches
Since a large file cannot be read into memory all at once, it has to be divided into multiple regions and read in several passes. There are several ways to do this.

(1) File byte stream
Create a java.io.BufferedInputStream for the file. Each call to read() fetches the next arraySize bytes of the file into the array. This method works but is relatively inefficient.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Created by zfh on 16-4-19.
 */
public class StreamFileReader {
    private BufferedInputStream fileIn;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public StreamFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new BufferedInputStream(new FileInputStream(fileName), arraySize);
        this.fileLength = new File(fileName).length();// available() returns an int and cannot report lengths over 2GB
        this.arraySize = arraySize;
    }

    public int read() throws IOException {
        byte[] tmpArray = new byte[arraySize];
        int bytes = fileIn.read(tmpArray);// Read into a temporary byte array
        if (bytes != -1) {
            array = new byte[bytes];// The array length is the number of bytes actually read
            System.arraycopy(tmpArray, 0, array, 0, bytes);// Copy the data that was read
            return bytes;
        }
        return -1;
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        StreamFileReader reader = new StreamFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ;
        long end = System.nanoTime();
        reader.close();
        System.out.println("StreamFileReader: " + (end - start));
    }
}
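
The main() above only measures how long it takes to walk through the whole file; a real consumer would do something with each chunk returned by getArray(). A minimal usage sketch (the output path is an illustrative assumption) that copies the file chunk by chunk:

import java.io.FileOutputStream;
import java.io.IOException;

public class StreamCopyExample {
    public static void main(String[] args) throws IOException {
        StreamFileReader reader = new StreamFileReader("/home/zfh/movie.mkv", 65536);
        try (FileOutputStream out = new FileOutputStream("/home/zfh/movie-copy.mkv")) {
            int bytes;
            while ((bytes = reader.read()) != -1) {
                out.write(reader.getArray(), 0, bytes);// Process the current chunk
            }
        }
        reader.close();
    }
}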

 

(2) File channel
Create a java.nio.channels.FileChannel for the file. Each call to read() reads file data into a java.nio.ByteBuffer allocated with a capacity of arraySize, and the data is then copied from the buffer into the byte array. This method, which uses an NIO channel, reads files faster than the traditional byte stream.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Created by zfh on 16-4-18.
 */
public class ChannelFileReader {
    private FileInputStream fileIn;
    private ByteBuffer byteBuf;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public ChannelFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new FileInputStream(fileName);
        this.fileLength = fileIn.getChannel().size();
        this.arraySize = arraySize;
        this.byteBuf = ByteBuffer.allocate(arraySize);
    }

    public int read() throws IOException {
        FileChannel fileChannel = fileIn.getChannel();
        int bytes = fileChannel.read(byteBuf);// Read into the ByteBuffer
        if (bytes != -1) {
            array = new byte[bytes];// The array length is the number of bytes actually read
            byteBuf.flip();
            byteBuf.get(array);// Copy the bytes out of the ByteBuffer
            byteBuf.clear();
            return bytes;
        }
        return -1;
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        ChannelFileReader reader = new ChannelFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ;
        long end = System.nanoTime();
        reader.close();
        System.out.println("ChannelFileReader: " + (end - start));
    }
}
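
A common variation (not benchmarked in the original article) is to allocate the buffer off-heap as a direct ByteBuffer, which can save one copy between the operating system and the Java heap; whether it is actually faster depends on the workload:

// In the ChannelFileReader constructor, a direct buffer could be used instead:
this.byteBuf = ByteBuffer.allocateDirect(arraySize);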

(3) Memory file mapping
This method maps the contents of the file into a region of the process's virtual memory, so the data can be operated on directly in memory without going through I/O for every read. Without mapping, each read switches from the Java process (user mode) into the operating system kernel, the operating system reads the file, and the data is copied back to user mode. Mapping avoids this round trip and can greatly improve the speed of operating on large files.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Created by zfh on 16-4-19.
 */
public class MappedFileReader {
    private FileInputStream fileIn;
    private MappedByteBuffer mappedBuf;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public MappedFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new FileInputStream(fileName);
        FileChannel fileChannel = fileIn.getChannel();
        this.fileLength = fileChannel.size();
        this.mappedBuf = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileLength);
        this.arraySize = arraySize;
    }

    public int read() throws IOException {
        int limit = mappedBuf.limit();
        int position = mappedBuf.position();
        if (position == limit) {
            return -1;
        }
        if (limit - position > arraySize) {
            array = new byte[arraySize];
            mappedBuf.get(array);
            return arraySize;
        } else {// Last read data
            array = new byte[limit - position];
            mappedBuf.get(array);
            return limit - position;
        }
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        MappedFileReader reader = new MappedFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ;
        long end = System.nanoTime();
        reader.close();
        System.out.println("MappedFileReader: " + (end - start));
    }
}

It seems that the problem has been solved perfectly, and we would naturally use memory file mapping to handle large files. However, running the code shows that this method still cannot read files larger than 2GB. The file length passed to FileChannel.map() is clearly of type long, so what does it have to do with Integer.MAX_VALUE?

Exception in thread "main" java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:868)

From the location of the error, we find the documentation for the size parameter of map():

size - The size of the region to be mapped; must be non-negative and no greater than Integer.MAX_VALUE

This can be attributed to historical reasons and to how deeply the int type is rooted in Java, but in essence it is because java.nio.MappedByteBuffer inherits directly from java.nio.ByteBuffer, whose index is an int, so a mapped buffer can only be indexed up to Integer.MAX_VALUE. Are we stuck in this case? Of course not: if one memory file mapping is not enough, use several. The idea is to compute number = ceil(fileLength / Integer.MAX_VALUE) and map that many consecutive regions of the file, each at most Integer.MAX_VALUE bytes long.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Created by zfh on 16-4-19.
 */
public class MappedBiggerFileReader {
    private MappedByteBuffer[] mappedBufArray;
    private int count = 0;
    private int number;
    private FileInputStream fileIn;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public MappedBiggerFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new FileInputStream(fileName);
        FileChannel fileChannel = fileIn.getChannel();
        this.fileLength = fileChannel.size();
        this.number = (int) Math.ceil((double) fileLength / (double) Integer.MAX_VALUE);
        this.mappedBufArray = new MappedByteBuffer[number];// Array of memory file mappings
        long preLength = 0;
        long regionSize = (long) Integer.MAX_VALUE;// Size of each mapped region
        for (int i = 0; i < number; i++) {// Map consecutive regions of the file into the array
            if (fileLength - preLength < (long) Integer.MAX_VALUE) {
                regionSize = fileLength - preLength;// Size of the last region
            }
            mappedBufArray[i] = fileChannel.map(FileChannel.MapMode.READ_ONLY, preLength, regionSize);
            preLength += regionSize;// Start of the next region
        }
        this.arraySize = arraySize;
    }

    public int read() throws IOException {
        if (count >= number) {
            return -1;
        }
        int limit = mappedBufArray[count].limit();
        int position = mappedBufArray[count].position();
        if (limit - position > arraySize) {
            array = new byte[arraySize];
            mappedBufArray[count].get(array);
            return arraySize;
        } else {// Last read from the current memory file mapping
            array = new byte[limit - position];
            mappedBufArray[count].get(array);
            if (count < number) {
                count++;// Switch to the next memory file mapping
            }
            return limit - position;
        }
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        MappedBiggerFileReader reader = new MappedBiggerFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ;
        long end = System.nanoTime();
        reader.close();
        System.out.println("MappedBiggerFileReader: " + (end - start));
    }
}

 

3. Comparison of running results
Reading a 1GB file with the three methods above gives the following results (times in nanoseconds, as returned by System.nanoTime()):

StreamFileReader: 11494900386
ChannelFileReader: 11329346316
MappedFileReader: 11169097480

Reading a 10GB file (MappedFileReader cannot map more than 2GB, so MappedBiggerFileReader is used instead) gives the following results:

StreamFileReader: 194579779394
ChannelFileReader: 190430242497
MappedBiggerFileReader: 186923035795
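
The numbers above are raw System.nanoTime() differences. To make them easier to read, they can be converted to seconds (a small convenience sketch, not part of the original benchmark):

import java.util.concurrent.TimeUnit;

public class ElapsedTimePrinter {
    public static void main(String[] args) {
        long elapsedNanos = 11169097480L;// e.g. MappedFileReader on the 1GB file
        System.out.println(TimeUnit.NANOSECONDS.toSeconds(elapsedNanos) + " s");// prints "11 s"
    }
}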

 

--------
Original link: https://blog.csdn.net/zhufenghao/article/details/51192043
