How Spark textFile partitions HDFS files [compressed and uncompressed]

Keywords: Big Data Spark Hadoop codec Apache


	sc.textFile("/blabla/{*.gz}")

When we create a SparkContext and use textFile to read files, what determines the partitioning, and how large is each partition? Two factors, illustrated by the quick check after this list:

  • The compressed format of the files
  • The file size and the HDFS block size
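
A quick way to see this in practice is to read a file and inspect the partition count. This is just a sketch; the paths are made-up placeholders:

  // Minimal sketch: inspect how textFile partitioned the input.
  // "/data/logs/app.log" is a hypothetical path.
  val rdd = sc.textFile("/data/logs/app.log")
  println(s"partitions: ${rdd.getNumPartitions}")

  // The optional second argument is a *minimum* number of partitions;
  // it is passed down to getSplits as numSplits.
  val rdd8 = sc.textFile("/data/logs/app.log", 8)
  println(s"partitions with minPartitions = 8: ${rdd8.getNumPartitions}")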

textFile creates a HadoopRDD, which uses the TextInputFormat class to decide how to partition the input. For each RDD, the getPartitions() function splits the input files. Here is the implementation of HadoopRDD.getPartitions:

  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }

As the code shows, the real splitting work is delegated to inputFormat.getSplits. Since the input format here is TextInputFormat, the call resolves to FileInputFormat.getSplits. The code is as follows:

public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
    Stopwatch sw = new Stopwatch().start();
    FileStatus[] files = listStatus(job);
    
    // Save the number of input files for metrics/loadgen
    job.setLong(NUM_INPUT_FILES, files.length);
    long totalSize = 0;                           // compute total size
    for (FileStatus file: files) {                // check we have valid files
      if (file.isDirectory()) {
        throw new IOException("Not a file: "+ file.getPath());
      }
      totalSize += file.getLen();
    }

    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
      FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

    // generate splits
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    NetworkTopology clusterMap = new NetworkTopology();
    for (FileStatus file: files) {
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        FileSystem fs = path.getFileSystem(job);
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(fs, path)) {
          long blockSize = file.getBlockSize();
          long splitSize = computeSplitSize(goalSize, minSize, blockSize);

          long bytesRemaining = length;
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
                length-bytesRemaining, splitSize, clusterMap);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                splitHosts[0], splitHosts[1]));
            bytesRemaining -= splitSize;
          }

          if (bytesRemaining != 0) {
            String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
                - bytesRemaining, bytesRemaining, clusterMap);
            splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                splitHosts[0], splitHosts[1]));
          }
        } else {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,0,length,clusterMap);
          splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.elapsedMillis());
    }
    return splits.toArray(new FileSplit[splits.size()]);
  }

  protected long computeSplitSize(long goalSize, long minSize,
                                       long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

As you can see, this function takes two parameters: jobConf (which carries the relevant cluster configuration) and numSplits, the requested number of splits, which comes from the second parameter of textFile(). When we do not pass that second parameter, the default value is as follows:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

defaultParallelism is a Spark configuration item (spark.default.parallelism); since it is almost always at least 2, defaultMinPartitions usually works out to 2.
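
You can check both values directly on a running SparkContext; this is a minimal sketch, and the printed numbers depend on your configuration:

  // sc is an existing SparkContext
  println(sc.defaultParallelism)    // driven by spark.default.parallelism / total cores
  println(sc.defaultMinPartitions)  // math.min(defaultParallelism, 2), usually 2
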
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE is the configuration key mapreduce.input.fileinputformat.split.minsize, so its value is decided by configuration and defaults to 1.
The default value of minSplitSize is also 1.
So we end up with:

long minSize = Math.max(1, 1); // minSize = 1
long goalSize = totalSize / (2 == 0 ? 1 : 2); // i.e. goalSize = totalSize / 2

Whether a file can be split is decided by the isSplitable function, and the answer depends on the compression format. I assume here that the file is not compressed, so it is splittable:

  protected boolean isSplitable(FileSystem fs, Path file) {
    final CompressionCodec codec = compressionCodecs.getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }
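
The same check can be done by hand with Hadoop's CompressionCodecFactory. The sketch below assumes the standard Hadoop codecs are on the classpath and uses made-up file names:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

  // Minimal sketch: which codec does a file name map to, and is it splittable?
  val factory = new CompressionCodecFactory(new Configuration())
  for (name <- Seq("a.txt", "a.gz", "a.bz2")) {       // example file names
    val codec = factory.getCodec(new Path(name))
    val splittable =
      codec == null ||                                 // no codec: plain text, splittable
      codec.isInstanceOf[SplittableCompressionCodec]   // e.g. bzip2
    println(s"$name -> splittable = $splittable")
  }

So a .gz file never gets split, while a .bz2 file can be.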

Next, the HDFS block size is obtained:

long blockSize = file.getBlockSize(); // blockSize = 128M (the HDFS 2.x default)

Finally, the split size is computed:

long splitSize = computeSplitSize(goalSize, minSize, blockSize); // Math.max(minSize, Math.min(goalSize, blockSize))
/**
 * Suppose you have a 20M file: max(1, min(10, 128)) gives splitSize = 10M, so the file is split into two partitions of 10M each.
 * Suppose you have a 520M file: max(1, min(260, 128)) gives splitSize = 128M, and SPLIT_SLOP (1.1) also comes into play:
 * 520 - 128*3 = 136M, and 136/128 = 1.06 < 1.1, so the remaining 136M becomes the last split and the file ends up with four partitions.
 */
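
To double-check the arithmetic, here is a small standalone sketch that reproduces computeSplitSize and the SPLIT_SLOP loop from FileInputFormat (sizes in bytes; the defaults numSplits = 2, minSize = 1 and a 128M block size are assumed):

  // Minimal sketch of the split arithmetic for an uncompressed file.
  val SPLIT_SLOP = 1.1                                  // same constant as FileInputFormat
  def computeSplitSize(goal: Long, min: Long, block: Long): Long =
    math.max(min, math.min(goal, block))

  def countSplits(length: Long, numSplits: Int = 2, minSize: Long = 1,
                  blockSize: Long = 128L * 1024 * 1024): Int = {
    val goalSize  = length / (if (numSplits == 0) 1 else numSplits)
    val splitSize = computeSplitSize(goalSize, minSize, blockSize)
    var remaining = length
    var splits    = 0
    while (remaining.toDouble / splitSize > SPLIT_SLOP) {
      splits += 1
      remaining -= splitSize
    }
    if (remaining != 0) splits += 1                     // the final, smaller split
    splits
  }

  println(countSplits(20L * 1024 * 1024))               // 2 partitions of 10M each
  println(countSplits(520L * 1024 * 1024))              // 4 partitions: 3 x 128M + 136M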

Summary

File splitting boils down to computing the split size, i.e. how many MB each piece should be, and then cutting each file into pieces of that size. The final number of partitions is the total number of pieces produced across all input files.
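
Putting it together with the compressed case from the title: reading the same data uncompressed and gzipped shows the difference, because a gzip file is never split and therefore contributes exactly one partition per file. The paths below are hypothetical:

  // Minimal sketch with made-up paths: a 520M uncompressed file vs. its gzipped copy.
  val plain = sc.textFile("/data/big/file_520m.txt")
  println(plain.getNumPartitions)   // 4, as computed above

  val gz = sc.textFile("/data/big/file_520m.txt.gz")
  println(gz.getNumPartitions)      // 1: gzip is not splittable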

Posted by MK27 on Sat, 02 Feb 2019 17:54:15 -0800