Spark: saving output data to an existing directory
Saving with the saveAsTextFile method of RDD:
String input = args[0];
String output = args[1];
SparkConf conf = new SparkConf();
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> rdd = jsc.textFile(input);
JavaPairRDD<String, String> pairRdd = rdd.mapToPair(new PairFunction<String, String, String>() {
    private static final long serialVersionUID = -4327506895954890835L;

    @Override
    public Tuple2<String, String> call(String t) throws Exception {
        String[] lines = t.split(",");
        return new Tuple2<String, String>(lines[0], lines[1]);
    }
});
pairRdd.saveAsTextFile(output);
If the output directory already exists, an exception is thrown:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /output already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1053)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:497)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:46)
    at com.mapbar.spark.outdir.SparkOutputFormat.main(SparkOutputFormat.java:38)
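Before reaching for a custom output format, a simpler workaround (not part of the original post) is to delete the output directory before saving. On HDFS this would be done with Hadoop's FileSystem.delete(outPath, true); the sketch below shows the same idea for a local filesystem using only java.nio, so it runs without Hadoop on the classpath. The class and method names here are chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class DeleteOutputDir {

    // Recursively delete a local output directory before writing to it.
    // On HDFS the equivalent would be FileSystem.delete(path, true),
    // which requires a Hadoop Configuration and is not shown here.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Sort in reverse order so children are deleted before parents.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("spark-output");
        Files.writeString(out.resolve("part-00000"), "old data");
        deleteRecursively(out);
        System.out.println(Files.exists(out)); // prints "false"
    }
}
```

Note that deleting the directory throws away all previous output; the approach described in the rest of this post instead keeps the directory and writes new, uniquely named files into it.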
Solution:
String input = args[0];
String output = args[1];
SparkConf conf = new SparkConf();
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> rdd = jsc.textFile(input);
JavaPairRDD<String, String> pairRdd = rdd.mapToPair(new PairFunction<String, String, String>() {
    private static final long serialVersionUID = -4327506895954890835L;

    @Override
    public Tuple2<String, String> call(String t) throws Exception {
        String[] lines = t.split(",");
        return new Tuple2<String, String>(lines[0], lines[1]);
    }
});
pairRdd.saveAsHadoopFile(output, String.class, String.class, RDDMultipleTextOutputFormat.class);
RDDMultipleTextOutputFormat.class
import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.mapreduce.security.TokenCache;

public class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat<String, String> {

    /**
     * @param key   the key of the RDD record
     * @param value the value of the RDD record
     * @param name  the default part-file name of each reducer
     * @return the file name to write this record to
     */
    @Override
    protected String generateFileNameForKeyValue(String key, String value, String name) {
        // Spark overwrites a file in an existing directory if the file name
        // already exists, so append a timestamp to keep names unique.
        return key + "/" + key + "-" + new Date().getTime() + ".csv";
    }

    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job)
            throws FileAlreadyExistsException, InvalidJobConfException, IOException {
        Path outDir = getOutputPath(job);
        if (outDir == null && job.getNumReduceTasks() != 0) {
            throw new InvalidJobConfException("Output directory not set in JobConf.");
        }
        if (outDir != null) {
            FileSystem fs = outDir.getFileSystem(job);
            // normalize the output directory
            outDir = fs.makeQualified(outDir);
            setOutputPath(job, outDir);

            // get delegation tokens for the output directory's file system
            TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                    new Path[] { outDir }, job);

            // The existence check is commented out so that Spark can write
            // into an existing output directory:
            /*
            if (fs.exists(outDir)) {
                throw new FileAlreadyExistsException("Output directory " + outDir
                        + " already exists");
            }
            */
        }
    }
}
The RDD is saved with a multiple-file output format. Overriding the checkOutputSpecs method and commenting out the existence check removes the "output directory already exists" error.
However, if an output file's name collides with a file already in the directory, the old file is overwritten.
So generateFileNameForKeyValue is also overridden to append a timestamp, giving each output file a unique name.
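The naming scheme can be checked in isolation. The standalone helper below (buildFileName is a name chosen here for illustration, not part of the class above) reproduces the key-plus-timestamp pattern that generateFileNameForKeyValue returns:

```java
import java.util.Date;

public class FileNameDemo {

    // Mirrors generateFileNameForKeyValue: one subdirectory per key,
    // with the file name made unique by a millisecond timestamp.
    static String buildFileName(String key, long timestampMillis) {
        return key + "/" + key + "-" + timestampMillis + ".csv";
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        // e.g. "beijing/beijing-1718000000000.csv"
        System.out.println(buildFileName("beijing", now));
    }
}
```

Because the timestamp has millisecond resolution, two save jobs writing the same key into the same directory will still produce distinct file names as long as they do not run in the same millisecond.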
Reproduced in: https://my.oschina.net/asparagus/blog/699814