Spark: saving output data to an existing directory
Saving with the saveAsTextFile method of RDD:
String input = args[0];
String output = args[1];
SparkConf conf = new SparkConf();
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> rdd = jsc.textFile(input);
JavaPairRDD<String, String> pairRdd = rdd.mapToPair(new PairFunction<String, String, String>() {
    private static final long serialVersionUID = -4327506895954890835L;

    @Override
    public Tuple2<String, String> call(String t) throws Exception {
        String[] lines = t.split(",");
        return new Tuple2<String, String>(lines[0], lines[1]);
    }
});
pairRdd.saveAsTextFile(output);
If the output directory already exists, an exception is thrown:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /output already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1053)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:497)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:46)
    at com.mapbar.spark.outdir.SparkOutputFormat.main(SparkOutputFormat.java:38)
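Before reaching for a custom output format, a simpler workaround (not part of the original post) is to delete the output directory before saving. On HDFS this would be done with Hadoop's FileSystem.delete(outPath, true); the sketch below shows the same idea for a local filesystem using only java.nio, so it runs without Hadoop on the classpath. The class and method names here are chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class DeleteOutputDir {

    // Recursively delete a local output directory before writing to it.
    // On HDFS the equivalent would be FileSystem.delete(path, true),
    // which requires a Hadoop Configuration and is not shown here.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Sort in reverse order so children are deleted before parents.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("spark-output");
        Files.writeString(out.resolve("part-00000"), "old data");
        deleteRecursively(out);
        System.out.println(Files.exists(out)); // prints "false"
    }
}
```

Note that deleting the directory throws away all previous output; the approach described in the rest of this post instead keeps the directory and writes new, uniquely named files into it.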
Solution:
String input = args[0];
String output = args[1];
SparkConf conf = new SparkConf();
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> rdd = jsc.textFile(input);
JavaPairRDD<String, String> pairRdd = rdd.mapToPair(new PairFunction<String, String, String>() {
    private static final long serialVersionUID = -4327506895954890835L;

    @Override
    public Tuple2<String, String> call(String t) throws Exception {
        String[] lines = t.split(",");
        return new Tuple2<String, String>(lines[0], lines[1]);
    }
});
pairRdd.saveAsHadoopFile(output, String.class, String.class, RDDMultipleTextOutputFormat.class);
RDDMultipleTextOutputFormat.class
import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.mapreduce.security.TokenCache;

public class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat<String, String> {

    /**
     * @param key   the key of the RDD record
     * @param value the value of the RDD record
     * @param name  the default part-file name of each reducer
     * @return the file name to write this record to
     */
    @Override
    protected String generateFileNameForKeyValue(String key, String value, String name) {
        // Spark overwrites a file in an existing directory if the file name
        // already exists, so append a timestamp to keep names unique.
        return key + "/" + key + "-" + new Date().getTime() + ".csv";
    }

    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job)
            throws FileAlreadyExistsException, InvalidJobConfException, IOException {
        Path outDir = getOutputPath(job);
        if (outDir == null && job.getNumReduceTasks() != 0) {
            throw new InvalidJobConfException("Output directory not set in JobConf.");
        }
        if (outDir != null) {
            FileSystem fs = outDir.getFileSystem(job);
            // normalize the output directory
            outDir = fs.makeQualified(outDir);
            setOutputPath(job, outDir);

            // get delegation tokens for the output directory's file system
            TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                    new Path[] { outDir }, job);

            // The existence check is commented out so that Spark can write
            // into an existing output directory:
            /*
            if (fs.exists(outDir)) {
                throw new FileAlreadyExistsException("Output directory " + outDir
                        + " already exists");
            }
            */
        }
    }
}
The RDD is saved with a multiple-file output format. Overriding the checkOutputSpecs method and commenting out the existence check removes the "output directory already exists" error.
However, if an output file's name collides with a file already in the directory, the old file is overwritten.
So generateFileNameForKeyValue is also overridden to append a timestamp, giving each output file a unique name.
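The naming scheme can be checked in isolation. The standalone helper below (buildFileName is a name chosen here for illustration, not part of the class above) reproduces the key-plus-timestamp pattern that generateFileNameForKeyValue returns:

```java
import java.util.Date;

public class FileNameDemo {

    // Mirrors generateFileNameForKeyValue: one subdirectory per key,
    // with the file name made unique by a millisecond timestamp.
    static String buildFileName(String key, long timestampMillis) {
        return key + "/" + key + "-" + timestampMillis + ".csv";
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        // e.g. "beijing/beijing-1718000000000.csv"
        System.out.println(buildFileName("beijing", now));
    }
}
```

Because the timestamp has millisecond resolution, two save jobs writing the same key into the same directory will still produce distinct file names as long as they do not run in the same millisecond.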
Reproduced in: https://my.oschina.net/asparagus/blog/699814