Case study:
Spark SQL load and save practice -> Generic Load/Save
load and save operations:
For a Spark SQL DataFrame, regardless of which data source it was created from, there are common load and save operations. The load operation loads data and creates a DataFrame; the save operation saves the data in a DataFrame back to a file.
Java version:

DataFrame df = sqlContext.read().load("users.parquet");
df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");

Scala version:

val df = sqlContext.read.load("users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
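For readers on Spark 2.0 or later, the same generic load and save operations are available through SparkSession, which replaces SQLContext as the entry point. A minimal Scala sketch, assuming the same users.parquet file:

import org.apache.spark.sql.SparkSession

// SparkSession is the Spark 2.x+ entry point; getOrCreate() reuses an existing session
val spark = SparkSession.builder().appName("GenericLoadSave").master("local").getOrCreate()
val df = spark.read.load("users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")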
Manually specifying the data source type (this enables conversion between different data sources):
You can also manually specify the data source type used for an operation. Data sources are specified by their fully qualified names, such as org.apache.spark.sql.parquet for parquet, but for Spark SQL's built-in data source types (json, parquet, jdbc, and so on) the short name is enough. With this feature you can convert between different data source types, for example saving the data from a JSON file into a Parquet file. By default, if you don't specify a data source type, parquet is used.
Java version:

DataFrame df = sqlContext.read().format("json").load("people.json");
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");

Scala version:

val df = sqlContext.read.format("json").load("people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
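To see that parquet really is the default, a minimal Scala sketch (assuming the people.json file from above) converts JSON to Parquet and then reads the result back without specifying any format:

// Explicit format on the read side, explicit parquet on the write side
val peopleDF = sqlContext.read.format("json").load("people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

// No format given here, so Spark falls back to the default data source: parquet
val roundTripDF = sqlContext.read.load("namesAndAges.parquet")
roundTripDF.show()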
Save Mode:
Spark SQL provides several save modes for the save operation. A save mode determines what happens when data already exists at the target location. Note that save operations do not acquire locks and are not atomic, so there is some risk of dirty data. The available modes are listed in the table below, followed by a short usage sketch.
Save Mode | Meaning
--- | ---
SaveMode.ErrorIfExists (default) | If data already exists at the target location, throw an exception
SaveMode.Append | If data already exists at the target location, append the new data to it
SaveMode.Overwrite | If data already exists at the target location, delete the existing data and overwrite it with the new data
SaveMode.Ignore | If data already exists at the target location, ignore the operation and do nothing
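A save mode is set on the DataFrameWriter via mode() before calling save(). A minimal Scala sketch, using a hypothetical output path and assuming the people.json file from above:

import org.apache.spark.sql.SaveMode

val df = sqlContext.read.format("json").load("people.json")

// First write: ErrorIfExists is the default, so this succeeds only if the path has no data yet
df.write.format("parquet").mode(SaveMode.ErrorIfExists).save("people_out.parquet")

// Second write to the same path: Append adds the rows alongside the existing files;
// Overwrite would delete the existing data first, and Ignore would silently skip the write
df.write.format("parquet").mode(SaveMode.Append).save("people_out.parquet")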
Case study:
Java version:
package Spark_SQL;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

/**
 * @Date: 2019/3/14 14:13
 * @Author Angle
 *
 * Spark SQL load/save case study
 */
public class GenericLoadSave {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GenericLoadSave").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Local-file variant:
        // Dataset<Row> userDF = sqlContext.read().load("E:\\IDEA\\textFile\\users.parquet");
        // userDF.select("name","favorite_color").write()
        //         .save("E:\\IDEA\\textFile\\nameANDcolor_java");

        // Load the parquet file from HDFS and save the selected columns back to HDFS
        Dataset<Row> userDF = sqlContext.read().load("hdfs://master:9000/users.parquet");
        userDF.select("name", "favorite_color").write()
                .save("hdfs://master:9000/nameANDcolor_java");
        userDF.show();
    }
}
Scala version:
package SparkSQL_Scala

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @Date: 2019/3/14 14:24
 * @Author Angle
 */
object GenericLoadSave_s {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GenericLoadSave_s").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load the local parquet file and save it back out under a new name
    val userDF = sqlContext.read.load("E:\\IDEA\\textFile\\users.parquet")
    userDF.write.save("E:\\IDEA\\textFile\\users.parquet_Scala")
    userDF.show()
  }
}