Case study:
Spark SQL load and save practice -> Generic Load/Save
load and save operations:
For a Spark SQL DataFrame, regardless of which data source it was created from, there are common load and save operations. The load operation loads data and creates a DataFrame; the save operation saves the data in a DataFrame back to a file.
Java version:

DataFrame df = sqlContext.read().load("users.parquet");
df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");

Scala version:

val df = sqlContext.read.load("users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
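For readers on Spark 2.0 or later, the same generic load and save operations are available through SparkSession, which replaces SQLContext as the entry point. A minimal Scala sketch, assuming the same users.parquet file:

import org.apache.spark.sql.SparkSession

// SparkSession is the Spark 2.x+ entry point; getOrCreate() reuses an existing session
val spark = SparkSession.builder().appName("GenericLoadSave").master("local").getOrCreate()
val df = spark.read.load("users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")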
Manually specifying the data source type (this enables conversion between different data sources):
You can also manually specify the data source type used for an operation. Data sources are specified by their fully qualified names, such as org.apache.spark.sql.parquet for parquet, but for Spark SQL's built-in data source types (json, parquet, jdbc, and so on) the short name is enough. With this feature you can convert between different data source types, for example saving the data from a JSON file into a Parquet file. By default, if you don't specify a data source type, parquet is used.
Java version:

DataFrame df = sqlContext.read().format("json").load("people.json");
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");

Scala version:

val df = sqlContext.read.format("json").load("people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
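To see that parquet really is the default, a minimal Scala sketch (assuming the people.json file from above) converts JSON to Parquet and then reads the result back without specifying any format:

// Explicit format on the read side, explicit parquet on the write side
val peopleDF = sqlContext.read.format("json").load("people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

// No format given here, so Spark falls back to the default data source: parquet
val roundTripDF = sqlContext.read.load("namesAndAges.parquet")
roundTripDF.show()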
Save Mode:
Spark SQL provides several save modes for the save operation. A save mode determines what happens when data already exists at the target location. Note that save operations do not acquire locks and are not atomic, so there is some risk of dirty data. The available modes are listed in the table below, followed by a short usage sketch.
Save Mode | Meaning
--- | ---
SaveMode.ErrorIfExists (default) | If data already exists at the target location, throw an exception
SaveMode.Append | If data already exists at the target location, append the new data to it
SaveMode.Overwrite | If data already exists at the target location, delete the existing data and overwrite it with the new data
SaveMode.Ignore | If data already exists at the target location, ignore the operation and do nothing
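A save mode is set on the DataFrameWriter via mode() before calling save(). A minimal Scala sketch, using a hypothetical output path and assuming the people.json file from above:

import org.apache.spark.sql.SaveMode

val df = sqlContext.read.format("json").load("people.json")

// First write: ErrorIfExists is the default, so this succeeds only if the path has no data yet
df.write.format("parquet").mode(SaveMode.ErrorIfExists).save("people_out.parquet")

// Second write to the same path: Append adds the rows alongside the existing files;
// Overwrite would delete the existing data first, and Ignore would silently skip the write
df.write.format("parquet").mode(SaveMode.Append).save("people_out.parquet")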
Case study:
Java version:
package Spark_SQL;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

/**
 * @Date: 2019/3/14 14:13
 * @Author Angle
 *
 * Spark SQL load/save case study
 */
public class GenericLoadSave {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GenericLoadSave").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Local-file variant:
        // Dataset<Row> userDF = sqlContext.read().load("E:\\IDEA\\textFile\\users.parquet");
        // userDF.select("name","favorite_color").write()
        //         .save("E:\\IDEA\\textFile\\nameANDcolor_java");

        // Load the parquet file from HDFS and save the selected columns back to HDFS
        Dataset<Row> userDF = sqlContext.read().load("hdfs://master:9000/users.parquet");
        userDF.select("name", "favorite_color").write()
                .save("hdfs://master:9000/nameANDcolor_java");
        userDF.show();
    }
}
Scala version:
package SparkSQL_Scala

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @Date: 2019/3/14 14:24
 * @Author Angle
 */
object GenericLoadSave_s {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GenericLoadSave_s").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load the local parquet file and save it back out under a new name
    val userDF = sqlContext.read.load("E:\\IDEA\\textFile\\users.parquet")
    userDF.write.save("E:\\IDEA\\textFile\\users.parquet_Scala")
    userDF.show()
  }
}