Spark SQL is a Spark module for processing structured data. Unlike the underlying Spark RDD API, the Spark SQL interfaces give Spark more information about the structure of the data and the computation being performed. Spark SQL currently offers three APIs: SQL statements, the DataFrame API, and the newer Dataset API.
One use of Spark SQL is to execute SQL queries directly. You can use basic SQL syntax or HiveQL syntax, and Spark SQL can also read data from an existing Hive installation.
A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R and Python, and it can be loaded and constructed from many data sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
Dataset is a new API introduced in Spark 1.6. It aims to combine the advantages of RDDs (strong typing, the ability to use lambda functions) with the advantages of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects, and the usual transformation operators (map, flatMap, filter, etc.) can then be used on it.
Entry points: SQLContext and SparkSession
Before version 2.0, the entry point for all Spark SQL functionality was the SQLContext class and its subclasses.
val sc: SparkContext // Suppose you already have a SparkContext object
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Implicit conversions for turning RDDs into DataFrames
import sqlContext.implicits._
Since version 2.0, the entry point is SparkSession, which is created with SparkSession.builder().
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();
SparkSession in Spark 2.0 provides built-in support for Hive features, including writing queries in HiveQL, accessing Hive UDFs, and reading data from Hive tables. You do not need an existing Hive installation to use these features. Previously, you had to pull in the HiveContext dependency and use HiveContext to get them.
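For illustration, Hive support is requested through the same builder; a minimal sketch (the warehouse location and variable name are illustrative, and the full Hive example later in this article uses the identical pattern):

import org.apache.spark.sql.SparkSession;

// Sketch: the same builder pattern, but with Hive support enabled.
// "spark-warehouse" is only an illustrative warehouse location.
SparkSession sparkWithHive = SparkSession
  .builder()
  .appName("Spark SQL with Hive support")
  .config("spark.sql.warehouse.dir", "spark-warehouse")
  .enableHiveSupport()
  .getOrCreate();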
DataFrame
A DataFrame can be loaded and constructed from many data sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

// Displays the content of the DataFrame to stdout
df.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Since Spark 2.0, a DataFrame is simply a Dataset of Rows (in Java and Scala). DataFrames provide a domain-specific language for structured data manipulation.
// col("...") is preferable to df.col("...") import static org.apache.spark.sql.functions.col; // Print the schema in a tree format df.printSchema(); // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // Select only the "name" column df.select("name").show(); // +-------+ // | name| // +-------+ // |Michael| // | Andy| // | Justin| // +-------+ // Select everybody, but increment the age by 1 df.select(col("name"), col("age").plus(1)).show(); // +-------+---------+ // | name|(age + 1)| // +-------+---------+ // |Michael| null| // | Andy| 31| // | Justin| 20| // +-------+---------+ // Select people older than 21 df.filter(col("age").gt(21)).show(); // +---+----+ // |age|name| // +---+----+ // | 30|Andy| // +---+----+ // Count people by age df.groupBy("age").count().show(); // +----+-----+ // | age|count| // +----+-----+ // | 19| 1| // |null| 1| // | 30| 1| // +----+-----+
For a complete list of operations, please see the Dataset API documentation.
In addition to column references and expressions, Datasets also provide a rich library of string, date, math and other functions; see the DataFrame function reference.
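As a small sketch of these built-in functions (reusing the df from above; the static imports come from org.apache.spark.sql.functions):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.current_date;
import static org.apache.spark.sql.functions.round;
import static org.apache.spark.sql.functions.upper;

// Sketch: one string function, one math function and one date function applied to df
df.select(
    upper(col("name")).alias("upper_name"),        // string function
    round(col("age").divide(10)).alias("decade"),  // math function
    current_date().alias("today")                  // date function
).show();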
Running SQL queries programmatically
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people");

Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Global Temporary View: the temporary view created above is tied to the SparkSession that created it and is destroyed when that session ends. If you want a view shared across multiple sessions, you need a global temporary view.
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people");

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Dataset
The Dataset API is similar to RDDs, but instead of Java serialization or Kryo it uses a specialized Encoder to serialize objects for processing and network transport. While both encoders and standard serialization turn objects into bytes, encoders are generated dynamically from code and use a format that lets Spark perform many operations, such as filtering, sorting and hashing, without deserializing the bytes back into objects.
import java.util.Arrays;
import java.util.Collections;
import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

public static class Person implements Serializable {
  private String name;
  private int age;

  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
  public int getAge() { return age; }
  public void setAge(int age) { this.age = age; }
}

// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);

// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder
);
javaBeanDS.show();
// +---+----+
// |age|name|
// +---+----+
// | 32|Andy|
// +---+----+

// Encoders for most common types are provided in class Encoders
Encoder<Integer> integerEncoder = Encoders.INT();
Dataset<Integer> primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
Dataset<Integer> transformedDS = primitiveDS.map(new MapFunction<Integer, Integer>() {
  @Override
  public Integer call(Integer value) throws Exception {
    return value + 1;
  }
}, integerEncoder);
transformedDS.collect(); // Returns [2, 3, 4]

// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = "examples/src/main/resources/people.json";
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
peopleDS.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Interoperability with RDD
Spark SQL provides two ways to convert existing RDDs into DataFrames.
The first uses reflection to infer the schema of an RDD that contains objects of a specific type. This reflection-based approach leads to more concise code and is recommended when you already know the schema.
The second constructs a schema programmatically and applies it to an existing RDD. This approach is more verbose, but it is what you need when the fields and their types are not known until runtime.
Inferring the schema using reflection
Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. Currently, Spark SQL does not support JavaBeans that contain Map fields. You can create a JavaBean that implements the Serializable interface, as in the Person class above.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(new Function<String, Person>() {
    @Override
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");

// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(new MapFunction<Row, String>() {
  @Override
  public String call(Row row) throws Exception {
    return "Name: " + row.getString(0);
  }
}, stringEncoder);
teenagerNamesByIndexDF.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(new MapFunction<Row, String>() {
  @Override
  public String call(Row row) throws Exception {
    return "Name: " + row.<String>getAs("name");
  }
}, stringEncoder);
teenagerNamesByFieldDF.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
Programmatically specifying the schema
When the schema cannot be defined ahead of time, a DataFrame can be created programmatically in three steps:
Create an RDD containing Row objects from an existing RDD
Create a schema with StructType that matches the structure of the RDD created in Step 1
Apply the schema to the RDD of Row objects via SparkSession.createDataFrame
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map(new Function<String, Row>() {
  @Override
  public Row call(String record) throws Exception {
    String[] attributes = record.split(",");
    return RowFactory.create(attributes[0], attributes[1].trim());
  }
});

// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");

// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(new MapFunction<Row, String>() {
  @Override
  public String call(Row row) throws Exception {
    return "Name: " + row.getString(0);
  }
}, Encoders.STRING());
namesDS.show();
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on like a regular RDD or registered as a temporary view for querying; once registered, you can run SQL statements against it. This section describes common ways to load and save data with the different Spark data sources.
In the simplest form, all operations use the default data source (Parquet, unless spark.sql.sources.default is configured otherwise).
Dataset<Row> usersDF = spark.read().load("examples/src/main/resources/users.parquet");
usersDF.select("name", "favorite_color").write().save("namesAndFavColors.parquet");
You can also specify the data source manually and set some additional option parameters. The data source can be specified by its full name (for example, org.apache.spark.sql.parquet), while for the built-in supported data source, the abbreviated name (json, parquet, jdbc) can be used. The DataFrame created by any type of data source can be converted to other types of data formats using the following syntax.
Dataset<Row> peopleDF =
  spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
Running SQL directly on files: instead of loading a file into a DataFrame with the read API and then querying it, Spark SQL can also query the file directly with SQL.
Dataset<Row> sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");
Save modes
Save operations take an optional SaveMode parameter that specifies what to do when data already exists at the destination. It is important to realize that these save modes do not use any locking and are not atomic. In addition, when Overwrite is used, the existing data is deleted before the new data is written.
SaveMode.ErrorIfExists ("error", the default): when saving the DataFrame, throw an exception if data already exists at the destination.
SaveMode.Append ("append"): if data or the table already exists, append the contents of the DataFrame to the existing data.
SaveMode.Overwrite ("overwrite"): if data or the table already exists, replace it with the contents of the DataFrame.
SaveMode.Ignore ("ignore"): if data already exists, silently skip saving the DataFrame, similar to CREATE TABLE IF NOT EXISTS in SQL.
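The mode is set on the DataFrameWriter; a minimal sketch (reusing the peopleDF written earlier, with the same illustrative output path):

import org.apache.spark.sql.SaveMode;

// Sketch: append to existing output instead of failing when it already exists
peopleDF.write()
  .mode(SaveMode.Append)            // the string form .mode("append") is equivalent
  .parquet("namesAndAges.parquet");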
Saving to persistent tables
A DataFrame can be persisted into Hive's metastore as a table with the saveAsTable method. By default, saveAsTable creates a "managed table", meaning the location of the table data is controlled by the metastore; consequently, when the table is dropped, its data is deleted as well.
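A minimal sketch (the table name people_table is hypothetical, and a metastore-enabled session is assumed):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: persist the DataFrame as a managed table, then read it back by name
peopleDF.write().saveAsTable("people_table");
Dataset<Row> saved = spark.table("people_table");
saved.show();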
Parquet file
Parquet is a popular columnar storage format. Spark SQL provides read and write support for Parquet files, and Parquet files automatically preserve the schema of the original data. When writing Parquet files, all columns are automatically converted to nullable for compatibility reasons.
Loading data programmatically
Dataset<Row> peopleDF = spark.read().json("examples/src/main/resources/people.json");

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write().parquet("people.parquet");

// Read in the Parquet file created above.
// Parquet files are self-describing so the schema is preserved
// The result of loading a parquet file is also a DataFrame
Dataset<Row> parquetFileDF = spark.read().parquet("people.parquet");

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile");
Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map(new MapFunction<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}, Encoders.STRING());
namesDS.show();
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
Other key features: See the official documentation
Partition Discovery
Schema merging (a brief sketch follows this list)
Hive metastore Parquet table conversion
Metadata refreshing
Configuration
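As a brief illustration of one of these features, schema merging can be requested at read time; a hedged sketch (data/test_table is a hypothetical partitioned Parquet directory whose partitions have compatible but different schemas):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: merge the schemas of all Parquet partitions while reading
Dataset<Row> merged = spark.read()
  .option("mergeSchema", "true")
  .parquet("data/test_table");
merged.printSchema();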
JSON Data Set
When loading a JSON dataset, Spark SQL can automatically infer its schema and return a Dataset<Row>. This conversion is done with SparkSession.read().json() on either a JSON file or an RDD of Strings.
Note that the file expected here is not a typical multi-line JSON file: each line must contain a separate, self-contained, valid JSON object. A regular pretty-printed multi-line JSON file will therefore usually fail to load.
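If you really need to read a pretty-printed, multi-line JSON file, newer Spark releases offer a multiLine read option; a hedged sketch (the file path is hypothetical, and the option's availability depends on your Spark version):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: treat the whole file as one (or several) multi-line JSON documents
Dataset<Row> multiLinePeople = spark.read()
  .option("multiLine", true)
  .json("examples/multiline-people.json");
multiLinePeople.show();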
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");

// The inferred schema can be visualized using the printSchema() method
people.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
namesDF.show();
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
List<String> jsonData = Arrays.asList(
  "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
JavaRDD<String> anotherPeopleRDD =
  new JavaSparkContext(spark.sparkContext()).parallelize(jsonData);
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleRDD);
anotherPeople.show();
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
Hive table
Spark SQL also supports reading and writing data stored in Hive. However, Hive has a large number of dependencies, so they are not included in the default Spark distribution. To enable Hive support, the relevant Hive jars must be on the classpath (of all nodes).
The configuration files hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) should be placed in conf/.
To work with Hive, you must first instantiate a SparkSession with Hive support enabled; this works even if you do not have an existing Hive deployment.
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static class Record implements Serializable {
  private int key;
  private String value;

  public int getKey() { return key; }
  public void setKey(int key) { this.key = key; }
  public String getValue() { return value; }
  public void setValue(String value) { this.value = value; }
}

// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = "spark-warehouse";
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate();

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");

// Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show();
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
spark.sql("SELECT COUNT(*) FROM src").show();
// +--------+
// |count(1)|
// +--------+
// |     500|
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");

// The items in DataFrames are of type Row, which lets you access each column by ordinal.
Dataset<String> stringsDS = sqlDF.map(new MapFunction<Row, String>() {
  @Override
  public String call(Row row) throws Exception {
    return "Key: " + row.get(0) + ", Value: " + row.get(1);
  }
}, Encoders.STRING());
stringsDS.show();
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
List<Record> records = new ArrayList<>();
for (int key = 1; key < 100; key++) {
  Record record = new Record();
  record.setKey(key);
  record.setValue("val_" + key);
  records.add(record);
}
Dataset<Row> recordsDF = spark.createDataFrame(records, Record.class);
recordsDF.createOrReplaceTempView("records");

// Queries can then join DataFrames data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show();
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// ...
Interacting with different versions of the Hive metastore: only touched on here; see the official documentation.
Connecting to other databases with JDBC
Spark SQL can also access other databases through JDBC. This feature should be preferred over JdbcRDD because it returns the results as a DataFrame, which is easier to work with in Spark SQL and to combine with data from other sources.
First, include the JDBC driver for your database on the Spark classpath. For example, the following command includes the PostgreSQL driver:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Tables from a remote database can be loaded as a DataFrame or a Spark SQL temporary view using the Data Sources API. The following options are supported:
url: the JDBC URL to connect to.
dbtable: the JDBC table to read. Note that anything that is valid in the FROM clause of a SQL query can be used here: either a full table name or a parenthesized subquery.
driver: the class name of the JDBC driver. This class must be available on both master and worker nodes so that each node can register the driver with the JDBC subsystem.
fetchsize: the JDBC fetch size, which determines how many rows are fetched per round trip. The default is 1000.
isolationLevel: the transaction isolation level, one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE; the default is READ_UNCOMMITTED.
truncate This is a JDBC writer related option. When SaveMode.Overwrite is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it. This can be more efficient, and prevents the table metadata (e.g., indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. It defaults to false. This option applies only to writing.
createTableOptions This is a JDBC writer related option. If specified, this option allows setting of database-specific table and partition options when creating a table (e.g., CREATE TABLE t (name string) ENGINE=InnoDB.). This option applies only to writing.
import java.util.Properties;

// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
Dataset<Row> jdbcDF = spark.read()
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load();

Properties connectionProperties = new Properties();
connectionProperties.put("user", "username");
connectionProperties.put("password", "password");
Dataset<Row> jdbcDF2 = spark.read()
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties);

// Saving data to a JDBC source
jdbcDF.write()
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save();

jdbcDF2.write()
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties);
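As a sketch of the writer options described above, truncate can be combined with overwrite mode roughly as follows (reusing jdbcDF and connectionProperties from the block above; whether truncation is honored depends on the target database):

import org.apache.spark.sql.SaveMode;

// Sketch: replace the table contents but keep its definition (and indices) by truncating
jdbcDF.write()
  .mode(SaveMode.Overwrite)
  .option("truncate", true)
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties);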
Troubleshooting
The JDBC driver class must be visible to Java's primordial class loader on all client sessions and executors. This is because Java's DriverManager performs a security check before opening a connection and ignores drivers that are not visible to the primordial class loader. The easiest way to achieve this is to modify compute_classpath.sh on all worker nodes to include the driver jar. Also, some databases, such as H2, convert all names to upper case; in Spark SQL you must then use upper case to refer to those names as well.
Performance tuning
Caching data in memory
Spark SQL can cache tables in memory in a columnar format by calling SQLContext.cacheTable("tableName") or DataFrame.cache(). Spark SQL will then scan only the required columns and automatically tune compression to reduce memory usage and GC pressure. You can call SQLContext.uncacheTable("tableName") to remove the table from memory.
You can also configure the in-memory caching with SQLContext.setConf or by running SET key=value commands in SQL.
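With a SparkSession (Spark 2.x), the equivalent calls go through the session's catalog and runtime config; a minimal sketch reusing the people view registered earlier:

// Sketch: cache and uncache a table through the session catalog,
// and adjust a cache-related setting through the runtime config
spark.catalog().cacheTable("people");
spark.conf().set("spark.sql.inMemoryColumnarStorage.batchSize", "20000");
// ... run queries against the cached table ...
spark.catalog().uncacheTable("people");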
spark.sql.inMemoryColumnarStorage.compressed (default: true): when set to true, Spark SQL automatically selects a compression codec for each column based on statistics of the data.
spark.sql.inMemoryColumnarStorage.batchSize (default: 10000): controls the batch size for columnar caching. Larger batches improve memory utilization and compression, but also bring the risk of OOM (Out Of Memory) errors when caching data.
Other configuration options
The following options can also be used to tune query performance. However, they may be deprecated in future releases as more optimizations are performed automatically.
spark.sql.files.maxPartitionBytes (default: 134217728, i.e. 128 MB): the maximum number of bytes to pack into a single partition when reading files.
spark.sql.files.openCostInBytes (default: 4194304, i.e. 4 MB): the estimated cost to open a file, measured as the number of bytes that could be scanned in the same time. This is used when packing multiple files into a partition. It is better to over-estimate; then partitions with small files will be faster than partitions with bigger files (which are scheduled first).
spark.sql.broadcastTimeout (default: 300): timeout in seconds for the broadcast wait time in broadcast joins.
spark.sql.autoBroadcastJoinThreshold (default: 10485760, i.e. 10 MB): the maximum size of a table that will be broadcast to all worker nodes when performing a join. Set it to -1 to disable broadcasting. Note that currently statistics are only supported for Hive metastore tables on which the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
spark.sql.shuffle.partitions (default: 200): configures the number of partitions used when shuffling data for joins or aggregations.
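These settings can be changed in the same ways as the cache options above, either through the runtime config or with a SQL SET command; a minimal sketch with illustrative values:

// Sketch: raise shuffle parallelism and the broadcast-join threshold at runtime
spark.conf().set("spark.sql.shuffle.partitions", "400");
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=20971520"); // 20 MB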
Distributed SQL Engine
Spark SQL can be used as a distributed query engine for JDBC/ODBC or command line tools. In this mode, end users or applications can directly run SQL queries in Spark SQL without writing any code.
(The Thrift JDBC/ODBC server is not covered here; see the official documentation.)
Using the Spark SQL command-line tool
The Spark SQL CLI is a convenient tool that runs the Hive metastore service in local mode and executes queries entered on the command line. Note that the Spark SQL CLI cannot currently communicate with the Thrift JDBC server.
Start the Spark SQL CLI from the Spark directory with the following command:
./bin/spark-sql
Hive configuration is read from hive-site.xml, core-site.xml and hdfs-site.xml in the conf/ directory. You can view the complete list of options with: ./bin/spark-sql --help
Aggregation
The built-in DataFrame functions provide common aggregations such as count(), countDistinct(), avg(), max() and min(). Users can also define their own aggregation functions.
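For instance, the built-in aggregates can be used directly on the employees data that the examples below also use (a short sketch with static imports from org.apache.spark.sql.functions):

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: built-in aggregations over examples/src/main/resources/employees.json
Dataset<Row> employees = spark.read().json("examples/src/main/resources/employees.json");
employees.select(count(col("name")), avg(col("salary")), max(col("salary"))).show();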
Untyped User-Defined Aggregate Functions
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public static class MyAverage extends UserDefinedAggregateFunction {

  private StructType inputSchema;
  private StructType bufferSchema;

  public MyAverage() {
    List<StructField> inputFields = new ArrayList<>();
    inputFields.add(DataTypes.createStructField("inputColumn", DataTypes.LongType, true));
    inputSchema = DataTypes.createStructType(inputFields);

    List<StructField> bufferFields = new ArrayList<>();
    bufferFields.add(DataTypes.createStructField("sum", DataTypes.LongType, true));
    bufferFields.add(DataTypes.createStructField("count", DataTypes.LongType, true));
    bufferSchema = DataTypes.createStructType(bufferFields);
  }

  // Data types of input arguments of this aggregate function
  public StructType inputSchema() {
    return inputSchema;
  }

  // Data types of values in the aggregation buffer
  public StructType bufferSchema() {
    return bufferSchema;
  }

  // The data type of the returned value
  public DataType dataType() {
    return DataTypes.DoubleType;
  }

  // Whether this function always returns the same output on the identical input
  public boolean deterministic() {
    return true;
  }

  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  public void initialize(MutableAggregationBuffer buffer) {
    buffer.update(0, 0L);
    buffer.update(1, 0L);
  }

  // Updates the given aggregation buffer `buffer` with new input data from `input`
  public void update(MutableAggregationBuffer buffer, Row input) {
    if (!input.isNullAt(0)) {
      long updatedSum = buffer.getLong(0) + input.getLong(0);
      long updatedCount = buffer.getLong(1) + 1;
      buffer.update(0, updatedSum);
      buffer.update(1, updatedCount);
    }
  }

  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
    long mergedSum = buffer1.getLong(0) + buffer2.getLong(0);
    long mergedCount = buffer1.getLong(1) + buffer2.getLong(1);
    buffer1.update(0, mergedSum);
    buffer1.update(1, mergedCount);
  }

  // Calculates the final result
  public Double evaluate(Row buffer) {
    return ((double) buffer.getLong(0)) / buffer.getLong(1);
  }
}

// Register the function to access it
spark.udf().register("myAverage", new MyAverage());

Dataset<Row> df = spark.read().json("examples/src/main/resources/employees.json");
df.createOrReplaceTempView("employees");
df.show();
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

Dataset<Row> result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees");
result.show();
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+
Type-Safe User-Defined Aggregate Functions
import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.TypedColumn;
import org.apache.spark.sql.expressions.Aggregator;

public static class Employee implements Serializable {
  private String name;
  private long salary;

  // Constructors, getters, setters...
}

public static class Average implements Serializable {
  private long sum;
  private long count;

  // Constructors, getters, setters...
}

public static class MyAverage extends Aggregator<Employee, Average, Double> {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  public Average zero() {
    return new Average(0L, 0L);
  }

  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  public Average reduce(Average buffer, Employee employee) {
    long newSum = buffer.getSum() + employee.getSalary();
    long newCount = buffer.getCount() + 1;
    buffer.setSum(newSum);
    buffer.setCount(newCount);
    return buffer;
  }

  // Merge two intermediate values
  public Average merge(Average b1, Average b2) {
    long mergedSum = b1.getSum() + b2.getSum();
    long mergedCount = b1.getCount() + b2.getCount();
    b1.setSum(mergedSum);
    b1.setCount(mergedCount);
    return b1;
  }

  // Transform the output of the reduction
  public Double finish(Average reduction) {
    return ((double) reduction.getSum()) / reduction.getCount();
  }

  // Specifies the Encoder for the intermediate value type
  public Encoder<Average> bufferEncoder() {
    return Encoders.bean(Average.class);
  }

  // Specifies the Encoder for the final output value type
  public Encoder<Double> outputEncoder() {
    return Encoders.DOUBLE();
  }
}

Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
String path = "examples/src/main/resources/employees.json";
Dataset<Employee> ds = spark.read().json(path).as(employeeEncoder);
ds.show();
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

MyAverage myAverage = new MyAverage();
// Convert the function to a `TypedColumn` and give it a name
TypedColumn<Employee, Double> averageSalary = myAverage.toColumn().name("average_salary");
Dataset<Double> result = ds.select(averageSalary);
result.show();
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+
Reference resources:
http://spark.apache.org/docs/...
http://ifeve.com/apache-spark/