Requirement: read data from Hive and write it to Elasticsearch (ES).
Environment: Spark 2.0.2
1. Set enableHiveSupport() on the SparkSession
    SparkConf conf = new SparkConf().setAppName("appName").setMaster("local[*]");
    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL basic example hive")
        .config(conf)
        .enableHiveSupport() // enable Hive support
        .getOrCreate();
2. Add the dependency to pom.xml (the artifact version does not need to match your Hive version)
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.2.1</version>
    </dependency>
or
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
3. Put the configuration files under Spark's conf/ directory
Reference: the official documentation
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
4. Read data with spark.sql
    SparkSession spark = ESMysqlSpark.getSession();
    String querySql = "SELECT * FROM test.table";
    spark.sql(querySql);
5. Hive SQL statements
Requirement: merge two fields into a new string.
You can first register a UDF:
    spark.udf().register("mode", new UDF2<String, Long, String>() {
        public String call(String types, Long time) throws Exception {
            return types.replace(".", "") + String.valueOf(time);
        }
    }, DataTypes.StringType);
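The UDF body itself is plain string manipulation, so it can be sanity-checked outside Spark. A minimal standalone sketch (the class name and sample values here are made up for illustration):

```java
public class ModeUdfSketch {
    // Same logic as the registered "mode" UDF: strip the dots from the
    // first argument and append the timestamp rendered as a string.
    static String mode(String types, Long time) {
        return types.replace(".", "") + String.valueOf(time);
    }

    public static void main(String[] args) {
        // "192.168.0.1" with the dots removed is "19216801"
        System.out.println(mode("192.168.0.1", 1500000000L)); // 192168011500000000
    }
}
```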
Averaging a field (with an integer-typed output), taking a field's maximum/minimum, formatting a date field for output, and so on can all be done directly in the Hive statement.
    String querySql = String.format(
        "SELECT mode(ip, unix_timestamp()) id, "
        + "ip, "
        + "cast(avg(t1) as bigint) f1, "
        + "cast(avg(t2) as bigint) f2, "
        + "min(t3) minSpeed, "
        + "max(t4) maxSpeed, "
        + "from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') time "
        + "FROM test.table "
        + "where time > %s "
        + "group by ip", timeLimit);
- unix_timestamp(): get the current timestamp
- cast(expression AS data_type): data type conversion
- from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss'): format a date for output
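For reference, the cast-to-bigint and from_unixtime behaviour used in the query can be mirrored in plain Java; a sketch, with the epoch value and time zone chosen purely for illustration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class HiveFunctionSketch {
    public static void main(String[] args) {
        // cast(avg(...) as bigint) truncates toward zero, like a (long) cast
        double avg = 3.7;
        long f1 = (long) avg; // 3

        // from_unixtime(ts, 'yyyy-MM-dd HH:mm:ss') formats epoch seconds
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        String time = fmt.format(new Date(0L)); // epoch second 0, in ms

        System.out.println(f1 + " " + time); // 3 1970-01-01 00:00:00
    }
}
```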
6. Write to ES
Check that the data is correct with ds.show()
    Dataset<Row> ds = spark.sql(querySql);
    EsSparkSQL.saveToEs(ds, "sha_parking/t_speedInformation");
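EsSparkSQL picks up the cluster location from the elasticsearch-hadoop settings, which can be set on the SparkConf (for example via conf.set("es.nodes", ...)). A sketch of the usual ones; the host and port values are placeholders for your own cluster:

```properties
es.nodes=127.0.0.1
es.port=9200
es.index.auto.create=true
```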
Package the project
- mvn package
- mvn assembly:assembly (builds a jar that includes the dependencies)
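For mvn assembly:assembly to produce that jar-with-dependencies, the pom needs the assembly plugin configured; a minimal sketch:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
```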
Problems encountered
If the data cannot be read, first confirm the following configuration:
- Make sure 127.0.0.1 hostname has been added to /etc/hosts
- Check $SPARK_HOME/conf/spark-env.sh to make sure the IP address is correct
Problem: Hive Schema version 2.1.0 does not match metastore
    mysql -uroot -p
    use hive;
    select * from VERSION;
    update VERSION set SCHEMA_VERSION='2.1.1' where VER_ID=1;
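An alternative to editing the metastore table directly is to relax the check in hive-site.xml; a sketch, to be used with care since it hides real version mismatches:

```xml
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>
```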