Spark reads Hive data (Java)

Keywords: Big Data, Spark, Hive, SQL, XML

Requirement: read the data out of Hive and write it into Elasticsearch (ES).

Environment: Spark 2.0.2

1. Set enableHiveSupport() on the SparkSession
		import org.apache.spark.SparkConf;
		import org.apache.spark.sql.SparkSession;

		SparkConf conf = new SparkConf().setAppName("appName").setMaster("local[*]");

		SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example hive")
                .config(conf)
                .enableHiveSupport()  // enable Hive support
                .getOrCreate();
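
A quick sanity check (my own addition, not from the original): if Hive support is wired up correctly, your Hive databases should be visible:

		spark.sql("show databases").show();  // should list the Hive databases, at least 'default'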

2. Add the dependency to pom.xml (the version does not have to match your Hive version)
		<dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.2.1</version>
        </dependency>

or:

		<dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.3.0</version>
        </dependency>

3. Put the configuration files under Spark's conf/ directory

Reference: the official documentation:

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
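
For example, a minimal sketch assuming the usual installation paths (adjust them to your environment):

		cp /usr/local/hive/conf/hive-site.xml $SPARK_HOME/conf/
		cp /usr/local/hadoop/etc/hadoop/core-site.xml $SPARK_HOME/conf/
		cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/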
4. Read data with spark.sql
		SparkSession spark = ESMysqlSpark.getSession();  // the project's own helper that returns the SparkSession
		String querySql = "SELECT * FROM test.table";
		Dataset<Row> rows = spark.sql(querySql);
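
To sanity-check the result (my addition, not in the original), print the schema and a few rows:

		rows.printSchema();  // column names and types read from the Hive metastore
		rows.show(10);       // first 10 rows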

5. Hive SQL statements

Requirement: concatenate two fields into a new string.

You can first register a UDF:

		import org.apache.spark.sql.api.java.UDF2;
		import org.apache.spark.sql.types.DataTypes;

		// "mode" strips the dots from an IP string and appends a timestamp
		spark.udf().register("mode", new UDF2<String, Long, String>() {
                public String call(String types, Long time) throws Exception {
                    return types.replace(".", "") + String.valueOf(time);
                }}, DataTypes.StringType);
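
On Java 8 the same registration can be written as a lambda; the cast is needed to pick the UDF2 overload (a sketch of an equivalent form, not from the original):

		spark.udf().register("mode",
                (UDF2<String, Long, String>) (types, time) -> types.replace(".", "") + time,
                DataTypes.StringType);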

Requirements such as averaging a field (with the output as an integer type), taking a field's maximum/minimum, or formatting a date field can all be expressed directly in the Hive statement:

			String querySql = String.format("SELECT mode(ip, unix_timestamp()) id," +
                    " ip, " +
                    "cast(avg(t1) as bigint) f1, " +
                    "cast(avg(t2) as bigint) f2, " +
                    "min(t3) minSpeed, " +
                    "max(t4) maxSpeed, " +
                    "from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') time " +
                    "FROM test.table " +
                    "where time > %s " +
                    "group by ip ", timeLimit);
  • unix_timestamp() returns the current timestamp
  • cast(expression AS data_type) converts between data types
  • from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') formats a date for output (see the sketch below)
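
A minimal, table-free sketch of these three functions (my own illustration):

			spark.sql("SELECT unix_timestamp() AS now, " +
                    "cast(3.7 AS bigint) AS truncated, " +  // cast truncates: 3
                    "from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') AS formatted").show();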
6. Write to ES

Check whether the data is correct with ds.show(), then save:

			Dataset<Row> ds = spark.sql(querySql);
            EsSparkSQL.saveToEs(ds, "sha_parking/t_speedInformation");
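
saveToEs has to know where the cluster is. The elasticsearch-hadoop connection settings can be passed when building the session; the host and port below are assumptions for a local node:

			SparkSession spark = SparkSession.builder()
                    .appName("hive-to-es")
                    .config("es.nodes", "127.0.0.1")        // assumed local ES node
                    .config("es.port", "9200")              // default ES HTTP port
                    .config("es.index.auto.create", "true") // create the index if it does not exist
                    .enableHiveSupport()
                    .getOrCreate();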

Package the project

  • mvn package
  • mvn assembly:assembly packages the dependencies as well (see the plugin sketch below)
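
For mvn assembly:assembly to produce a jar that contains the dependencies, a typical plugin section in pom.xml looks like this (a sketch using the standard jar-with-dependencies descriptor):

		<plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>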

Problems encountered

If the data cannot be read, confirm the following configuration first:

  1. Make sure the mapping "127.0.0.1 hostname" has been added to /etc/hosts
  2. Check $SPARK_HOME/conf/spark-env.sh to make sure the IP address is correct
Problem: Hive Schema version 2.1.0 does not match metastore

Fix it by updating the schema version recorded in the metastore database:

		mysql -uroot -p
		use hive;
		select * from VERSION;
		update VERSION set SCHEMA_VERSION='2.1.1' where VER_ID=1;
