Spark SQL uses beeline to access hive warehouse

Keywords: Big Data Hadoop hive Spark JDBC

I. add hive-site.xml

Add the hive-site.xml configuration file under $SPARK_HOME/conf in order to access hive metadata normally

vim hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://192.168.1.201:3306/hiveDB?createDatabaseIfNotExist=true</value>
        </property>

    <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
            <value>123456</value>
        </property>
        <!-- hive Output column name when querying -->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <!-- Display current database name -->
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
</configuration>

Note: there is no need to deploy hive on the node, as long as you can connect to the metadata of hive!

2. Start the thriftserver service

[hadoop@hadoop003 spark]$ ./sbin/start-thriftserver.sh --jars ~/softwares/mysql-connector-java-5.1.47.jar 
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, 
logging to /home/hadoop/app/spark/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop003.out

Check the log to make sure that the thriftserver service starts normally

[hadoop@hadoop003 spark]$ tail -50f /home/hadoop/app/spark/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop003.out

19/05/21 09:39:14 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:15 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
19/05/21 09:39:15 INFO metastore.ObjectStore: Initialized ObjectStore
19/05/21 09:39:15 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
19/05/21 09:39:15 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
19/05/21 09:39:15 INFO metastore.HiveMetaStore: Added admin role in metastore
19/05/21 09:39:15 INFO metastore.HiveMetaStore: Added public role in metastore
19/05/21 09:39:15 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
19/05/21 09:39:15 INFO metastore.HiveMetaStore: 0: get_all_databases
19/05/21 09:39:15 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_all_databases   
19/05/21 09:39:15 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
19/05/21 09:39:15 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
19/05/21 09:39:15 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
19/05/21 09:39:16 INFO session.SessionState: Created local directory: /tmp/73df82dd-1fd3-4dd5-97f1-680d53bd44bc_resources
19/05/21 09:39:16 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc
19/05/21 09:39:16 INFO session.SessionState: Created local directory: /tmp/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc
19/05/21 09:39:16 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/73df82dd-1fd3-4dd5-97f1-680d53bd44bc/_tmp_space.db
19/05/21 09:39:16 INFO client.HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is file:/home/hadoop/app/spark-2.4.2-bin-hadoop-2.6.0-cdh5.7.0/spark-warehouse
19/05/21 09:39:16 INFO session.SessionManager: Operation log root directory is created: /tmp/hadoop/operation_logs
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread pool size: 100
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread wait queue size: 100
19/05/21 09:39:16 INFO session.SessionManager: HiveServer2: Background operation thread keepalive time: 10 seconds
19/05/21 09:39:16 INFO service.AbstractService: Service:OperationManager is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:SessionManager is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service: CLIService is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:ThriftBinaryCLIService is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service: HiveServer2 is inited.
19/05/21 09:39:16 INFO service.AbstractService: Service:OperationManager is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:SessionManager is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:CLIService is started.
19/05/21 09:39:16 INFO metastore.ObjectStore: ObjectStore, initialize called
19/05/21 09:39:16 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
19/05/21 09:39:16 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
19/05/21 09:39:16 INFO metastore.ObjectStore: Initialized ObjectStore
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: get_databases: default
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=get_databases: default  
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: Shutting down the object store...
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=Shutting down the object store...   
19/05/21 09:39:16 INFO metastore.HiveMetaStore: 0: Metastore shutdown complete.
19/05/21 09:39:16 INFO HiveMetaStore.audit: ugi=hadoop  ip=unknown-ip-addr  cmd=Metastore shutdown complete.    
19/05/21 09:39:16 INFO service.AbstractService: Service:ThriftBinaryCLIService is started.
19/05/21 09:39:16 INFO service.AbstractService: Service:HiveServer2 is started.
19/05/21 09:39:16 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
19/05/21 09:39:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a7c72af{/sqlserver,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e24bab6{/sqlserver/json,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@391e2a{/sqlserver/session,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@641cea11{/sqlserver/session/json,null,AVAILABLE,@Spark}
19/05/21 09:39:16 INFO thrift.ThriftCLIService: 
Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads#Flag start successful

III. start beeline

[hadoop@hadoop003 spark]$ ./bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop
Connecting to jdbc:hive2://localhost:10000
19/05/21 09:46:19 INFO jdbc.Utils: Supplied authorities: localhost:10000
19/05/21 09:46:19 INFO jdbc.Utils: Resolved authority: localhost:10000
19/05/21 09:46:19 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.4.2)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:10000> select * from student.student limit 5;
+---------+-----------+-----------------+--------------------------------------------+--+
| stu_id  | stu_name  |  stu_phone_num  |                 stu_email                  |
+---------+-----------+-----------------+--------------------------------------------+--+
| 1       | Burke     | 1-300-746-8446  | ullamcorper.velit.in@ametnullaDonec.co.uk  |
| 2       | Kamal     | 1-668-571-5046  | pede.Suspendisse@interdumenim.edu          |
| 3       | Olga      | 1-956-311-1686  | Aenean.eget.metus@dictumcursusNunc.edu     |
| 4       | Belle     | 1-246-894-6340  | vitae.aliquet.nec@neque.co.uk              |
| 5       | Trevor    | 1-300-527-4967  | dapibus.id@acturpisegestas.net             |
+---------+-----------+-----------------+--------------------------------------------+--+
5 rows selected (3.275 seconds)
0: jdbc:hive2://localhost:10000> 

Startup success

Four, pay attention to

1. It is better to start beeline in spark/bin directory
Because if you start spark beeline and deploy hive, it happens that your hive environment variable is just before the spark environment variable, it is likely that you start the bee line of hive
For example:

[hadoop@hadoop003 spark]$ beeline
ls: cannot access /home/hadoop/app/spark/lib/spark-assembly-*.jar: No such file or directory
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/spark/bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0//bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0//sbin:/home/hadoop/app/zookeeper/bin:/usr/java/jdk1.8.0_131/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
Beeline version 1.1.0-cdh5.7.0 by Apache Hive  # Isn't this hive
beeline> 

Now you can view the environment variables

[hadoop@hadoop003 spark]$ cat ~/.bash_profile 
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin

export PATH
#####JAVA_HOME#####
export JAVA_HOME=/usr/java/jdk1.8.0_131

####ZOOKEEPER_HOME####
export ZOOKEEPER_HOME=/home/hadoop/app/zookeeper

#####HADOOP_HOME######
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/

export SPARK_HOME=/home/hadoop/app/spark

#####HIVE_HOME#####
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$HIVE_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$JAVA_HOME/bin:$PATH

Sure enough, if you don't specify the beeline path, you will give priority to hive's beeline

Posted by mrodrigues on Wed, 06 Nov 2019 14:06:19 -0800