1, Hive's architecture design and SQL statement review summary
1.1. Data warehouse
The definition put forward by Bill Inmon, the father of data warehouse, in his book "Building the Data Warehouse" published in 1991 is widely accepted. (DW for short) (DM)
Data Warehouse is a Subject Oriented, Integrated, non volatile and Time Variant data set used to support decision making support.
Subject Oriented: organize data by topic (multiple fact tables + multiple dimension tables)
Integrated: the data is placed in a non volatile data warehouse according to the theme: it will not change basically and will be fixed at birth. Therefore, there is no need for update and delete in the data warehouse to reflect the historical change: historical data, and each record represents a fact at a certain time in the past
Value of data warehouse: used to support decision making support.
Manage data of multiple different topics: a topic can be understood as a table or a class of closely related tables
Data warehouse: store data, regardless of the format of the data, read mode data storage system.
Relatively stable: Generally speaking, the data stored in Hive is log data that will not be modified
Support management decision-making: the results of query analysis are of great value
Data warehouse: for data, historical data, standardized structural data and text data, query and analysis, and support the select syntax in SQL!
1.2. Hive concept
Explanation on the official website: http://hive.apache.org
1. Hive from Facebook Implement and open source(google, yahoo)，hive yes facebook(mapreuce Low development efficiency) 2. Hive Is based on Hadoop A data warehouse tool 3. Hive The stored data is actually stored at the bottom HDFS upper 4. Hive take HDFS The structured data on is mapped to a database table, similar to excel perhaps msyql Table of 5. Hive provide HQL(Hive SQL)Query function 6. Hive The essence of is to SQL Statement to MapReduce Task running makes unfamiliar MapReduce It is convenient for users to use HQL Processing and calculation HDFS Structured data on, suitable for offline batch data calculation (improving development efficiency) 7. Hive Users can greatly simplify the writing of distributed computing programs and focus on business logic
Hive relies on HDFS to store data. Hive converts HQL into MapReduce for execution. Therefore, hive is a data warehouse tool based on Hadoop. In essence, it is a MapReduce computing framework based on HDFS to analyze and manage the data stored in HDFS.
1.3 hive architecture
[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-Tfm4FJ70-1633923115698)(assets/image-20210818122452585.png)]
Hive has four internal components: Driver, Compiler, Optimizer, and Executor
The Driver component completes the generation of HQL query statements from lexical analysis, syntax analysis, compilation, optimization, and logical execution plan. The generated logical execution plan is stored in HDFS and then called and executed by MapReduce.
Metadata in HDFS: (storage of large files: decentralized storage + redundant storage)
- Directory tree space
- Which small files do each large file correspond to
- A data block has three copies, so which three nodes are the three copies of the data block stored in?
Hive's metadata: (imagine the structured data stored on HDFS as a two-dimensional table)
- Table name (mapping relationship between table and data file)
- Field definitions (consisting of those fields)
- Row and column separator (since the file is imagined as a two-dimensional table, the boundaries between rows and columns)
- Data in the table (the file stored on HDFS, which table corresponds to the file?)
Hive's metadata management relies on a relational database! (enterprise best practice: MySQL) hive comes with an embedded small RDBM: derby
1.4. Summary of supported SQL syntax
1,select * from db.table1 2,select count(distinct uid) from db.table1 3,support select,union all,join(left,right,full join),like,where,having,limit Equivalent standard grammar 4,Support various common functions, aggregate functions and table functions 5,support json analysis 6,Support function customization: UDF(User Defined Function)/ UDAF/UDTF 7,I won't support it update and delete，Both old and new versions are not supported, and there is no need to support! 8,hive Although support in/exists((not supported in older versions), but hive Recommended use semi join Instead of implementation, it is more efficient. 9,support case ... when ...
1,support and Multiple conditions join，I won't support it or Multiple conditions join select a.*, b.* from a join b on a.id = b.id and a.name = b.name; √√√√√ select a.*, b.* from a join b on a.id = b.id or a.name = b.name; xxxxx 2,Cartesian product is not supported by default select a.*, b.* from a, b;
Your usage principle: most SQL and common SQL syntax are supported!
2, Hive's environment construction and syntax explanation
- hadoop cluster must be installed! Whether it is stand-alone, pseudo distributed, distributed, HA or federated.
- Also install a MySQL (as long as there is a MySQL that can be used and linked. No matter where it is)
- access denied
- Solution: remote connection permission problem
2.1. Environment construction
2.1.1. Dependent environment
Hive cannot run independently! You need to rely on an RDBMS (to help it store metadata) and a file system (to help it store real data)
- An RDBMS must be prepared (it can also be the derby embedded database provided by hive, which is not applicable to the production environment, and MySQL is recommended)
- You must prepare a Hadoop (in fact, you don't have to, you can directly use the linux file system to store it)
2.1.2. Version selection
What aspects should be considered when selecting the version of any technology: function, stability, maintainability, compatibility
Strategy: consider stable versions that are neither new nor old
There are currently two mainstream versions of Hive
- hive-1.x, of which hive-1.2.x is the most widely used version, and the last stable version is hive-1.2.2. If you use this version, your hive will not be able to integrate spark in the future.
- hive-2.x, the current mainstream version of hive. Among the current stable versions of hive-2.x, we choose to use hive-2.3.6
- apache-hive-2.3.6-bin.tar.gz installation package
- apache-hive-2.3.6-src.tar.gz source package
2.1.3. Built in Derby version
Purpose: test, learn and use quickly
1. Upload installation package
put -r apache-hive-2.3.6-bin.tar.gz
2. Unzip the installation package
[bigdata@bigdata05 ~]$ tar -zxvf ~/soft/apache-hive-2.3.6-bin.tar.gz -C ~/apps/
3. Enter the bin directory and run the hive script
[bigdata@bigdata05 ~]$ cd apps/apache-hive-2.3.6-bin/bin [bigdata@bigdata05 bin]$ ./hive
4. Test use
show databases; create database testdb; use testdb; create table student(id int, name string); insert into table student values ('1','huangbo'), ('2','xuzheng'),('3','wangbaoqiang'); select * from student;
2.1.4. External MySQL version
Purpose: suitable for enterprise production environment
Step 1: prepare MySQL and Hadoop
Please refer to the official website for installation or install by yourself MySQL，Or an available MySQL. Whatever you want MySQL Where, just Hive It can be connected. If Hive and MySQL It is not installed on the same server node, and usually the enterprise environment is not installed on the same node. Remember to Hive open MySQL Remote link permissions for. main points: 1,There is one that works MySQL All right 2,Here MySQL Configure remote connection permissions
Step 2: upload the installation package
Install apache-hive-2.3.6-bin.tar.gz into the server node where Hive is installed.
Upload this installation package to your server in a way you know. Or use the wget command to download it now. Either install the lrzsz command for centos and upload it through the rz command, or use the ftp tool such as filezilla.
Step 3: unzip the installation package into the corresponding Hive installation directory
[bigdata@bigdata05 ~]$ tar -zxvf ~/soft/apache-hive-2.3.6-bin.tar.gz -C ~/apps/
Step 4: modify the configuration file
[bigdata@bigdata05 ~]$ cd ~/apps/apache-hive-2.3.6-bin/conf [bigdata@bigdata05 ~]$ touch hive-site.xml
Note: this file does not exist by default. It is usually created by yourself. Of course, it can also be changed from hive-default.xml, which is not recommended.
[bigdata@bigdata05 ~]$ vi hive-site.xml
Add the following screenshot to the newly created configuration file:
<configuration> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://bigdata02:3306/myhivemetadb236?createDatabaseIfNotExist=true&verifyServerCertificate=false&useSSL=false </value> <description>JDBC connect string for a JDBC metastore</description> <!-- If mysql and hive In the same server node, please change bigdata02 by localhost --> <!-- bigdata02 Installed for you MySQL Server node, please change to your own --> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <!-- SQL The name of the driver class does not need to be changed --> <description>Driver class name for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>root</value> <!-- connect MySQL User account name --> <description>username to use against metastore database</description> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>QWer_1234</value> <!-- connect MySQL Login password of the user --> <description>password to use against metastore database</description> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/hive236/warehouse236</value> <description>hive default warehouse, if nessecory, change it</description> <!-- Optional configuration, which is used to specify Hive The data of the data warehouse is stored in HDFS Directory on --> </property> </configuration>
Step 5: add MySQL driver
Remember to add the MySQL driver package mysql-connector-java-5.1.40-bin.jar, which is placed in the lib directory under hive's installation root path. After all, hive needs to read and write mysql.
[bigdata@bigdata05 ~]$ cp ~/soft/mysql-connector-java-5.1.40-bin.jar ~/apps/apache-hive-2.3.6-bin/lib
Step 6: copy the configuration file of Hadoop cluster
Remember to place the core-site.xml and hdfs-site.xml configuration files in the Hadoop cluster in the conf directory under the Hive installation directory.
[bigdata@bigdata05 ~]$ cp $HADOOP_HOME/etc/hadoop/core-site.xml ~/apps/apachehive-2.3.6-bin/conf [bigdata@bigdata05 ~]$ cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml ~/apps/apachehive-2.3.6-bin/conf
Step 7: after installation, configure environment variables
Add the following two lines:
export HIVE_HOME=/home/bigdata/apps/apache-hive-2.3.6-bin export PATH=$PATH:$HIVE_HOME/bin
Finally, don't forget to make the environment variable configuration effective:
[bigdata@bigdata05 ~]$ source ~/.bashrc
Step 8: verify Hive installation
[bigdata@bigdata05 ~]$ hive --service version
Step 9: initialize metabase
Note: when hive is a version before 1.x, it is OK not to initialize it. When hive is started for the first time, it will be initialized automatically, but whether enough tables in the metabase will be generated. It will be generated slowly during use. But finally initialize. If you are using version 2.x of hive, you must manually initialize the metabase. Use command:
[bigdata@bigdata05 ~]$ schematool -dbType mysql -initSchema
Remember: initialization only needs to be done once. If you use it for a period of time and execute this command again, you will get a new Hive.
Log of successful execution:
[bigdata@bigdata04 hadoop]$ schematool -dbType mysql -initSchema SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/home/bigdata/apps/apache-hive-2.3.6-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/home/bigdata/apps/hadoop2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] Metastore connection URL: jdbc:mysql://bigdata02:3306/hivemetadb2003?createDatabaseIfNotExist=true&verifyServerCertificate=false&useSSL=false Metastore Connection Driver : com.mysql.jdbc.Driver Metastore connection User: root Starting metastore schema initialization to 2.3.0 Initialization script hive-schema-2.3.0.mysql.sql Initialization script completed schemaTool completed
Step 10: start Hive client
[bigdata@bigdata05 ~]$ hive [bigdata@bigdata05 ~]$ hive --service cli
Step 11: exit Hive
hive> quit; hive> exit;
Step 12: precautions
When using Hive, be sure to do three things:
- Ensure that the RDBMS metadata is started.
- Make sure HDFS is started. It can operate normally
- Ensure that the YARN cluster is started and can operate normally
2.1.5. HiveServer2 service
The deployment method is as follows:
First: modify the hdfs-site.xml configuration file of hadoop cluster: add a configuration message to enable webhdfs
<property> <name>dfs.webhdfs.enabled</name> <value>true</value> </property>
Second: modify the core-site.xml configuration file of hadoop cluster: add two pieces of configuration information: it indicates the proxy user setting hadoop cluster
<property> <name>hadoop.proxyuser.bigdata.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.bigdata.groups</name> <value>*</value> </property>
Modify the core-site.xml and hdfs-site.xml in the Hadoop cluster, and remember that all nodes can be modified. Restart the Hadoop cluster.
Third: start the Hiveserver2 service
[bigdata@bigdata05 ~]$ nohup hiveserver2 1>/home/bigdata/logs/hiveserver.log 2>/home/bigdata/logs/hiveserver.err &
Step 4: start beeline client
beeline -u jdbc:hive2://bigdata03:10000 -n bigdata
1.6. Hiveserver2 Web UI
Since version 2.0, Hive has provided a simple WEB UI interface for HiveServer2. In the interface, you can intuitively see the current linked session, history log, configuration parameters and measurement information.
Modified: $HIVE_HOME/conf/hive-site.xml configuration file:
<property> <name>hive.server2.webui.host</name> <value>bigdata03</value> </property> <property> <name>hive.server2.webui.port</name> <value>10002</value> </property>
You can refer to the official website: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-WebUIforHiveServer2
Restart Hiveserver2 to access the Web UI: http://bigdata05:10002
2.2. Hive SQL syntax full operation
2.2.1. Database operation
create database mydb; create databse if not exists mydb comment 'create my db named dbname' with dbproperties ('a'='aaa','b'='bbb')
show create database mydb; -- View create statement desc database extended mydb; -- Query database details show databases; -- List all databases drop database mydb; -- Delete database drop database if exists mydb cascade; -- Delete database use mydb; -- Use database show tables; -- View the data table of the database show tables in mydb; -- View the data table of the database
2.2.2. Add table
Create internal table (managed_table)
create table mingxing(id int, name string, sex string, age int, department string) row format delimited fields terminated by ',';
create external table mingxing(id int, name string, sex string, age int, department string) row format delimited fields terminated by ',' location '/root/hivedata'; -- Note: specify when creating an external table location The location of must be a directory, not a single file
create table mingxing(id int, name string, sex string, age int, department string) partitioned by (city string) row format delimited fields terminated by ','; -- be careful: partition The field in the table cannot be the field declared in the table
create table mingxing(id int, name string, sex string, age int, department string) clustered by(id) sorted by(age desc) into 4 buckets row format delimited fields terminated by ',';
2.2.3. Delete table
drop table [if exists] mingxing;
2.2.4. Rename table
alter table mingxing rename to student;
2.2.5. Modification table
alter table mingxing add columns (province string); -- Add field alter table mingxing add columns (province string, xxx bigint); -- Add field drop(Not supported), alternatives can be used replace -- Delete field alter table mingxing change age newage string; -- Modify field alter table mingxing change age newage string first|after id; -- Modify field alter table mingxing replace columns(id int, name string, sex string); -- replace field
2.2.6. hive partition table
alter table mingxing add partition(city='beijing'); -- Add partition alter table mingxing add partition(city='beijing') partition(city='tianjin'); -- Add partition alter table mingxing drop if exists partition(city='beijing'); -- delete a partition alter table mingxing partition(city='beijing') set location '/home/hadoop/data/beijing'; -- Modify partition data path
2.2.7. Various common query and display commands
show databases; -- View Library: show create database mydb; -- View database creation statements show tables; -- View table show tables in mydb; -- View table show create table mingxing; -- View table creation statement show functions; -- View built-in function library show partitions mingxing; -- View table fields desc extended mingxing; -- View table details desc formatted mingxing; -- View the formatted details of the table
2.2.8. Import data in load mode
load data local inpath './student.txt' into table mingxing; -- Import data for local relative path load data local inpath './student.txt' overwrite into table mingxing; -- Import local relative path data overwrite import load data local inpath '/root/hivedata/student.txt' into table mingxing; -- Import local absolute path data load data inpath '/root/hivedata/student.txt' into table mingxing; -- Import HDFS Simple path data on load data inpath 'hdfs://hadoop01:9000/root/hivedata/student.txt' into table mingxing; -- Import data in full path mode on HDFS
2.2.9. Insert data into the table using the insert keyword
insert into table mingxing values(001,'huangbo','male',50,'MA'); -- Single data insertion
Single insertion mode
insert into table student select id,name,sex,age,department from mingxing; -- Note: the queried fields must be student Fields that exist in the table
Multiple insertion mode
from mingxing insert into table student1 select id,name,sex,age insert into table student2 select id,department;
from mingxing2 insert into table student1 partition(department='MA') select id,name,sex ,age where department='MA' insert into table student1 partition(department='CS') select id,name,sex ,age where department='CS';
Static partition insert
load data local inpath '/root/hivedata/student.txt' into table student partition(city='henan');
Dynamic partition insertion
create table student(name string, department string) partitioned by (id int) -- student Table fields: name,department， The partition field is id ..... insert into table student partition(id) select name,department,id from mingxing2; -- The query fields are: name,department,id，Partition field -- Note: the partition field inserted by dynamic partition must be the last field in the query statement
CTAS(create table... as select...) directly stores the query results in a new table
create table student as select id,name,age,department from mingxing; -- Note: as like as two peas, the fields and the fields of the query fields appear in the new automatically table.
2.2.10. Use of like keyword
create table student like mingxing; -- Imitation table
2.2.11. Using insert to export data to local or hdfs
Single mode export data to local
insert overwrite local directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Multi mode export data to local
from mingxing insert overwrite local directory '/root/outputdata1' select id, name insert overwrite local directory '/root/outputdata2' select id, name,age
Export simple path mode to hdfs
insert overwrite directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Query data to hdfs in full path mode
insert overwrite directory 'hdfs://hadoop01:9000/root/outputdata1' select id,name,sex,age,department from mingxing;
2.2.12. Clear database table
truncate table mingxing;
2.2.13. Query data
select * from mingxing join student where ... group by ... order by ... limit...
Query global ordered data
select * from mingxing order by age desc , id asc;
If the amount of data is too large, we use local sorting:
set mapred.reduce.tasks=3; set mapreduce.job.reduces=3; select * from mingxing sort by id asc;
set hive.enforce.bucketing = true; select * from mingxing distribute by sex;
Query sorted bucket data
select * from mingxing cluster by id sort by id desc, age asc;
2.2.14. Five link queries
select student.*, mingxing.* from student join mingxing on student.id = mingxing.id select student.*, mingxing.* from student inner join mingxing on student.id = mingxing.id
left outer join:
select student.*, mingxing.* from student left join mingxing on student.id = mingxing.id select student.*, mingxing.* from student left outer join mingxing on student.id = mingxing.id
right outer join
select student.*, mingxing.* from student right join mingxing on student.id = mingxing.id select student.*, mingxing.* from student right outer join mingxing on student.id = mingxing.id
full outer join
select student.*, mingxing.* from student full join mingxing on student.id = mingxing.id select student.*, mingxing.* from student full outer join mingxing on student.id = mingxing.id
in/exists hive efficiently implements left semi join:
select student.*, mingxing.* from student left semi join mingxing on student.id = mingxing.id;
select student.* from student where student.id in(select distinct id from mingxing);
95001,Li Yong,male,20,CS 95002,Liu Chen,female,19,IS 95003,Wang Min,female,22,MA 95004,Zhang Li,male,19,IS 95005,Liu Gang,male,18,MA 95006,Sun Qing,male,23,CS 95007,Yi Siling,female,19,MA 95008,Li Na,female,18,CS 95009,Dream round,female,18,MA 95010,Kong Xiaotao,male,19,CS 95011,Tino ,male,18,MA
create database if not exists exercise_db; use exercise_db; drop table if exists exercise_student; create table exercise_student(id int, name string, sex string, age int, department string) row format delimited fields terminated by ','; load data local inpath "/home/bigdata/students.txt" into table exercise_student; select * from exercise_student;
2.3. Hive's four tables
hive table has a mapping relationship: a table may store and manage many data files!
How are hive tables and data files mapped? select count(*) fro table;
table and HDFS directories
Default path of data warehouse:/user/hive/warehouse/ Library path: nxde1.db Table path: nx_student Data file: stduent.txt Full path to the data store directory of the current table: /user/hive/warehouse/nxde1.db/nx_student/stduent.txt Mapping relationship: Table: nx_student HDFS catalog:/user/hive/warehouse/nxde1.db/nx_student/
2.3.1. Internal table
create table nx_student()
When you create a table, the default organization method is used to store the data of the table in the default warehouse path
2.3.2. External table
When you create a table, in fact, the data is already in HDFS, but not in your default directory. You need to specify the directory when you create the table:
create external table nx_student() row .... location "/aa/bb/ccc"
When deleting a table:
- Internal table: delete metadata and delete data files
- External table: delete metadata without deleting data files
2.3.3. Zoning table
Default path of data warehouse:/user/hive/warehouse/ Library path: nxde1.db Table path: nx_student Partition: city=beijing city=shanghai Data file: stduent.txt Full path to the data store directory of the current table: /user/hive/warehouse/nxde1.db/nx_student/city=shanghai/stduent.txt
create table student_ptn(id int, name string) partitioned by(age int)
Important: the partition field must not be the same as the table, but when used, the partition field is not different from the table field!
Previously, the data was directly placed in the table directory. Now, create folders in the table directory. Each folder guarantees part of the data of the table. This means that the data of a table can be managed by category
select * from student_ptn where city = beijing; select * from student_ptn where date > 2020-08-21; Effect: scan 3 partitions
In the production environment, most of the fact tables are partitioned tables (most of the partitioned fields are also of date type: the probability of partitioning by day is the greatest)
2.3.4. Barrel separation table
Partitions are organized in the form of folders
Pail is as like as two peas in the form of a file. The rule of dividing a barrel is exactly the same as the HashPartitioner of the default mapreduce.
create table student_bck(id int, name string, age int) clustered by (name) sorted by (age desc) into 3 buckets; According to the data in this table name Fields are divided into three files for management, and the data in each file will be managed according to age The descending sort rule is MR Default in hash The logic of the hash component is the same
1,Partition table: used to submit query efficiency. Enterprise best practices: 1,If you meet a lot of people SQL In, it will be filtered by a field. So when you build a table. You should use this field as a partition field to create a table 2,If your data basically only analyzes incremental data, not full data, you should partition it according to time 1,Bucket table: also used to improve query efficiency + Used for sampling Enterprise best practices: 1,If a field in a table is often regarded as a join The query criteria should be optimized by creating a bucket table select a.*, b.* from a join b on a.id = b.id; a Table and b All watches should be divided into barrels, and the rules of barrel division should be the same. (the bucket distribution fields are consistent, and the number of buckets is a multiple)
Data warehouse: fact table and dimension table
Query: fact table join dimension table
Whether it is partition or bucket, data skew will occur!
- Data skew in hive 3-5
- There are 11 kinds of data skew in spark
- If you still choose to use hive, you accept MR as the engine by default
- Hive On Spark is not as stable as expected
- SparkSQL is fully hive compatible.
hive + sparkSQL
DAG execution engine: spark tez
Introduction: OLAP engine
- clickhouse kylin druid kudu doris ...>