1. Experimental Purpose
(1) Understand Hive's role as a data warehouse in the Hadoop architecture.
(2) Skilled in using commonly used HiveQL.
2. Experimental Platform
- Operating system: Ubuntu 18.04 (or Ubuntu 16.04);
- Hadoop version: 3.1.3;
- Hive version: 3.1.2;
- JDK version: 1.8.
3. Data Sets
Dead work:
-
Provided by Hive Programming Guide (O'Reilly series, People's Posts and Telecommunications Press), download address:
https://raw.githubusercontent.com/oreillymedia/programming_hive/master/prog-hive-1st-ed-data.zip -
Alternate download address:
https://www.cocobolo.top/FileServer/prog-hive-1st-ed-data.zip -
Slow download refers to the resources I uploaded: Forest Rain Hive Dataset Download
After decompression, two files, stocks.csv and dividends.csv, are available for this experiment.
Enter your Downloads folder, right-click to unzip the data package you just downloaded, enter the prog-hive-1st-ed-data folder, and right-click to open the terminal:
cd ~/Downloads/prog-hive-1st-ed-data sudo cp ./data/stocks/stocks.csv /usr/local/hive sudo cp ./data/dividends/dividends.csv /usr/local/hive
Enter the Hadoop directory and start Hadoop:
cd /usr/local/hadoop sbin/start-dfs.sh
Start MySQL:
service mysql start
Switch to the Hive directory and start MySQL and Hive:
cd /usr/local/hive bin/hive
IV. EXPERIMENTAL STEPS
(1) Create an internal table, stocks, with a field delimiter of English commas, with the following table structure:
stocks table structure:
col_name | data_type |
---|---|
exchange | string |
symbol | string |
ymd | string |
price_open | float |
price_high | float |
price_low | float |
price_close | float |
volume | int |
price_adj_close | float |
Code:
create table if not exists stocks ( `exchange` string, `symbol` string, `ymd` string, `price_open` float, `price_high` float, `price_low` float, `price_close` float, `volume` int, `price_adj_close` float ) row format delimited fields terminated by ',';
View Table:
hive> describe stocks; OK exchange string symbol string ymd string price_open float price_high float price_low float price_close float volume int price_adj_close float Time taken: 0.062 seconds, Fetched: 9 row(s) hive>
(2) Create an external partition table dividends (partition fields exchange and symbol) with field delimiters in English commas, and the table structure is as follows:
dividends table structure
col_name | data_type |
---|---|
ymd | string |
dividend | float |
exchange | string |
symbol | string |
Code:
create external table if not exists dividends ( `ymd` string, `dividend` float ) partitioned by(`exchange` string ,`symbol` string) row format delimited fields terminated by ',';
View Table:
hive> describe dividends; OK ymd string dividend float exchange string symbol string # Partition Information # col_name data_type comment exchange string symbol string Time taken: 0.106 seconds, Fetched: 9 row(s) hive>
(3) Import data from the stocks.csv file into the stocks table:
Code:
load data local inpath '/usr/local/hive/stocks.csv' overwrite into table stocks;
(4) Create an unpartitioned external table dividends_unpartitioned and import data from dividends.csv with the following table structure:
dividends_unpartitioned table structure
col_name | data_type |
---|---|
ymd | string |
dividend | float |
exchange | string |
symbol | string |
Code:
create external table if not exists dividends_unpartitioned ( `exchange` string , `symbol` string, `ymd` string, `dividend` float ) row format delimited fields terminated by ',';
Import data:
load data local inpath '/usr/local/hive/dividends.csv' overwrite into table dividends_unpartitioned;
(5) By dividends_unpartitioned query statement that uses the Hive auto-partitioning feature to insert corresponding data into each partition of the partition table dividends.
Code:
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.max.dynamic.partitions.pernode=1000; insert overwrite table dividends partition(`exchange`,`symbol`) select `ymd`,`dividend`,`exchange`,`symbol` from dividends_unpartitioned;
(6) Query the closing price (price_close) of IBM (symbol = IBM) for all dividend payment days (corresponding records in the dividends table) since 2000.
The operation statements are as follows:
select s.ymd,s.symbol,s.price_close from stocks s LEFT SEMI JOIN dividends d ON s.ymd=d.ymd and s.symbol=d.symbol where s.symbol='IBM' and year(ymd)>=2000;
The output is as follows (fold part output):
2010-02-08 IBM 121.88 2009-11-06 IBM 123.49 2009-08-06 IBM 117.38 ... 2000-05-08 IBM 109.75 2000-02-08 IBM 118.81 Time taken: 8.75 seconds, Fetched: 41 row(s)
(7) Query the rise and fall of Apple (symbol = AAPL) on each trading day in October 2008, showing rise, fall, unchange.
The operation statements are as follows:
select ymd, case when price_close-price_open>0 then 'rise' when price_close-price_open<0 then 'fall' else 'unchanged' end as situation from stocks where symbol='AAPL' and substring(ymd,0,7)='2008-10';
The output is as follows (fold part output):
2008-10-31 rise 2008-10-30 rise ... 2008-10-02 fall 2008-10-01 fall Time taken: 0.1 seconds, Fetched: 23 row(s)
(8) Query the exchange, the symbol, the date, the closing price, the opening price and the difference between the closing price and the opening price of the record with the closing price higher than the opening price.
The operation statements are as follows:
select `exchange`,`symbol`,`ymd`,price_close,price_open,price_close-price_open as `diff` from ( select * from stocks order by price_close-price_open desc limit 1 )t;
The output is as follows:
NASDAQ INFY 2000-02-11 670.06 534.5 135.56 Time taken: 4.476 seconds, Fetched: 1 row(s)
9) Query the annual adjusted closing price (price_adj_close) of Apple (symbol=AAPL) from the stocks table for years with an average adjusted closing price greater than $50 and for years with an average adjusted closing price.
The operation statements are as follows:
select year(ymd) as `year`, avg(price_adj_close) as avg_price from stocks where `exchange`='NASDAQ' and symbol='AAPL' group by year(ymd) having avg_price > 50;
The output is as follows:
2006 70.81063753105255 2007 128.27390423049016 2008 141.9790115054888 2009 146.81412711976066 2010 204.72159912109376 Time taken: 2.347 seconds, Fetched: 5 row(s)
(10) Query the stock codes of the top three companies in the annual average adjusted closing price (price_adj_close) and the annual average adjusted closing price.
The operation statements are as follows:
select t2.`year`,symbol,t2.avg_price from ( select *,row_number() over(partition by t1.`year` order by t1.avg_price desc) as `rank` from ( select year(ymd) as `year`, symbol, avg(price_adj_close) as avg_price from stocks group by year(ymd),symbol )t1 )t2 where t2.`rank`<=3;
The output is as follows (fold part output):
NULL stock_symbol NULL 1962 IBM 2.0072222134423634 1962 GE 0.16876984293025638 ... 2009 GTC 174.11607115609306 2010 ISRG 319.75360107421875 2010 AMEN 313.875 2010 GTC 214.36719848632814 Time taken: 7.715 seconds, Fetched: 140 row(s)