Experiment 6 is familiar with Hive's basic operations

Keywords: Big Data hive

1. Experimental Purpose

(1) Understand Hive's role as a data warehouse in the Hadoop architecture.
(2) Skilled in using commonly used HiveQL.

2. Experimental Platform

  • Operating system: Ubuntu 18.04 (or Ubuntu 16.04);
  • Hadoop version: 3.1.3;
  • Hive version: 3.1.2;
  • JDK version: 1.8.

3. Data Sets

Dead work:

After decompression, two files, stocks.csv and dividends.csv, are available for this experiment.

Enter your Downloads folder, right-click to unzip the data package you just downloaded, enter the prog-hive-1st-ed-data folder, and right-click to open the terminal:

cd ~/Downloads/prog-hive-1st-ed-data
sudo cp ./data/stocks/stocks.csv /usr/local/hive
sudo cp ./data/dividends/dividends.csv /usr/local/hive

Enter the Hadoop directory and start Hadoop:

cd /usr/local/hadoop
sbin/start-dfs.sh

Start MySQL:

service mysql start

Switch to the Hive directory and start MySQL and Hive:

cd /usr/local/hive
bin/hive

IV. EXPERIMENTAL STEPS

(1) Create an internal table, stocks, with a field delimiter of English commas, with the following table structure:

stocks table structure:

col_namedata_type
exchangestring
symbolstring
ymdstring
price_openfloat
price_highfloat
price_lowfloat
price_closefloat
volumeint
price_adj_closefloat

Code:

create table if not exists stocks
(
`exchange` string,
`symbol` string,
`ymd` string,
`price_open` float,
`price_high` float,
`price_low` float,
`price_close` float,
`volume` int,
`price_adj_close` float
)
row format delimited fields terminated by ',';

View Table:

hive> describe stocks;
OK
exchange            	string              	                    
symbol              	string              	                    
ymd                 	string              	                    
price_open          	float               	                    
price_high          	float               	                    
price_low           	float               	                    
price_close         	float               	                    
volume              	int                 	                    
price_adj_close     	float               	                    
Time taken: 0.062 seconds, Fetched: 9 row(s)
hive>

(2) Create an external partition table dividends (partition fields exchange and symbol) with field delimiters in English commas, and the table structure is as follows:

dividends table structure

col_namedata_type
ymdstring
dividendfloat
exchangestring
symbolstring

Code:

create external table if not exists dividends
(
`ymd` string,
`dividend` float
)
partitioned by(`exchange` string ,`symbol` string)
row format delimited fields terminated by ',';

View Table:

hive> describe dividends;
OK
ymd                 	string              	                    
dividend            	float               	                    
exchange            	string              	                    
symbol              	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
exchange            	string              	                    
symbol              	string              	                    
Time taken: 0.106 seconds, Fetched: 9 row(s)
hive>

(3) Import data from the stocks.csv file into the stocks table:

Code:

load data local inpath '/usr/local/hive/stocks.csv' overwrite into table stocks;

(4) Create an unpartitioned external table dividends_unpartitioned and import data from dividends.csv with the following table structure:

dividends_unpartitioned table structure

col_namedata_type
ymdstring
dividendfloat
exchangestring
symbolstring

Code:

create external table if not exists dividends_unpartitioned
(
`exchange` string ,
`symbol` string,
`ymd` string,
`dividend` float
)
row format delimited fields terminated by ',';

Import data:

load data local inpath '/usr/local/hive/dividends.csv' overwrite into table dividends_unpartitioned;

(5) By dividends_unpartitioned query statement that uses the Hive auto-partitioning feature to insert corresponding data into each partition of the partition table dividends.

Code:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
insert overwrite table dividends partition(`exchange`,`symbol`) select `ymd`,`dividend`,`exchange`,`symbol` from dividends_unpartitioned;

(6) Query the closing price (price_close) of IBM (symbol = IBM) for all dividend payment days (corresponding records in the dividends table) since 2000.

The operation statements are as follows:

select s.ymd,s.symbol,s.price_close
from stocks s 
LEFT SEMI JOIN 
dividends d
ON s.ymd=d.ymd and s.symbol=d.symbol
where s.symbol='IBM' and year(ymd)>=2000;

The output is as follows (fold part output):

2010-02-08	IBM	121.88
2009-11-06	IBM	123.49
2009-08-06	IBM	117.38
...
2000-05-08	IBM	109.75
2000-02-08	IBM	118.81
Time taken: 8.75 seconds, Fetched: 41 row(s)

(7) Query the rise and fall of Apple (symbol = AAPL) on each trading day in October 2008, showing rise, fall, unchange.

The operation statements are as follows:

select ymd,
case
    when price_close-price_open>0 then 'rise'
    when price_close-price_open<0 then 'fall'
    else 'unchanged'
end as situation
from stocks
where symbol='AAPL' and substring(ymd,0,7)='2008-10';

The output is as follows (fold part output):

2008-10-31	rise
2008-10-30	rise
...
2008-10-02	fall
2008-10-01	fall
Time taken: 0.1 seconds, Fetched: 23 row(s)

(8) Query the exchange, the symbol, the date, the closing price, the opening price and the difference between the closing price and the opening price of the record with the closing price higher than the opening price.

The operation statements are as follows:

select `exchange`,`symbol`,`ymd`,price_close,price_open,price_close-price_open as `diff`
from
(
    select *
    from stocks
    order by price_close-price_open desc
    limit 1
)t;

The output is as follows:

NASDAQ	INFY	2000-02-11	670.06	534.5	135.56
Time taken: 4.476 seconds, Fetched: 1 row(s)

9) Query the annual adjusted closing price (price_adj_close) of Apple (symbol=AAPL) from the stocks table for years with an average adjusted closing price greater than $50 and for years with an average adjusted closing price.

The operation statements are as follows:

select
    year(ymd) as `year`,
    avg(price_adj_close) as avg_price from stocks
where `exchange`='NASDAQ' and symbol='AAPL'
group by year(ymd)
having avg_price > 50;

The output is as follows:

2006	70.81063753105255
2007	128.27390423049016
2008	141.9790115054888
2009	146.81412711976066
2010	204.72159912109376
Time taken: 2.347 seconds, Fetched: 5 row(s)

(10) Query the stock codes of the top three companies in the annual average adjusted closing price (price_adj_close) and the annual average adjusted closing price.

The operation statements are as follows:

select t2.`year`,symbol,t2.avg_price
from
(
    select
        *,row_number() over(partition by t1.`year` order by t1.avg_price desc) as `rank`
    from
    (
        select
            year(ymd) as `year`,
            symbol,
            avg(price_adj_close) as avg_price
        from stocks
        group by year(ymd),symbol
    )t1
)t2
where t2.`rank`<=3;

The output is as follows (fold part output):

NULL	stock_symbol	NULL
1962	IBM	2.0072222134423634
1962	GE	0.16876984293025638
...
2009	GTC	174.11607115609306
2010	ISRG	319.75360107421875
2010	AMEN	313.875
2010	GTC	214.36719848632814
Time taken: 7.715 seconds, Fetched: 140 row(s)

V. Summary

Posted by abcd1234 on Thu, 25 Nov 2021 09:12:34 -0800