HAWQ Replacing Traditional Number Warehouse Practice (17) - Accumulative Measurement of Fact Table Technology

Keywords: snapshot SQL shell

Cumulative Measures refer to aggregating data from the first element in the sequence to the current element, such as statistics of cumulative sales from January to the current month of each year. This article explains how to realize the cumulative monthly sales volume and amount in the sales order example, and makes corresponding modifications to the data warehouse mode, initial loading and periodic loading. The cumulative measure is semi-additive, and its initial loading is more complex.

1. Establishment of cumulative measurement fact sheet

Execute the following script to create the month_end_balance_fact fact table to store the monthly cumulative value of the sales order amount and quantity.
set search_path=tds;
create table month_end_balance_fact (    
    year_month int,    
    product_sk int,    
    month_end_amount_balance numeric(10,2),    
    month_end_quantity_balance int ); 

comment on table month_end_balance_fact is 'Cumulative Measure Fact Table';   
comment on column month_end_balance_fact.year_month is 'Years';   
comment on column month_end_balance_fact.product_sk is 'Product Agent Key';        
comment on column month_end_balance_fact.month_end_amount_balance is 'Cumulative amount';        
comment on column month_end_balance_fact.month_end_quantity_balance is 'Cumulative Quantity';   

II. Initial Loading

Now load the data in the month_end_sales_order_fact table into the month_end_balance_facttable, and the script that initially loads the month_end_balance_facttable is shown below. This script loads cumulative monthly sales order summary data, which accumulates from January to that month every year, and does not span years.
insert into month_end_balance_fact   
select a.year_month,
       b.product_sk,
       sum(b.month_order_amount) month_order_amount,    
       sum(b.month_order_quantity) month_order_quantity 
  from (select distinct 
               year_month, 
               year_month/100 year1, 
               year_month - year_month/100*100 month1
          from month_end_sales_order_fact) a,
       (select *,
               year_month/100 year1, 
               year_month - year_month/100*100 month1,
               max(year_month) over () max_year_month  
          from month_end_sales_order_fact) b
 where a.year_month <= b.max_year_month
   and a.year1 = b.year1 and b.month1 <= a.month1
 group by a.year_month, b.product_sk;
Subqueries retrieve the data of the month_end_sales_order_fact table and its year-month and maximum-month proxy keys. The outer query aggregates the cumulative sales data from January to that month, and the a. year_month <= B. max_year_month condition is used to limit the statistics to the maximum month in existence.
To confirm that the initial loading is correct, after executing the initial loading script, query the month_end_sales_order_fact s and month_end_balance_fact s tables, respectively.
Query cycle snapshots:
select year_month,  
       product_sk psk,  
       month_order_amount amt,  
       month_order_quantity qty  
  from month_end_sales_order_fact 
 order by year_month, psk;
The query structure is shown in Figure 1.
Figure 1

Query cumulative measures:
select year_month,
       product_sk psk,  
       month_end_amount_balance amt,  
       month_end_quantity_balance qty  
  from month_end_balance_fact
 order by year_month, psk;
The query structure is shown in Figure 2.
Figure 2

As can be seen, the sales amount of goods in March 2016 was accumulated to April 2016, and the sales amount of goods in March and April 2016 was accumulated to May 2016, etc. This kind of cumulative measure is sometimes called sliding order problem.

III. Periodic Loading

The month_balance_sum.sql script shown below is used to periodically load sales order cumulative metrics, which are executed once a month to load last month's data. The script can be executed after the monthly cycle snapshot table is loaded regularly. Note that the fact that you are late has a significant impact on cumulative measures. For example, the January 2016 data did not enter the data warehouse until January 2017, so the cumulative measures for each month after February 2016 would change.
insert into month_end_balance_fact    
select year_month,    
         product_sk,    
         sum(month_order_amount),    
         sum(month_order_quantity)    
  from (select *    
          from month_end_sales_order_fact 
         where year_month = :v_year_month  
       union all    
        select :v_year_month,  
               product_sk product_sk,  
               month_end_amount_balance month_order_amount,  
               month_end_quantity_balance month_order_quantity   
          from month_end_balance_fact    
         where year_month in 
(select max(case when :v_year_month - :v_year_month/100*100 = 1 then 0 else year_month end)    
   from month_end_balance_fact)) t  
 group by year_month, product_sk;
Subqueries combine cumulative metrics and monthly cycle snapshot tables to increase the cumulative data of last month. The outermost query executes grouping aggregation of sales data by month and product. The innermost case statement is used to revert to zero and accumulate every January. v_year_month is the year-month parameter.

IV. Testing

Implement the monthly cycle snapshot function to load June 2017 data.
select fn_month_sum(201706);
Execute cumulative metrics to load scripts periodically, and pass the output of shell command `date +%Y%m'into the month_balance_sum.sql file as the year and month parameters.
su - gpadmin -c 'export PGPASSWORD=123456;psql -U dwtest -d dw -h hdp3 -v v_year_month=''`date +%Y%m`'' -f ~/month_balance_sum.sql'
Execute the same query as the previous initial load, and the query results of the periodic snapshot table and the cumulative metric table are shown in Figures 3 and 4, respectively.
Figure 3

Figure 4


As you can see, the amount and quantity of goods sold in May 2017 were accumulated to June 2017. Product 1, 2 and 5 accumulated sales data for 5 and 6 months. Product 3 and 4 did not sell in June, so sales data for May was extended to June.

V. Query

Digital measures in fact tables can be divided into three categories: additive, semi-additive and non-additive. Additivity metrics can be aggregated according to any dimension associated with the fact table, that is to say, the sum of the metrics aggregated according to any dimension is the same, and most of the metrics in the fact table belong to this category. Semi-additive measures can aggregate some dimensions, but not all. Balance is a common semi-additive measure, which can be added across all dimensions except the time dimension. There are other measures that are totally unacceptable, such as proportions. For non-additive measures, it is better to store as many additive components as possible, such as molecule and denominator of proportion, and aggregate these components into the final result set. The calculation of non-additive measures usually takes place in BI or OLAP layers.
Cumulative measures must be used carefully because they are "semi-additive". A semi-additive measure is not additive in some dimensions (usually time dimensions). For example, the accumulative sales amount at the end of the month can be accumulatively accumulated by the product.  
dw=> select year_month, sum(month_end_amount_balance) s  
dw->   from month_end_balance_fact    
dw->  group by year_month  
dw->  order by year_month; 
 year_month |     s     
------------+-----------
     201603 | 191158.00
     201604 | 345600.00
     201605 | 455772.00
     201606 | 572190.00
     201705 | 253400.00
     201706 | 294876.00
(6 rows)
The cumulative end-of-month amount by month:
dw=> select product_name, sum(month_end_amount_balance) s  
dw->   from month_end_balance_fact a,  
dw->        product_dim b  
dw->  where a.product_sk = b.product_sk  
dw->  group by product_name
dw->  order by product_name;
  product_name   |     s     
-----------------+-----------
 flat panel      |  99332.00
 floppy drive    | 940933.00
 hard disk drive | 941337.00
 keyboard        | 125220.00
 lcd panel       |   6174.00
(5 rows)
The above query results are incorrect. The correct result should be the same as the following query on the month_end_sales_order_fact table.
dw=> select product_name, sum(month_order_amount) s  
dw->   from month_end_sales_order_fact a,  
dw->        product_dim b  
dw->  where a.product_sk = b.product_sk  
dw->  group by product_name
dw->  order by product_name; 
  product_name   |     s     
-----------------+-----------
 flat panel      |  49666.00
 floppy drive    | 362393.00
 hard disk drive | 384533.00
 keyboard        |  67387.00
 lcd panel       |   3087.00
(5 rows)

Posted by carleyvibe on Sun, 23 Jun 2019 16:31:35 -0700