HAWQ Data Warehouse Practice (9) - Degraded Dimensions of Dimension Table Technology

Keywords: snapshot Attribute SQL

Degraded dimension technology reduces the number of dimensions and simplifies the dimension data warehouse model. Simple schemas are easier to understand than complex ones and have better query performance.
Sometimes, there is nothing in the dimension table except the business primary key. For example, in this sales order example, the order dimension table has no other attributes except the order number, which is the primary key of the transaction table, and this dimension is the degenerate dimension. Primary keys in business systems are usually not allowed to be modified. Sales orders can only be added, the existing order number can not be modified, and the order record can not be deleted. Therefore, the order dimension table will not have the issue of historical data versions. Degenerate dimensions are common in transaction and cumulative snapshot fact tables.
Each row record in the sales order fact table includes the order number proxy key as the degeneration dimension. In the operational system, the sales order form is the most detailed transaction table, and the order number is the main key of the order form. Each order can be positioned by the order number. Other attributes of the order, such as customers, products, etc., all depend on the order number. That is to say, the order number associates the tables related to the order attributes. However, in the dimension model, the order number proxy key in the fact table is usually not associated with other tables of order attributes. The attributes of all concerns in the order fact table can be categorized into different dimensions, such as the order date related to the date dimension, the customer related to the customer dimension, and so on. The main reason for retaining the order number in the fact table is that it is used to connect the data warehouse with the operational system. It can also act as the primary key of the fact table. In some cases, one or two attributes may still belong to the order rather than to other dimensions. Of course, the order dimension is no longer a degenerate dimension.
Degraded dimensions are usually retained as identifiers for operational transactions. In fact, the order number can be added to the fact table as an attribute. In this way, the order dimension has no data needed by the data warehouse, and then the order dimension can be degraded. Relevant data of degraded dimensions need to be migrated to fact tables, and then degraded dimensions need to be deleted.
Note that control numbers in operational transactions, such as order number, invoice number, bill of lading number, etc., usually generate empty dimensions and are represented as degenerate dimensions in the transaction fact table.

Degraded Order Dimension

Dimensional degradation techniques are used to identify data and analyze data columns that have never been used. For example, the order_number column of the order dimension might be such a column. But if the user wants to see the details of the transaction, he also needs the order number. Therefore, before degrading the order dimension, migrate the order number to the sales_order_fact facts fact table. Figure 1 shows the modified pattern.
Figure 1

The following four-step degenerate order_dim dimension table is executed sequentially:
  • Add order_number column to sales_order_fact table
  • Migrate the order number in the order_dim table to the sales_order_fact table
  • Delete the order_sk column in the sales_order_fact table
  • Delete order_dim table
The following script completes the steps required for all degraded order dimensions.
set search_path=tds;

alter table sales_order_fact rename to sales_order_fact_old; 
create table sales_order_fact as
select t2.order_number,
       t1.customer_sk,
       t1.product_sk,
       t1.order_date_sk,
       t1.year_month,
       t1.order_amount,
       t1.order_quantity,
       t1.request_delivery_date_sk  
  from sales_order_fact_old t1 inner join order_dim t2 on t1.order_sk = t2.order_sk;

comment on table sales_order_fact is 'Sales Order Fact Sheet';  
comment on column sales_order_fact.order_number is 'Order number';  
comment on column sales_order_fact.customer_sk is 'Customer dimension proxy key';  
comment on column sales_order_fact.product_sk is 'Product dimension agent key';  
comment on column sales_order_fact.order_date_sk is 'Date dimension proxy key';  
comment on column sales_order_fact.year_month is 'Annual and Monthly Zoning Key';  
comment on column sales_order_fact.order_amount is 'sales amount';  
comment on column sales_order_fact.order_quantity is 'Number';  
comment on column sales_order_fact.request_delivery_date_sk is 'Request Delivery Date Agent Key';  

drop table sales_order_fact_old;  
drop table order_dim;
HAWQ does not provide UPDATE functionality, so to update the order number of existing data, only reload all data. In this case, the value of the proxy key in the order number dimension table is the same as that of the order number business primary key. In fact, you can simply rename the order_sk field of the fact table to order_number. But this is only a special case. Often the values of proxy keys and business primary keys are different, so data is still regenerated in a standard way.

Modifying Periodic Data Loading Function

Another thing to do after degrading a dimension is to modify the periodic data loading function. The order number needs to be added to the sales order fact table instead of importing the order dimension. The modified function is as follows.
create or replace function fn_regular_load ()          
returns void as          
$$          
declare          
    -- Set up scd Effective time        
    v_cur_date date := current_date;            
    v_pre_date date := current_date - 1;        
    v_last_load date;        
begin        
    -- Analysis of external tables        
    analyze ext.customer;        
    analyze ext.product;        
    analyze ext.sales_order;        
        
    -- Loading external table data into the original table        
    truncate table rds.customer;          
    truncate table rds.product;         
        
    insert into rds.customer select * from ext.customer;         
    insert into rds.product select * from ext.product;        
    insert into rds.sales_order   
    select order_number,  
           customer_number,  
           product_code,  
           order_date,  
           entry_date,  
           order_amount,  
           order_quantity,  
           request_delivery_date   
      from ext.sales_order;        
            
    -- Analysis rds Table of patterns        
    analyze rds.customer;        
    analyze rds.product;        
    analyze rds.sales_order;        
        
    -- Set up cdc Upper Limit Time        
    select last_load into v_last_load from rds.cdc_time;        
    truncate table rds.cdc_time;        
    insert into rds.cdc_time select v_last_load, v_cur_date;        
        
    -- Loading customer dimension        
    insert into tds.customer_dim        
    (customer_number,        
     customer_name,        
     customer_street_address,        
     customer_zip_code,        
     customer_city,        
     customer_state,      
     shipping_address,       
     shipping_zip_code,       
     shipping_city,       
     shipping_state,        
     isdelete,        
     version,        
     effective_date)        
    select case flag         
                when 'D' then a_customer_number        
                else b_customer_number        
            end customer_number,        
           case flag         
                when 'D' then a_customer_name        
                else b_customer_name        
            end customer_name,        
           case flag         
                when 'D' then a_customer_street_address        
                else b_customer_street_address        
            end customer_street_address,        
           case flag         
                when 'D' then a_customer_zip_code        
                else b_customer_zip_code        
            end customer_zip_code,        
           case flag         
                when 'D' then a_customer_city        
                else b_customer_city        
            end customer_city,        
           case flag         
                when 'D' then a_customer_state        
                else b_customer_state        
            end customer_state,        
           case flag         
                when 'D' then a_shipping_address        
                else b_shipping_address        
            end shipping_address,      
           case flag         
                when 'D' then a_shipping_zip_code        
                else b_shipping_zip_code        
            end shipping_zip_code,        
           case flag         
                when 'D' then a_shipping_city        
                else b_shipping_city        
            end shipping_city,        
           case flag         
                when 'D' then a_shipping_state        
                else b_shipping_state        
            end shipping_state,      
           case flag         
                when 'D' then true        
                else false        
            end isdelete,        
           case flag         
                when 'D' then a_version        
                when 'I' then 1        
                else a_version + 1        
            end v,        
           v_pre_date        
      from (select a.customer_number a_customer_number,        
                   a.customer_name a_customer_name,        
                   a.customer_street_address a_customer_street_address,        
                   a.customer_zip_code a_customer_zip_code,        
                   a.customer_city a_customer_city,        
                   a.customer_state a_customer_state,       
                   a.shipping_address a_shipping_address,        
                   a.shipping_zip_code a_shipping_zip_code,        
                   a.shipping_city a_shipping_city,        
                   a.shipping_state a_shipping_state,       
                   a.version a_version,        
                   b.customer_number b_customer_number,        
                   b.customer_name b_customer_name,        
                   b.customer_street_address b_customer_street_address,        
                   b.customer_zip_code b_customer_zip_code,        
                   b.customer_city b_customer_city,        
                   b.customer_state b_customer_state,       
                   b.shipping_address b_shipping_address,        
                   b.shipping_zip_code b_shipping_zip_code,        
                   b.shipping_city b_shipping_city,        
                   b.shipping_state b_shipping_state,       
                   case when a.customer_number is null then 'I'        
                        when b.customer_number is null then 'D'        
                        else 'U'         
                    end flag        
              from v_customer_dim_latest a         
              full join rds.customer b on a.customer_number = b.customer_number         
             where a.customer_number is null -- Newly added        
                or b.customer_number is null -- delete        
                or (a.customer_number = b.customer_number         
                    and not         
                           (coalesce(a.customer_name,'') = coalesce(b.customer_name,'')         
                        and coalesce(a.customer_street_address,'') = coalesce(b.customer_street_address,'')         
                        and coalesce(a.customer_zip_code,0) = coalesce(b.customer_zip_code,0)        
                        and coalesce(a.customer_city,'') = coalesce(b.customer_city,'')         
                        and coalesce(a.customer_state,'') = coalesce(b.customer_state,'')      
                        and coalesce(a.shipping_address,'') = coalesce(b.shipping_address,'')         
                        and coalesce(a.shipping_zip_code,0) = coalesce(b.shipping_zip_code,0)        
                        and coalesce(a.shipping_city,'') = coalesce(b.shipping_city,'')         
                        and coalesce(a.shipping_state,'') = coalesce(b.shipping_state,'')      
                        ))) t        
             order by coalesce(a_customer_number, 999999999999), b_customer_number limit 999999999999;        
     
    -- heavy load PA Customer dimension      
    truncate table pa_customer_dim;        
    insert into pa_customer_dim        
    select customer_sk,       
           customer_number,          
           customer_name,          
           customer_street_address,          
           customer_zip_code,          
           customer_city,          
           customer_state,        
           isdelete,       
           version,        
           effective_date,        
           shipping_address,     
           shipping_zip_code,    
           shipping_city,    
           shipping_state       
      from customer_dim        
     where customer_state = 'pa';       
    
    -- Loading product dimension        
    insert into tds.product_dim        
    (product_code,        
     product_name,        
     product_category,             
     isdelete,        
     version,        
     effective_date)        
    select case flag         
                when 'D' then a_product_code        
                else b_product_code        
            end product_code,        
           case flag         
                when 'D' then a_product_name        
                else b_product_name        
            end product_name,        
           case flag         
                when 'D' then a_product_category        
                else b_product_category        
            end product_category,        
           case flag         
                when 'D' then true        
                else false        
            end isdelete,        
           case flag         
                when 'D' then a_version        
                when 'I' then 1        
                else a_version + 1        
            end v,        
           v_pre_date        
      from (select a.product_code a_product_code,        
                   a.product_name a_product_name,        
                   a.product_category a_product_category,        
                   a.version a_version,        
                   b.product_code b_product_code,        
                   b.product_name b_product_name,        
                   b.product_category b_product_category,                       
                   case when a.product_code is null then 'I'        
                        when b.product_code is null then 'D'        
                        else 'U'         
                    end flag        
              from v_product_dim_latest a         
              full join rds.product b on a.product_code = b.product_code         
             where a.product_code is null -- Newly added        
                or b.product_code is null -- delete        
                or (a.product_code = b.product_code         
                    and not         
                           (a.product_name = b.product_name         
                        and a.product_category = b.product_category))) t        
             order by coalesce(a_product_code, 999999999999), b_product_code limit 999999999999;        

    -- Loading Sales Order Fact Sheet          
    insert into sales_order_fact          
    select a.order_number,          
           customer_sk,          
           product_sk,          
           e.date_sk,        
           e.year * 100 + e.month,             
           order_amount,      
           order_quantity,  
           f.date_sk             
      from rds.sales_order a,         
           v_customer_dim_his c,          
           v_product_dim_his d,          
           date_dim e,   
           date_dim f,           
           rds.cdc_time g          
     where a.customer_number = c.customer_number          
       and a.order_date >= c.effective_date        
       and a.order_date < c.expiry_date           
       and a.product_code = d.product_code          
       and a.order_date >= d.effective_date        
       and a.order_date < d.expiry_date           
       and date(a.order_date) = e.date    
       and date(a.request_delivery_date) = f.date    
       and a.entry_date >= g.last_load and a.entry_date < g.current_load;                      
        
    -- Analysis tds Table of patterns        
    analyze customer_dim;        
    analyze product_dim;        
    analyze sales_order_fact;        
        
    -- Update timestamp table last_load field          
    truncate table rds.cdc_time;        
    insert into rds.cdc_time select v_cur_date, v_cur_date;        
        
end;          
$$          
language plpgsql;
The following two modifications have been made to the function:
  • Remove statements that load and analyze order_dim dimension tables.
  • The order_number field bytes in the fact table are obtained from the rds.sales_order table.

III. Testing


1. Preparing test data

Execute the following SQL script to add two sales order records to the source library.
use source;   

set @start_date := unix_timestamp('2017-05-25');    
set @end_date := unix_timestamp('2017-05-25 12:00:00');    
set @order_date := from_unixtime(@start_date + rand() * (@end_date - @start_date)); 
set @request_delivery_date := from_unixtime(unix_timestamp(current_date+5) + rand() * 86400);      
set @amount := floor(1000 + rand() * 9000);   
set @quantity := floor(10 + rand() * 90);      

insert into sales_order values (null,1,1,@order_date,@request_delivery_date,@order_date,@amount,@quantity);          
 
set @start_date := unix_timestamp('2017-05-25 12:00:01');    
set @end_date := unix_timestamp('2017-05-26');    
set @order_date := from_unixtime(@start_date + rand() * (@end_date - @start_date)); 
set @request_delivery_date := from_unixtime(unix_timestamp(current_date+5) + rand() * 86400);   
set @amount := floor(1000 + rand() * 9000);   
set @quantity := floor(10 + rand() * 90);      

insert into sales_order values (null,1,1,@order_date,@request_delivery_date,@order_date,@amount,@quantity);          
  
commit ;
The above statement generates two sales orders on the source library on May 25, 2017. To ensure that the order number is the same as the order time sequence, note the assignment of the @order_date variable.

2. Execute periodic load functions and view results

~/regular_etl.sh
After the script is executed successfully, query the sales_order_fact s table to verify that the two new orders are loaded correctly.
select a.order_number,   
       customer_name,   
       product_name,   
       e.date,
       f.date,	   
       order_amount amount,
       order_quantity quantity	   
  from sales_order_fact a,
       customer_dim b,  
       product_dim c,   
       date_dim e,
       date_dim f	   
 where a.customer_sk = b.customer_sk 
   and a.product_sk = c.product_sk    
   and a.order_date_sk = e.date_sk 
   and a.request_delivery_date_sk = f.date_sk   
 order by order_number desc
 limit 5;
The query results are shown in Figure 2. You can see that the order numbers of the two new records are loaded correctly.
Figure 2

Posted by Das Capitolin on Thu, 27 Jun 2019 18:56:54 -0700