Sometimes, there is nothing in the dimension table except the business primary key. For example, in this sales order example, the order dimension table has no other attributes except the order number, which is the primary key of the transaction table, and this dimension is the degenerate dimension. Primary keys in business systems are usually not allowed to be modified. Sales orders can only be added, the existing order number can not be modified, and the order record can not be deleted. Therefore, the order dimension table will not have the issue of historical data versions. Degenerate dimensions are common in transaction and cumulative snapshot fact tables.
Each row record in the sales order fact table includes the order number proxy key as the degeneration dimension. In the operational system, the sales order form is the most detailed transaction table, and the order number is the main key of the order form. Each order can be positioned by the order number. Other attributes of the order, such as customers, products, etc., all depend on the order number. That is to say, the order number associates the tables related to the order attributes. However, in the dimension model, the order number proxy key in the fact table is usually not associated with other tables of order attributes. The attributes of all concerns in the order fact table can be categorized into different dimensions, such as the order date related to the date dimension, the customer related to the customer dimension, and so on. The main reason for retaining the order number in the fact table is that it is used to connect the data warehouse with the operational system. It can also act as the primary key of the fact table. In some cases, one or two attributes may still belong to the order rather than to other dimensions. Of course, the order dimension is no longer a degenerate dimension.
Degraded dimensions are usually retained as identifiers for operational transactions. In fact, the order number can be added to the fact table as an attribute. In this way, the order dimension has no data needed by the data warehouse, and then the order dimension can be degraded. Relevant data of degraded dimensions need to be migrated to fact tables, and then degraded dimensions need to be deleted.
Note that control numbers in operational transactions, such as order number, invoice number, bill of lading number, etc., usually generate empty dimensions and are represented as degenerate dimensions in the transaction fact table.
Degraded Order Dimension
Dimensional degradation techniques are used to identify data and analyze data columns that have never been used. For example, the order_number column of the order dimension might be such a column. But if the user wants to see the details of the transaction, he also needs the order number. Therefore, before degrading the order dimension, migrate the order number to the sales_order_fact facts fact table. Figure 1 shows the modified pattern.Figure 1
The following four-step degenerate order_dim dimension table is executed sequentially:
- Add order_number column to sales_order_fact table
- Migrate the order number in the order_dim table to the sales_order_fact table
- Delete the order_sk column in the sales_order_fact table
- Delete order_dim table
HAWQ does not provide UPDATE functionality, so to update the order number of existing data, only reload all data. In this case, the value of the proxy key in the order number dimension table is the same as that of the order number business primary key. In fact, you can simply rename the order_sk field of the fact table to order_number. But this is only a special case. Often the values of proxy keys and business primary keys are different, so data is still regenerated in a standard way.set search_path=tds; alter table sales_order_fact rename to sales_order_fact_old; create table sales_order_fact as select t2.order_number, t1.customer_sk, t1.product_sk, t1.order_date_sk, t1.year_month, t1.order_amount, t1.order_quantity, t1.request_delivery_date_sk from sales_order_fact_old t1 inner join order_dim t2 on t1.order_sk = t2.order_sk; comment on table sales_order_fact is 'Sales Order Fact Sheet'; comment on column sales_order_fact.order_number is 'Order number'; comment on column sales_order_fact.customer_sk is 'Customer dimension proxy key'; comment on column sales_order_fact.product_sk is 'Product dimension agent key'; comment on column sales_order_fact.order_date_sk is 'Date dimension proxy key'; comment on column sales_order_fact.year_month is 'Annual and Monthly Zoning Key'; comment on column sales_order_fact.order_amount is 'sales amount'; comment on column sales_order_fact.order_quantity is 'Number'; comment on column sales_order_fact.request_delivery_date_sk is 'Request Delivery Date Agent Key'; drop table sales_order_fact_old; drop table order_dim;
Modifying Periodic Data Loading Function
Another thing to do after degrading a dimension is to modify the periodic data loading function. The order number needs to be added to the sales order fact table instead of importing the order dimension. The modified function is as follows.The following two modifications have been made to the function:create or replace function fn_regular_load () returns void as $$ declare -- Set up scd Effective time v_cur_date date := current_date; v_pre_date date := current_date - 1; v_last_load date; begin -- Analysis of external tables analyze ext.customer; analyze ext.product; analyze ext.sales_order; -- Loading external table data into the original table truncate table rds.customer; truncate table rds.product; insert into rds.customer select * from ext.customer; insert into rds.product select * from ext.product; insert into rds.sales_order select order_number, customer_number, product_code, order_date, entry_date, order_amount, order_quantity, request_delivery_date from ext.sales_order; -- Analysis rds Table of patterns analyze rds.customer; analyze rds.product; analyze rds.sales_order; -- Set up cdc Upper Limit Time select last_load into v_last_load from rds.cdc_time; truncate table rds.cdc_time; insert into rds.cdc_time select v_last_load, v_cur_date; -- Loading customer dimension insert into tds.customer_dim (customer_number, customer_name, customer_street_address, customer_zip_code, customer_city, customer_state, shipping_address, shipping_zip_code, shipping_city, shipping_state, isdelete, version, effective_date) select case flag when 'D' then a_customer_number else b_customer_number end customer_number, case flag when 'D' then a_customer_name else b_customer_name end customer_name, case flag when 'D' then a_customer_street_address else b_customer_street_address end customer_street_address, case flag when 'D' then a_customer_zip_code else b_customer_zip_code end customer_zip_code, case flag when 'D' then a_customer_city else b_customer_city end customer_city, case flag when 'D' then a_customer_state else b_customer_state end customer_state, case flag when 'D' then a_shipping_address else b_shipping_address end shipping_address, case flag when 'D' then a_shipping_zip_code else b_shipping_zip_code end shipping_zip_code, case flag when 'D' then a_shipping_city else b_shipping_city end shipping_city, case flag when 'D' then a_shipping_state else b_shipping_state end shipping_state, case flag when 'D' then true else false end isdelete, case flag when 'D' then a_version when 'I' then 1 else a_version + 1 end v, v_pre_date from (select a.customer_number a_customer_number, a.customer_name a_customer_name, a.customer_street_address a_customer_street_address, a.customer_zip_code a_customer_zip_code, a.customer_city a_customer_city, a.customer_state a_customer_state, a.shipping_address a_shipping_address, a.shipping_zip_code a_shipping_zip_code, a.shipping_city a_shipping_city, a.shipping_state a_shipping_state, a.version a_version, b.customer_number b_customer_number, b.customer_name b_customer_name, b.customer_street_address b_customer_street_address, b.customer_zip_code b_customer_zip_code, b.customer_city b_customer_city, b.customer_state b_customer_state, b.shipping_address b_shipping_address, b.shipping_zip_code b_shipping_zip_code, b.shipping_city b_shipping_city, b.shipping_state b_shipping_state, case when a.customer_number is null then 'I' when b.customer_number is null then 'D' else 'U' end flag from v_customer_dim_latest a full join rds.customer b on a.customer_number = b.customer_number where a.customer_number is null -- Newly added or b.customer_number is null -- delete or (a.customer_number = b.customer_number and not (coalesce(a.customer_name,'') = coalesce(b.customer_name,'') and coalesce(a.customer_street_address,'') = coalesce(b.customer_street_address,'') and coalesce(a.customer_zip_code,0) = coalesce(b.customer_zip_code,0) and coalesce(a.customer_city,'') = coalesce(b.customer_city,'') and coalesce(a.customer_state,'') = coalesce(b.customer_state,'') and coalesce(a.shipping_address,'') = coalesce(b.shipping_address,'') and coalesce(a.shipping_zip_code,0) = coalesce(b.shipping_zip_code,0) and coalesce(a.shipping_city,'') = coalesce(b.shipping_city,'') and coalesce(a.shipping_state,'') = coalesce(b.shipping_state,'') ))) t order by coalesce(a_customer_number, 999999999999), b_customer_number limit 999999999999; -- heavy load PA Customer dimension truncate table pa_customer_dim; insert into pa_customer_dim select customer_sk, customer_number, customer_name, customer_street_address, customer_zip_code, customer_city, customer_state, isdelete, version, effective_date, shipping_address, shipping_zip_code, shipping_city, shipping_state from customer_dim where customer_state = 'pa'; -- Loading product dimension insert into tds.product_dim (product_code, product_name, product_category, isdelete, version, effective_date) select case flag when 'D' then a_product_code else b_product_code end product_code, case flag when 'D' then a_product_name else b_product_name end product_name, case flag when 'D' then a_product_category else b_product_category end product_category, case flag when 'D' then true else false end isdelete, case flag when 'D' then a_version when 'I' then 1 else a_version + 1 end v, v_pre_date from (select a.product_code a_product_code, a.product_name a_product_name, a.product_category a_product_category, a.version a_version, b.product_code b_product_code, b.product_name b_product_name, b.product_category b_product_category, case when a.product_code is null then 'I' when b.product_code is null then 'D' else 'U' end flag from v_product_dim_latest a full join rds.product b on a.product_code = b.product_code where a.product_code is null -- Newly added or b.product_code is null -- delete or (a.product_code = b.product_code and not (a.product_name = b.product_name and a.product_category = b.product_category))) t order by coalesce(a_product_code, 999999999999), b_product_code limit 999999999999; -- Loading Sales Order Fact Sheet insert into sales_order_fact select a.order_number, customer_sk, product_sk, e.date_sk, e.year * 100 + e.month, order_amount, order_quantity, f.date_sk from rds.sales_order a, v_customer_dim_his c, v_product_dim_his d, date_dim e, date_dim f, rds.cdc_time g where a.customer_number = c.customer_number and a.order_date >= c.effective_date and a.order_date < c.expiry_date and a.product_code = d.product_code and a.order_date >= d.effective_date and a.order_date < d.expiry_date and date(a.order_date) = e.date and date(a.request_delivery_date) = f.date and a.entry_date >= g.last_load and a.entry_date < g.current_load; -- Analysis tds Table of patterns analyze customer_dim; analyze product_dim; analyze sales_order_fact; -- Update timestamp table last_load field truncate table rds.cdc_time; insert into rds.cdc_time select v_cur_date, v_cur_date; end; $$ language plpgsql;
- Remove statements that load and analyze order_dim dimension tables.
- The order_number field bytes in the fact table are obtained from the rds.sales_order table.
III. Testing
1. Preparing test data
Execute the following SQL script to add two sales order records to the source library.The above statement generates two sales orders on the source library on May 25, 2017. To ensure that the order number is the same as the order time sequence, note the assignment of the @order_date variable.use source; set @start_date := unix_timestamp('2017-05-25'); set @end_date := unix_timestamp('2017-05-25 12:00:00'); set @order_date := from_unixtime(@start_date + rand() * (@end_date - @start_date)); set @request_delivery_date := from_unixtime(unix_timestamp(current_date+5) + rand() * 86400); set @amount := floor(1000 + rand() * 9000); set @quantity := floor(10 + rand() * 90); insert into sales_order values (null,1,1,@order_date,@request_delivery_date,@order_date,@amount,@quantity); set @start_date := unix_timestamp('2017-05-25 12:00:01'); set @end_date := unix_timestamp('2017-05-26'); set @order_date := from_unixtime(@start_date + rand() * (@end_date - @start_date)); set @request_delivery_date := from_unixtime(unix_timestamp(current_date+5) + rand() * 86400); set @amount := floor(1000 + rand() * 9000); set @quantity := floor(10 + rand() * 90); insert into sales_order values (null,1,1,@order_date,@request_delivery_date,@order_date,@amount,@quantity); commit ;
2. Execute periodic load functions and view results
After the script is executed successfully, query the sales_order_fact s table to verify that the two new orders are loaded correctly.~/regular_etl.sh
The query results are shown in Figure 2. You can see that the order numbers of the two new records are loaded correctly.select a.order_number, customer_name, product_name, e.date, f.date, order_amount amount, order_quantity quantity from sales_order_fact a, customer_dim b, product_dim c, date_dim e, date_dim f where a.customer_sk = b.customer_sk and a.product_sk = c.product_sk and a.order_date_sk = e.date_sk and a.request_delivery_date_sk = f.date_sk order by order_number desc limit 5;
Figure 2