Construction of 3 data warehouses in the actual combat of shangsilicon Valley data warehouse

Keywords: Big Data Data Warehouse

@

Warehouse notes

Detailed explanation of data warehouse and data mart: ODS, DW, DWD, DWM, DWS, ADS

Project requirements and architecture design of shangsilicon Valley data warehouse

Shang Silicon Valley data warehouse: 2 data warehouse layering + dimension modeling

Construction of 3 data warehouses in the actual combat of shangsilicon Valley data warehouse

Shangsi Valley data warehouse 4.0 video tutorial

Direct from station B: 2021 new version of e-commerce data warehouse V4.0 - big data warehouse project practice
Baidu online disk: https://pan.baidu.com/s/1FGUb8X0Wx7IWAmKXBRwVFg , extraction code: yyds
Alicloud disk: https://www.aliyundrive.com/s/F2FuMVePj92 , extraction code: 335o

Chapter 4 data warehouse construction ODS floor

1) Keep the original appearance of the data without any modification, and play the role of backing up the data.

2) LZO compression is adopted for data to reduce disk storage space. 100G data can be compressed to less than 10G.

3) Create a partition table to prevent subsequent full table scanning, and use a large number of partition tables in enterprise development.

4) Create an external table. In enterprise development, in addition to creating internal tables for their own temporary tables, the vast majority of scenarios are to create external tables.

4.2 ODS layer (business data)

The service table partition planning of ODS layer is as follows:

The data loading idea of ODS layer business table is as follows

4.2.1 activity information table

DROP TABLE IF EXISTS ods_activity_info;

CREATE EXTERNAL TABLE ods_activity_info(

  `id` STRING COMMENT 'number',

  `activity_name` STRING  COMMENT 'Activity name',

  `activity_type` STRING  COMMENT 'activity type',

  `start_time` STRING  COMMENT 'start time',

  `end_time` STRING  COMMENT 'End time',

  `create_time` STRING  COMMENT 'Creation time'

) COMMENT 'Activity information table'

PARTITIONED BY (`dt` STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

STORED AS

 INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'

 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION '/warehouse/gmall/ods/ods_activity_info/';

Chapter 5 data warehouse construction - DIM floor

5.1 commodity dimension table (total quantity)

1. CREATE TABLE statement

DROP TABLE IF EXISTS dim_sku_info;

CREATE EXTERNAL TABLE dim_sku_info (

  `id` STRING COMMENT 'commodity id',

  `price` DECIMAL(16,2) COMMENT 'commodity price',

  `sku_name` STRING COMMENT 'Trade name',

  `sku_desc` STRING COMMENT 'Product description',

  `weight` DECIMAL(16,2) COMMENT 'weight',

  `is_sale` BOOLEAN COMMENT 'Is it on sale',

  `spu_id` STRING COMMENT 'spu number',

  `spu_name` STRING COMMENT 'spu name',

  `category3_id` STRING COMMENT 'Three level classification id',

  `category3_name` STRING COMMENT 'Three level classification name',

  `category2_id` STRING COMMENT 'Secondary classification id',

  `category2_name` STRING COMMENT 'Secondary classification name',

  `category1_id` STRING COMMENT 'Primary classification id',

  `category1_name` STRING COMMENT 'Primary classification name',

  `tm_id` STRING COMMENT 'brand id',

  `tm_name` STRING COMMENT 'Brand name',

  `sku_attr_values` ARRAY<STRUCT<attr_id:STRING,value_id:STRING,attr_name:STRING,value_name:STRING>> COMMENT 'Platform properties',

  `sku_sale_attr_values` ARRAY<STRUCT<sale_attr_id:STRING,sale_attr_value_id:STRING,sale_attr_name:STRING,sale_attr_value_name:STRING>> COMMENT 'Sales attributes',

  `create_time` STRING COMMENT 'Creation time'

) COMMENT 'Commodity dimension table'

PARTITIONED BY (`dt` STRING)

STORED AS PARQUET

LOCATION '/warehouse/gmall/dim/dim_sku_info/'

TBLPROPERTIES ("parquet.compression"="lzo");

2. Zoning planning

3. Data loading

5.6 user dimension table (zipper table)

5.6.1 zipper table overview

1) What is a zipper watch

2) Why make a zipper watch

3) How to use zipper Watch

4) Zipper watch forming process

5.6.2 making zipper table

1. CREATE TABLE statement

DROP TABLE IF EXISTS dim_user_info;

CREATE EXTERNAL TABLE dim_user_info(

  `id` STRING COMMENT 'user id',

  `login_name` STRING COMMENT 'User name',

  `nick_name` STRING COMMENT 'User nickname',

  `name` STRING COMMENT 'User name',

  `phone_num` STRING COMMENT 'phone number',

  `email` STRING COMMENT 'mailbox',

  `user_level` STRING COMMENT 'User level',

  `birthday` STRING COMMENT 'birthday',

  `gender` STRING COMMENT 'Gender',

  `create_time` STRING COMMENT 'Creation time',

  `operate_time` STRING COMMENT 'Operation time',

  `start_date` STRING COMMENT 'Start date',

  `end_date` STRING COMMENT 'End date'

) COMMENT 'User table'

PARTITIONED BY (`dt` STRING)

STORED AS PARQUET

LOCATION '/warehouse/gmall/dim/dim_user_info/'

TBLPROPERTIES ("parquet.compression"="lzo");

2. Zoning planning

3. Data loading

1) First day loading

When the zipper table is loaded on the first day, initialization is required. The specific work is to import all historical users up to the initialization day into the zipper table at one time. Current ODS_ user_ The first partition of info table, i.e. 2020-06-14 partition, is all historical users. Therefore, after certain processing, the partition data can be imported into 9999-99-99 partition of zipper table.

2) Daily loading

(1) Realization idea

Chapter 6 data warehouse construction - DWD floor

1) Analyze user behavior data.

2) Re model the business data using the dimension model.

6.1 DWD layer (user behavior log)

6.1.1 log analysis ideas

1) Log structure review

(1) Page buried point log

(2) Startup log

2) Log parsing idea

6.1.2 json_ Use of object function

There are also JSON processing functions for response in Mysql, but the performance is not hive high.

6.1.3 startup log table

Startup log analysis idea: each row of data in the startup log table corresponds to a startup record. A startup record should contain the public information and startup information in the log. First filter out all logs containing the start field, and then use get_ json_ The object function parses each field.

1) Create table statement

DROP TABLE IF EXISTS dwd_start_log;

CREATE EXTERNAL TABLE dwd_start_log(

  `area_code` STRING COMMENT 'Area code',

  `brand` STRING COMMENT 'Mobile phone brand',

  `channel` STRING COMMENT 'channel',

  `is_new` STRING COMMENT 'First start',

  `model` STRING COMMENT 'Mobile phone model',

  `mid_id` STRING COMMENT 'equipment id',

  `os` STRING COMMENT 'operating system',

  `user_id` STRING COMMENT 'member id',

  `version_code` STRING COMMENT 'app Version number',

  `entry` STRING COMMENT 'icon Phone icon notice notice install Start after installation',

  `loading_time` BIGINT COMMENT 'Start loading time',

  `open_ad_id` STRING COMMENT 'Advertising page ID ',

  `open_ad_ms` BIGINT COMMENT 'Total advertising time',

  `open_ad_skip_ms` BIGINT COMMENT 'User skipping advertisement time point',

  `ts` BIGINT COMMENT 'time'

) COMMENT 'Start log table'

PARTITIONED BY (`dt` STRING) -- Create partitions by time

STORED AS PARQUET -- use parquet Column storage

LOCATION '/warehouse/gmall/dwd/dwd_start_log' -- Specify in HDFS Upper storage location

TBLPROPERTIES('parquet.compression'='lzo') -- use LZO compress

;

6.1.4 page log table

Page log analysis idea: each row of data in the page log table corresponds to a page access record. A page access record should contain the public information and page information in the log. First filter out all the logs containing the page field, and then use get_ json_ The object function parses each field.

6.1.5 action log table

Action log analysis idea: each row of data in the action log table corresponds to an action record of the user. An action record should contain public information, page information and action information. First filter out the log containing the action field, then "blow up" the action array through the UDTF function (similar to the effect of the explode function), and then use get_ json_ The object function parses each field.

6.1.7 error log table

Error log analysis idea: each row of data in the error log table corresponds to an error record. In order to facilitate error location, an error record should contain the corresponding public information, page information, exposure information, action information, startup information and error information. First filter out the logs containing the err field, and then use get_ json_ The object function parses all fields.

6.2 DWD layer (business data)

In terms of business data, the main focus of DWD layer construction is dimension modeling. Here is just an example of order details. Please download the actual combat notes of sunsilicon valley.

6.2.2 order detail fact table (transactional fact table)

1) Create table statement

DROP TABLE IF EXISTS dwd_order_detail;

CREATE EXTERNAL TABLE dwd_order_detail (

  `id` STRING COMMENT 'Order No',

  `order_id` STRING COMMENT 'order number',

  `user_id` STRING COMMENT 'user id',

  `sku_id` STRING COMMENT 'sku commodity id',

  `province_id` STRING COMMENT 'province ID',

  `activity_id` STRING COMMENT 'activity ID',

  `activity_rule_id` STRING COMMENT 'Activity rules ID',

  `coupon_id` STRING COMMENT 'coupon ID',

  `create_time` STRING COMMENT 'Creation time',

  `source_type` STRING COMMENT 'Source type',

  `source_id` STRING COMMENT 'Source number',

  `sku_num` BIGINT COMMENT 'Quantity of goods',

  `original_amount` DECIMAL(16,2) COMMENT 'Original price',

  `split_activity_amount` DECIMAL(16,2) COMMENT 'Activity preference allocation',

  `split_coupon_amount` DECIMAL(16,2) COMMENT 'Coupon discount allocation',

  `split_final_amount` DECIMAL(16,2) COMMENT 'Final price allocation'

) COMMENT 'Order details fact sheet'

PARTITIONED BY (`dt` STRING)

STORED AS PARQUET

LOCATION '/warehouse/gmall/dwd/dwd_order_detail/'

TBLPROPERTIES ("parquet.compression"="lzo");

2) Zoning planning

3) Data loading

6.2.4 additional purchase fact table (periodic snapshot fact table, daily snapshot)

1) Create table statement

DROP TABLE IF EXISTS dwd_cart_info;
CREATE EXTERNAL TABLE dwd_cart_info(
  `id` STRING COMMENT 'number',
  `user_id` STRING COMMENT 'user ID',
  `sku_id` STRING COMMENT 'commodity ID',
  `source_type` STRING COMMENT 'Source type',
  `source_id` STRING COMMENT 'Source number',
  `cart_price` DECIMAL(16,2) COMMENT 'Price when adding to shopping cart',
  `is_ordered` STRING COMMENT 'Have you placed an order',
  `create_time` STRING COMMENT 'Creation time',
  `operate_time` STRING COMMENT 'Modification time',
  `order_time` STRING COMMENT 'Order time ',
  `sku_num` BIGINT COMMENT 'Additional purchase quantity'
) COMMENT 'Additional purchase fact sheet'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_cart_info/'
TBLPROPERTIES ("parquet.compression"="lzo");

2) Zoning planning

3) Data loading

6.2.7 payment fact table (cumulative snapshot fact table)

1) Create table statement

DROP TABLE IF EXISTS dwd_payment_info;

CREATE EXTERNAL TABLE dwd_payment_info (

  `id` STRING COMMENT 'number',

  `order_id` STRING COMMENT 'Order No',

  `user_id` STRING COMMENT 'User number',

  `province_id` STRING COMMENT 'region ID',

  `trade_no` STRING COMMENT 'Transaction number',

  `out_trade_no` STRING COMMENT 'External transaction No',

  `payment_type` STRING COMMENT 'Payment type',

  `payment_amount` DECIMAL(16,2) COMMENT 'Payment amount',

  `payment_status` STRING COMMENT 'Payment status',

  `create_time` STRING COMMENT 'Creation time',--Time of calling the third-party payment interface

  `callback_time` STRING COMMENT 'Completion time'--Payment completion time, i.e. payment success callback time

) COMMENT 'Statement of payment facts'

PARTITIONED BY (`dt` STRING)

STORED AS PARQUET

LOCATION '/warehouse/gmall/dwd/dwd_payment_info/'

TBLPROPERTIES ("parquet.compression"="lzo");

2) Zoning planning

3) Data loading

Chapter 7 data warehouse construction - DWS floor

7.2 DWS floor


7.2.1 visitor theme

1) Create table statement

DROP TABLE IF EXISTS dws_visitor_action_daycount;

CREATE EXTERNAL TABLE dws_visitor_action_daycount

(

  `mid_id` STRING COMMENT 'equipment id',

  `brand` STRING COMMENT 'Equipment brand',

  `model` STRING COMMENT 'Equipment model',

  `is_new` STRING COMMENT 'First visit',

  `channel` ARRAY<STRING> COMMENT 'channel',

  `os` ARRAY<STRING> COMMENT 'operating system',

  `area_code` ARRAY<STRING> COMMENT 'region ID',

  `version_code` ARRAY<STRING> COMMENT 'Application version',

  `visit_count` BIGINT COMMENT 'Number of visits',

  `page_stats` ARRAY<STRUCT<page_id:STRING,page_count:BIGINT,during_time:BIGINT>> COMMENT 'Page access statistics'

) COMMENT 'Daily equipment behavior table'

PARTITIONED BY(`dt` STRING)

STORED AS PARQUET

LOCATION '/warehouse/gmall/dws/dws_visitor_action_daycount'

TBLPROPERTIES ("parquet.compression"="lzo");

Chapter 8 data warehouse construction - DWT floor

Chapter 9 data warehouse construction - ADS floor

9.1 statement of construction

The ADS layer does not involve modeling, and the table creation depends on the specific requirements.

9.3 user theme

9.3.1 user statistics

The demand is user comprehensive statistics, which includes several indicators. The following is the explanation of each indicator.

index explain Corresponding field
Number of new users Count the number of new registered users new_user_count
Number of new order users Count the number of new order users new_order_user_count
Total order amount Count the total amount of all orders order_final_amount
Number of single users Count the total number of users placing orders order_user_count
Number of users without orders Count the number of active users without placing orders no_order_user_count

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_total;

CREATE EXTERNAL TABLE `ads_user_total` (

 `dt` STRING COMMENT 'Statistical date',

 `recent_days` BIGINT COMMENT 'Recent days,0:Cumulative value,1:Last 1 day,7:Last 7 days,30:Last 30 days',

 `new_user_count` BIGINT COMMENT 'Number of newly registered users',

 `new_order_user_count` BIGINT COMMENT 'Number of new order users',

 `order_final_amount` DECIMAL(16,2) COMMENT 'Total order amount',

 `order_user_count` BIGINT COMMENT 'Number of single users',

 `no_order_user_count` BIGINT COMMENT 'Number of users without orders(Specifically, it refers to the users who have not placed an order among the active users)'

) COMMENT 'User statistics'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

LOCATION '/warehouse/gmall/ads/ads_user_total/';

9.3.2 user change statistics

The demand includes two indicators, namely, the number of lost users and the number of returned users. The following is an explanation of the two indicators.

index explain Corresponding field
Number of lost users Users who have been active before but have not been active recently are called lost users. Here, it is required to count the total number of users who were active before the 7th day (only including the day before the 7th day) but were not active in the last 7 days. user_churn_count
Number of reflow users The former active users, who have not been active (lost) for some time, are active again today, which is called return users. It is required to count the total number of reflux users here. new_order_user_count

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_change;

CREATE EXTERNAL TABLE `ads_user_change` (

 `dt` STRING COMMENT 'Statistical date',

 `user_churn_count` BIGINT COMMENT 'Number of lost users',

 `user_back_count` BIGINT COMMENT 'Number of reflow users'

) COMMENT 'User change statistics'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

LOCATION '/warehouse/gmall/ads/ads_user_change/';

9.3.3 user behavior funnel analysis

Funnel analysis is a data analysis model, which can scientifically reflect the user transformation in each stage of a business process from the beginning to the end. Because it can show the links of each stage, it can be clear at a glance which stage has problems.

This demand requires statistics on the number of people in each stage of a complete shopping process.

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_action;

CREATE EXTERNAL TABLE `ads_user_action` (

 `dt` STRING COMMENT 'Statistical date',

 `recent_days` BIGINT COMMENT 'Recent days,1:Last 1 day,7:Last 7 days,30:Last 30 days',

 `home_count` BIGINT COMMENT 'Number of home page visitors',

 `good_detail_count` BIGINT COMMENT 'Number of visitors to the product details page',

 `cart_count` BIGINT COMMENT 'Number of people added to shopping cart',

 `order_count` BIGINT COMMENT 'Number of orders',

 `payment_count` BIGINT COMMENT 'Number of payers'

) COMMENT 'Funnel analysis'

ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'

LOCATION '/warehouse/gmall/ads/ads_user_action/';

9.3.4 user retention rate

Retention analysis generally includes new retention and active retention analysis.

New retention analysis is to analyze how many new users have subsequent active behaviors on a certain day. Active retention analysis is to analyze how many active users have subsequent active behaviors on a certain day.

Retention analysis is an important indicator to measure the value of products to users.

Here, it is required to count the new retention rate. The new retention rate specifically refers to the ratio of the number of retained users to the number of new users. For example, when 100 users are added on June 14, 2020, 80 of the 100 people are active one day later (June 15, 2020), the number of retained users on June 14, 2020 is 80, and the one-day retention rate on June 14, 2020 is 80%.

It is required to count the retention rate from 1 to 7 days every day, as shown in the figure below.

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_retention;

CREATE EXTERNAL TABLE ads_user_retention (

 `dt` STRING COMMENT 'Statistical date',

 `create_date` STRING COMMENT 'User new date',

 `retention_day` BIGINT COMMENT 'Retention days as of current date',

 `retention_count` BIGINT COMMENT 'Number of retained users',

 `new_user_count` BIGINT COMMENT 'Number of new users',

 `retention_rate` DECIMAL(16,2) COMMENT 'Retention '

) COMMENT 'User retention rate'

ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'

LOCATION '/warehouse/gmall/ads/ads_user_retention/';

9.5 order subject

9.5.1 order statistics

The demand includes the total number of orders, the total amount of orders and the total number of people placing orders.

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_order_total;

CREATE EXTERNAL TABLE `ads_order_total` (

 `dt` STRING COMMENT 'Statistical date',

 `recent_days` BIGINT COMMENT 'Recent days,1:Last 1 day,7:Last 7 days,30:Last 30 days',

 `order_count` BIGINT COMMENT 'Number of orders',

 `order_amount` DECIMAL(16,2) COMMENT 'Order amount',

 `order_user_count` BIGINT COMMENT 'Number of orders'

) COMMENT 'Order statistics'

ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'

LOCATION '/warehouse/gmall/ads/ads_order_total/';

9.5.2 order statistics by Region

The demand includes the total number of orders and the total amount of orders in each province.

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_order_by_province;

CREATE EXTERNAL TABLE `ads_order_by_province` (

 `dt` STRING COMMENT 'Statistical date',

 `recent_days` BIGINT COMMENT 'Recent days,1:Last 1 day,7:Last 7 days,30:Last 30 days',

 `province_id` STRING COMMENT 'province ID',

 `province_name` STRING COMMENT 'Province name',

 `area_code` STRING COMMENT 'Area code',

 `iso_code` STRING COMMENT 'International standard area code',

 `iso_code_3166_2` STRING COMMENT 'International standard area code',

 `order_count` BIGINT COMMENT 'Number of orders',

 `order_amount` DECIMAL(16,2) COMMENT 'Order amount'

) COMMENT 'Order statistics by Region'

ROW FORMAT DELIMITED  FIELDS TERMINATED BY '\t'

LOCATION '/warehouse/gmall/ads/ads_order_by_province/';

Chapter 10 whole process scheduling

Posted by mogster on Fri, 03 Dec 2021 05:27:35 -0800