Solving the problem of Chinese garbled code in CDH hive table building comments

Keywords: hive

1. Problem description

After joining the new company, the author found that hive itself did not support Chinese annotations and was garbled when he was preparing to build the data warehouse, because the current big data started late and the data warehouse was still in the primary construction stage. The cluster was built with CDH 5.10.0.

show create table equipment;
CREATE EXTERNAL TABLE `equipment`(
  `id` bigint COMMENT '????', 
  `customer_id` bigint COMMENT '??id', 
  `eq_id` bigint COMMENT 'equipment id', 
  `create_time` bigint COMMENT '??????')
COMMENT '??????????'
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://cloud-11:8020/big-data/ods/xxx/xxx/xxx/equipment'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'numFiles'='0', 
  'numRows'='-1', 
  'orc.compress'='LZO', 
  'rawDataSize'='-1', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1633659470')

DESC equipment;
id	bigint	?????
customer_id	bigint	??id
eq_id	bigint	??id
create_time	bigint	??????

2. Problem analysis

When this problem occurs, I need to consider where the metadata information comes from, that is, where the data hive reads when executing show create table xxx and desc xxx. Obviously, these two commands are actually the schema of the table in the query warehouse, and those similar to the schema belong to metadata, Therefore, we think that this may be due to the inconsistency between the character encoding settings of our metabase and the encoding method used by hive. We know that hive uses utf-8 encoding by default. Since the author currently uses mysql as metadata storage in hive, we checked the encoding method of the metadata database we set:

We found that the encoding format used by the database metabase is utf8, which is no problem. So I checked the metadata table information related to field comments, table comments, partition comments, table name comments, view comments and database name comments:

CREATE TABLE `TABLE_PARAMS` (
  `TBL_ID` bigint(20) NOT NULL,
  `PARAM_KEY` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `PARAM_VALUE` varchar(4000) CHAR
  ACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`TBL_ID`,`PARAM_KEY`),
  KEY `TABLE_PARAMS_N49` (`TBL_ID`),
  CONSTRAINT `TABLE_PARAMS_FK1` FOREIGN KEY (`TBL_ID`) REFERENCES `TBLS` (`TBL_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `PARTITION_PARAMS` (
  `PART_ID` bigint(20) NOT NULL,
  `PARAM_KEY` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `PARAM_VALUE` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`PART_ID`,`PARAM_KEY`),
  KEY `PARTITION_PARAMS_N49` (`PART_ID`),
  CONSTRAINT `PARTITION_PARAMS_FK1` FOREIGN KEY (`PART_ID`) REFERENCES `PARTITIONS` (`PART_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `PARTITION_KEYS` (
  `TBL_ID` bigint(20) NOT NULL,
  `PKEY_COMMENT` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `PKEY_NAME` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `PKEY_TYPE` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `INTEGER_IDX` int(11) NOT NULL,
  PRIMARY KEY (`TBL_ID`,`PKEY_NAME`),
  KEY `PARTITION_KEYS_N49` (`TBL_ID`),
  CONSTRAINT `PARTITION_KEYS_FK1` FOREIGN KEY (`TBL_ID`) REFERENCES `TBLS` (`TBL_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

 CREATE TABLE `INDEX_PARAMS` (
  `INDEX_ID` bigint(20) NOT NULL,
  `PARAM_KEY` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `PARAM_VALUE` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`INDEX_ID`,`PARAM_KEY`),
  KEY `INDEX_PARAMS_N49` (`INDEX_ID`),
  CONSTRAINT `INDEX_PARAMS_FK1` FOREIGN KEY (`INDEX_ID`) REFERENCES `IDXS` (`INDEX_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `TBLS` (
  `TBL_ID` bigint(20) NOT NULL,
  `CREATE_TIME` int(11) NOT NULL,
  `DB_ID` bigint(20) DEFAULT NULL,
  `LAST_ACCESS_TIME` int(11) NOT NULL,
  `OWNER` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `RETENTION` int(11) NOT NULL,
  `SD_ID` bigint(20) DEFAULT NULL,
  `TBL_NAME` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `TBL_TYPE` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `VIEW_EXPANDED_TEXT` mediumtext,
  `VIEW_ORIGINAL_TEXT` mediumtext,
  `LINK_TARGET_ID` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`TBL_ID`),
  UNIQUE KEY `UNIQUETABLE` (`TBL_NAME`,`DB_ID`),
  KEY `TBLS_N50` (`SD_ID`),
  KEY `TBLS_N49` (`DB_ID`),
  KEY `TBLS_N51` (`LINK_TARGET_ID`),
  CONSTRAINT `TBLS_FK1` FOREIGN KEY (`SD_ID`) REFERENCES `SDS` (`SD_ID`),
  CONSTRAINT `TBLS_FK2` FOREIGN KEY (`DB_ID`) REFERENCES `DBS` (`DB_ID`),
  CONSTRAINT `TBLS_FK3` FOREIGN KEY (`LINK_TARGET_ID`) REFERENCES `TBLS` (`TBL_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `DBS` (
  `DB_ID` bigint(20) NOT NULL,
  `DESC` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `DB_LOCATION_URI` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `NAME` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `OWNER_NAME` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `OWNER_TYPE` varchar(10) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`DB_ID`),
  UNIQUE KEY `UNIQUE_DATABASE` (`NAME`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

We found the problem. It turns out that these metadata tables use the latin1 encoding method, resulting in inconsistent metadata character encoding methods. After the investigation, I believe everyone has a better solution.

Solution:

Through the above analysis, it is obvious that we try to change the character encoding mode of these metadata tables to utf8. As follows:

#Modify database character set
alter database metastore  character set utf8;
 
#Modify field comment character set
ALTER TABLE COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
#Modify table comment character set
ALTER TABLE TABLE_PARAMS modify column PARAM_VALUE varchar(40000) character set utf8;
#Modify partition parameters and support partition construction in Chinese
ALTER TABLE PARTITION_PARAMS modify column PARAM_VALUE varchar(40000) character set utf8;
ALTER TABLE PARTITION_KEYS modify column PKEY_COMMENT varchar(40000) character set utf8;
#Modify the table name annotation and support Chinese representation
ALTER TABLE INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
#Modify the view to support Chinese
ALTER TABLE TBLS modify COLUMN VIEW_EXPANDED_TEXT mediumtext CHARACTER SET utf8;
ALTER TABLE TBLS modify COLUMN VIEW_ORIGINAL_TEXT mediumtext CHARACTER SET utf8;
 
#Modify database name comments
ALTER TABLE `DBS` CHANGE COLUMN `DESC` `DESC` VARCHAR(4000) CHARACTER SET 'utf8' NULL DEFAULT NULL ;

4. Effect

Here is the effect picture after the author solves it. As follows:


This only provides a solution to the problem. I hope it will be helpful to you.

Posted by quimbley on Thu, 07 Oct 2021 21:01:11 -0700