Production Optimization Practice of hive3. X on spark 3.0

Keywords: Hadoop hive Data Warehouse

1 data tilt

  • Most tasks are completed quickly, and only one or a few tasks are executed slowly or even fail. This phenomenon is data skew.
  • The data skew is divided into single table query with GroupBy field and two table (or multi table) Join query.

1.1 single table data skew optimization

1.1.1 the map side performs aggregation - GroupBy operation, and the aggregation function is count or sum

set = true;
set hive.groupby.mapaggr.checkinterval = 100000;
set hive.groupby.skewindata = true;
# When the option is set to true, the generated query plan will have two Mr Jobs.

1.1.2 increase the number of Reduce (multiple keys cause data skew at the same time)

Method 1:
set hive.exec.reducers.bytes.per.reducer = 256000000
set hive.exec.reducers.max = 1009
 calculation reducer Formula of number
N=min(Parameter 2, total input data/Parameter 1)(Parameter 2 refers to 1009 above, and the value of parameter 1 is 256 M)

Method 2:
set mapreduce.job.reduces = 15;

1.2 multi table Join data skew optimization

1.2.1 split tilt key

# If the number of records corresponding to the join key exceeds this value, it will be split. The value is set according to the specific amount of data
set hive.skewjoin.key=100000;
# If the join process is skewed, it should be set to true
set hive.optimize.skewjoin=false;
If it's on, in Join In the process Hive The count will exceed the threshold hive.skewjoin.key(Default (100000) tilt key The corresponding line is temporarily written into the file, and then another one is started job do map join Generate results. through The parameter can also control the second job of mapper Quantity, 10000 by default.

1.2.2 MapJoin

  • MapJoin is to directly distribute the smaller tables on both sides of the Join to the memory of each Map process. The Join operation is carried out in the Map process, so there is no need to carry out the Reduce step, which improves the speed.
set; #The default is true
set hive.mapjoin.smalltable.filesize=25000000;

2. Performance optimization of large amount of data

2.1 zoning table

  • The partition in Hive is a subdirectory, which divides a large data set into small data sets according to business needs.
  • The partition field cannot be data that already exists in the table. The partition field can be regarded as a pseudo column of the table.
  • Partition must be specified when the partition table loads data
  • When querying, select the specified partition required by the query through the expression in the WHERE clause
  • Create secondary partition
hive (default)> create table dept_partition2(
 deptno int,
 dname string,
 loc string)
 partitioned by (day string, hour string)
 row format delimited fields terminated by '\t';
  • Dynamic partition: when importing data, the database will automatically insert the data into the corresponding partition according to the value of the partition field
# Enable the dynamic partition function (true by default, enabled)
set hive.exec.dynamic.partition=true; 
# Set to non strict mode (the mode of dynamic partition, strict by default, means that at least one partition must be specified as a static partition, and the non strict mode means that all partition fields are allowed to use dynamic partitions.)
set hive.exec.dynamic.partition.mode=nonstrict;

2.2 barrel distribution table

  • Bucket splitting is another technique for breaking up a data set into parts that are easier to manage. The partition is for the storage path of data, and the bucket is for the data file.
create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

3 HQL query performance optimization

3.1 file storage format and compression algorithm

Combining compression speed, compression ratio, column storage and row storage, in actual project development, the data storage format of hive table is generally orc or parquet. Generally, snappy and lzo are selected for compression.

create table log_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
row format delimited fields terminated by '\t'
stored as parquet

3.2 multiple modes

  • A pile of SQL is scanned from the same table for different logic. There are areas that can be optimized: if there are n SQL, each SQL execution will scan this table once.
insert int t_ptn partition(city=A). select id,name,sex, age from student 
where city= A;
insert int t_ptn partition(city=B). select id,name,sex, age from student 
where city= B;
insert int t_ptn partition(city=c). select id,name,sex, age from student 
where city= c;

Amend to read:
from student
insert int t_ptn partition(city=A) select id,name,sex, age where city= A
insert int t_ptn partition(city=B) select id,name,sex, age where city= B

3.3 vectorization: vector computing technology

  • When calculating scan, filter and aggregation, vectorization technology sets the increment size of batch processing to 1024 lines, so as to achieve higher efficiency than a single record.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

3.4 in/exists efficient usage left semi join

select, from a where in (select from b);
select, from a where exists (select id from b where =;

-- left semi join realization
select, from a left semi join b on =;

3.5 start CBO cost optimizer

  • Hive optimizes the execution logic and physical execution plan of each query before providing final execution. Further optimization is performed according to the query cost, resulting in potentially different decisions: how to sort connections, what type of connections to perform, parallelism, and so on
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

3.6 predicate push down

  • Execute the where predicate logic in the SQL statement as early as possible to reduce the amount of data processed downstream
 set hive.optimize.ppd = true; #Predicate push down, the default is true

3.7 large table join- Sort Merge Bucket Join

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

# Bucket table join
insert overwrite table jointable
select, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b on =;

4 Hive Job performance optimization

4.1 hive map side optimization

4.1.1 increase the number of maps for complex files

  • Computeslatesize (math.max (minsize, math. Min (maxsize, blocksize)) = blocksize = 128M formula,
    Adjust the maxSize maximum. If the maximum maxSize is lower than the block size, the number of map s can be increased.
  • set mapreduce.input.fileinputformat.split.maxsize=100;
  • 4.1.2 small file consolidation

  • The size of the merged file. The default is 256M set hive.merge.size.per.task = 268435456;
    When the average size of the output file is less than this value, start an independent map reduce task to merge the file
    set hive.merge.smallfiles.avgsize = 16777216;
  • Merge small files before map execution
set hive.input.format=;
  • Merge small files at the end of the map only task. The default is true
set hive.merge.mapfiles = true;
  • Merge small files at the end of the map reduce task. The default is false
set hive.merge.mapredfiles = true;

4.1 reasonably set the number of reduce

  • These two principles also need to be considered when setting the number of reduce: deal with a large amount of data and use the appropriate number of reduce;
    Make the amount of data processed by a single reduce task appropriate;
# The amount of data processed by each Reduce is 256MB by default
set hive.exec.reducers.bytes.per.reducer = 256000000
# The maximum number of reduce for each task is 1009 by default
set hive.exec.reducers.max = 1009
# Formula for calculating reducer number
N=min(Parameter 2, total input data/Parameter 1)(Parameter 2 refers to 1009 above, and the value of parameter 1 is 256 M)
# Set the number of Reduce for each job
set mapreduce.job.reduces = 15;

4.3 parallel execution

  • Hive transforms a query into one or more stages. Such stages can be MapReduce stage, sampling stage, merging stage and limit stage. Or other stages that hive may need during execution. By default, hive executes only one phase at a time.
set hive.exec.parallel=true; //Open task parallel execution. The default value is false
set hive.exec.parallel.thread.number=16; //The maximum parallelism allowed for the same sql is 8 by default

4.4 local mode

  • Hive can handle all tasks on a single machine through local mode. For small data sets, the execution time can be significantly reduced
# Set the maximum input data amount of local mr. when the input data amount is less than this value, the mode of local mr is adopted. The default is 134217728, i.e. 128M
# Set the maximum number of input files of local mr. when the number of input files is less than this value, the local mr mode is adopted, and the default is 4

5. Spark operation parameter tuning

  • This parameter indicates the number of CPU cores available to each Executor. The value should not be set too large, because the underlying layer of Hive is stored in HDFS, and HDFS sometimes does not handle high concurrent writes well, which is easy to cause race condition. According to experience and practice, it is reasonable to set it between 3 and 6
  • Since an Executor needs a YARN Container to run, it is also necessary to ensure spark.executor.cores
    The value of cannot be greater than the maximum number of cores that a single Container can apply for, that is, the value of yarn.scheduler.maximum-allocationvcores.
  • These two parameters represent the amount of in heap memory and out of heap memory available to each Executor, respectively
  • Hive officially provides an empirical formula for calculating the total memory of the Executor, as follows:
  • If we have a total of 10 32C/128G nodes and follow the above configuration (i.e. each section)
    Point hosts 7 executors), so theoretically, we can set spark.executor.instances to 70 to make the cluster
    Maximize the use of resources. However, in fact, it is generally set to be smaller (the recommended value is about half of the theoretical value, such as 40),
    Because the Driver also needs to occupy resources, and a YARN cluster often has to carry other resources except Hive on Spark
  • It is recommended to set the spark.dynamicAllocation.enabled parameter to true to enable the Executor dynamic allocation.
set hive.execution.engine=spark;
set spark.executor.memory=11.2g;
set spark.yarn.executor.memoryOverhead=2.8g;
set spark.executor.cores=4;
set spark.executor.instances=40;
set spark.dynamicAllocation.enabled=true;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
  • This parameter indicates the number of CPU cores available to each Driver. In most cases, setting to 1 is sufficient.
  • These two parameters represent the amount of in heap memory and out of heap memory available to each Driver. According to the resource surplus process
    Generally, the total amount is controlled between 512MB and 4GB, and the "two eight points" of the Executor memory is used
    Configuration mode ". For example, spark.driver.memory can be set to about 819MB, and spark.driver.memoryOverhead can be set to
    About 205MB, which adds up to exactly 1G

Posted by ben2k8 on Thu, 16 Sep 2021 14:35:16 -0700