[hard HBase] HBase Optimization: pre-partitioning / RowKey design / memory optimization / basic optimization

Keywords: Big Data HBase Machine Learning

This article supplements the HBase part of the [hard big data learning route] learning guide from zero to big data expert (fully upgraded version).

1 high availability

In HBase, the HMaster monitors the lifecycle of the HRegionServers and balances the load across RegionServers. If the HMaster fails, the whole HBase cluster falls into an unhealthy state and cannot keep working for long. For this reason, HBase supports a highly available HMaster configuration.

1. Stop the HBase cluster (skip this step if it is not running)

[atguigu@hadoop102 hbase]$ bin/stop-hbase.sh

2. Create the backup-masters file in the conf directory

[atguigu@hadoop102 hbase]$ touch conf/backup-masters

3. Configure the backup HMaster node in the backup-masters file

[atguigu@hadoop102 hbase]$ echo hadoop103 > conf/backup-masters

4. scp the entire conf directory to other nodes

[atguigu@hadoop102 hbase]$ scp -r conf/ 
[atguigu@hadoop102 hbase]$ scp -r conf/ 

5. Open the web page to verify


2 pre-partitioning

Each Region maintains a StartRow and an EndRow. If the RowKey of newly added data falls within the range maintained by a Region, that Region takes over the data. Based on this principle, we can roughly plan in advance which partition the data will land in, which improves HBase performance.

1. Manually set the pre-partitions

Hbase> create 'staff1','info','partition1',SPLITS => 

2. Generate pre-partitions from a hexadecimal sequence

create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 

3. Pre-partition according to the rules set in a file

Create the splits.txt file as follows:


Then execute:

create 'staff3','partition3',SPLITS_FILE => 'splits.txt'

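4. Create pre-partitions using the Java API

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

//Use a custom algorithm to generate a series of hash values, stored in a two-dimensional byte array (generateSplitKeys() is a placeholder for that algorithm)
byte[][] splitKeys = generateSplitKeys();
//Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
//Create an HTableDescriptor instance (tableName is a String)
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(tableName));
//A table needs at least one column family; 'info' follows the shell examples above
tableDesc.addFamily(new HColumnDescriptor("info"));
//Create a pre-partitioned HBase table from the HTableDescriptor instance and the two-dimensional array of hash values
hAdmin.createTable(tableDesc, splitKeys);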
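
The splitKeys array above is left to a custom algorithm. As a minimal sketch of one possible choice, the class below produces evenly spaced 8-character hexadecimal split points using only the JDK. The class name SplitKeySketch and the method hexSplitKeys are hypothetical names made up for illustration; they are not part of the HBase API, and the even spacing only helps if the RowKeys really are uniformly distributed hex strings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: evenly spaced hex split points.
final class SplitKeySketch {

    // Returns numRegions - 1 split keys that cut the 32-bit hex key space
    // (00000000..ffffffff) into numRegions ranges of equal width.
    static byte[][] hexSplitKeys(int numRegions) {
        if (numRegions < 2) {
            return new byte[0][]; // a single Region needs no split key
        }
        long range = 1L << 32; // size of the key space
        byte[][] splitKeys = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            long boundary = range * i / numRegions;
            splitKeys[i - 1] = String.format("%08x", boundary)
                    .getBytes(StandardCharsets.UTF_8);
        }
        return splitKeys;
    }

    public static void main(String[] args) {
        // Prints 40000000, 80000000 and c0000000, one per line.
        for (byte[] key : hexSplitKeys(4)) {
            System.out.println(new String(key, StandardCharsets.UTF_8));
        }
    }
}
```

The returned array has the byte[][] shape that createTable expects as its second argument.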

3 RowKey design

The RowKey is the unique identifier of a piece of data. Which partition the data is stored in depends on which pre-partition range its RowKey falls into. The main purpose of RowKey design is to distribute the data evenly across all Regions and to prevent data skew to a certain extent. Below are some common RowKey design schemes.

1. Generate random numbers, hashes, or hash values

For example:
The original rowKey 1001, after SHA1, becomes:
The original rowKey 3001, after SHA1, becomes:
The original rowKey 5001, after SHA1, becomes:
Before doing this, we usually sample the data set to decide which hashed rowKeys to use as the boundary values of each partition.
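
As a rough illustration of the hashing scheme, the sketch below computes the SHA-1 digest of a rowKey with the JDK's MessageDigest and renders it as a hex string. The class name HashedRowKeySketch and the method sha1Hex are hypothetical, and UTF-8 encoding of the key is an assumption; the article does not say how the key is encoded before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: turns a rowKey into its SHA-1 hex digest.
final class HashedRowKeySketch {

    static String sha1Hex(String rowKey) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            // Assumption: the rowKey is encoded as UTF-8 before hashing.
            byte[] hash = digest.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Prints each original rowKey next to its 40-character digest.
        for (String rowKey : new String[] {"1001", "3001", "5001"}) {
            System.out.println(rowKey + " -> " + sha1Hex(rowKey));
        }
    }
}
```

Hashing spreads adjacent keys apart, but it also gives up ordered range scans over the original keys, so it suits workloads dominated by point lookups.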

2. String inversion

20170524000001 is converted to 10000042507102
20170524000002 is converted to 20000042507102

This also spreads sequentially written data across Regions to a certain extent.
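
The inversion itself is a one-liner. The sketch below reproduces the two conversions listed above; ReversedRowKeySketch and reverse are hypothetical names used only for this illustration.

```java
// Hypothetical helper: reverses a rowKey so the fast-changing digits come first.
final class ReversedRowKeySketch {

    static String reverse(String rowKey) {
        return new StringBuilder(rowKey).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("20170524000001")); // prints 10000042507102
        System.out.println(reverse("20170524000002")); // prints 20000042507102
    }
}
```

Because the last digits of a timestamp-like key change fastest, placing them first sends consecutive writes to different key ranges instead of to one hot Region.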

3. String splicing


4 memory optimization

HBase needs a lot of memory to run, since tables can be cached in memory. Generally, 70% of the total available memory is allocated to HBase's Java heap. However, a very large heap is not recommended: if a GC pause lasts too long, the RegionServer stays unavailable for the whole pause. Generally, 16~48G of heap is enough. If the framework takes so much memory that the system itself runs short, the framework will in turn be dragged down by the system services.

5 basic optimization

1. Allow appending content to HDFS files


Property: dfs.support.append
Explanation: enabling HDFS append synchronization works very well with HBase data synchronization and persistence. The default value is true.

2. Optimize the maximum number of open files allowed by a DataNode


Property: dfs.datanode.max.transfer.threads
Explanation: HBase generally operates on a large number of files at the same time. Depending on the size of the cluster and the volume of data operations, set this to 4096 or higher. Default: 4096

3. Optimize the wait time for high-latency data operations


Property: dfs.image.transfer.timeout
Explanation: if a data operation has very high latency and the socket needs to wait longer, set this to a larger value (default 60000 milliseconds) so that the socket does not time out.

4. Optimize data writing efficiency


Explanation: enabling these two properties can greatly improve file-writing efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.

5. Set the number of RPC listeners


Property: hbase.regionserver.handler.count
Explanation: the default value is 30. It specifies the number of RPC listeners and can be adjusted to the number of client requests. Increase this value when there are many read and write requests.

6. Optimize HStore file size


Property: hbase.hregion.max.filesize
Explanation: the default value is 10737418240 (10 GB). If you need to run MapReduce jobs on HBase, you can reduce this value, because one Region corresponds to one map task, and a single Region that is too large makes its map task run too long. This value means that once an HFile reaches this size, the Region is split into two.
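
To make the trade-off concrete, here is a back-of-the-envelope sketch. It assumes a table with a single column family whose data is spread evenly, so the Region count is at least the table size divided by the split threshold, and it takes the one-map-task-per-Region rule from the text. RegionCountSketch and minRegions are hypothetical names, and the 100 GB table size is an invented figure.

```java
// Hypothetical estimate: lower bound on Regions (and so on map tasks).
final class RegionCountSketch {

    // No store may stay above maxFileSizeBytes, so at least this many Regions exist.
    static long minRegions(long tableBytes, long maxFileSizeBytes) {
        return (tableBytes + maxFileSizeBytes - 1) / maxFileSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        long tableBytes = 100 * oneGb; // invented example size

        // Default threshold of 10 GB: at least 10 Regions, so at least 10 map tasks.
        System.out.println(minRegions(tableBytes, 10 * oneGb)); // prints 10
        // Threshold lowered to 2 GB: at least 50 Regions, each map task reads less data.
        System.out.println(minRegions(tableBytes, 2 * oneGb)); // prints 50
    }
}
```

In practice the Region count is higher than this bound, because a freshly split Region holds roughly half the threshold.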

7. Optimize HBase client cache


Property: hbase.client.write.buffer
Explanation: specifies the size of the HBase client write buffer. Increasing this value reduces the number of RPC calls but consumes more memory. Generally, we set a reasonable buffer size to reduce the number of RPCs.

8. Specify the number of rows fetched by scan.next when scanning HBase


Property: hbase.client.scanner.caching
Explanation: specifies the default number of rows fetched by the scan.next method. The larger the value, the greater the memory consumption.

9. flush, compact and split mechanisms

When a MemStore reaches its threshold, the data in it is flushed to a StoreFile. The compact mechanism merges the small flushed files into larger StoreFiles. Split divides an oversized Region into two when the Region reaches its threshold.

Properties involved:

hbase.hregion.memstore.flush.size: 134217728

That is, 128M is the default MemStore threshold. When the total size of all MemStores in a single HRegion exceeds this value, all MemStores of that HRegion are flushed. A RegionServer handles flushes asynchronously by adding requests to a queue, following the producer-consumer model. The problem is that when the queue cannot be consumed fast enough and a large backlog of requests builds up, memory use may rise sharply, and in the worst case an OOM is triggered.

hbase.regionserver.global.memstore.upperLimit: 0.4
hbase.regionserver.global.memstore.lowerLimit: 0.38

That is, when the total memory used by MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files. MemStores are flushed in descending order of size until the memory used by MemStores falls slightly below lowerLimit.
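
The upperLimit/lowerLimit rule can be simulated in a few lines. The sketch below is a simplified model of the behaviour described here, not HBase's actual flush code: it treats both limits as fractions of the heap, and once the total reaches the upper limit it picks MemStores largest-first until the total drops below the lower limit. GlobalFlushSketch and selectFlushes are hypothetical names, and the sizes in main are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model of the global MemStore flush rule; not HBase's real code.
final class GlobalFlushSketch {

    // Returns the MemStore sizes (in MB) chosen for flushing, largest first.
    static List<Long> selectFlushes(List<Long> memstoreMb, long heapMb,
                                    double upperLimit, double lowerLimit) {
        long total = 0;
        for (long size : memstoreMb) {
            total += size;
        }
        List<Long> flushed = new ArrayList<>();
        if (total < heapMb * upperLimit) {
            return flushed; // upper limit not reached, nothing to flush
        }
        List<Long> sorted = new ArrayList<>(memstoreMb);
        sorted.sort(Comparator.reverseOrder()); // descending order of size
        for (long size : sorted) {
            if (total < heapMb * lowerLimit) {
                break; // usage is now below the lower limit
            }
            total -= size;
            flushed.add(size);
        }
        return flushed;
    }

    public static void main(String[] args) {
        // Heap of 1000 MB: upper limit 400 MB, lower limit 380 MB.
        List<Long> memstores = Arrays.asList(100L, 128L, 30L, 90L, 60L); // 408 MB in total
        System.out.println(selectFlushes(memstores, 1000, 0.4, 0.38)); // prints [128]
    }
}
```

In this example a single flush of the largest MemStore brings the total from 408 MB down to 280 MB, which is already below the 380 MB lower limit.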

Posted by svihas on Wed, 15 Sep 2021 19:31:12 -0700