This article is a supplement to the HBase portion of the [Hard Big Data Learning Route] guide, a learning path from zero to big data expert (fully upgraded version).
1 high availability
In HBase, HMaster monitors the lifecycle of every HRegionServer and balances the load across RegionServers. If the HMaster goes down, the whole HBase cluster falls into an unhealthy state, and it cannot keep working like that for long. HBase therefore supports a high-availability configuration for HMaster.
1. Stop the HBase cluster (if it is not running, skip this step)
[atguigu@hadoop102 hbase]$ bin/stop-hbase.sh
2. Create the backup-masters file in the conf directory
[atguigu@hadoop102 hbase]$ touch conf/backup-masters
3. Configure the high-availability HMaster node in the backup-masters file
[atguigu@hadoop102 hbase]$ echo hadoop103 > conf/backup-masters
4. scp the entire conf directory to other nodes
[atguigu@hadoop102 hbase]$ scp -r conf/ hadoop103:/opt/module/hbase/
[atguigu@hadoop102 hbase]$ scp -r conf/ hadoop104:/opt/module/hbase/
5. Open the web UI to verify
http://hadoop102:16010
2 pre-partitioning
Each Region maintains a StartRow and an EndRow. If a newly written row's RowKey falls within the range maintained by a Region, that Region stores the row. Following this principle, we can roughly plan in advance which partitions the data will land in, which improves HBase performance.
1. Manually set the pre-partitions
hbase> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
2. Generate pre-partitions from a hexadecimal sequence
create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
3. Pre-partition according to rules set in a file
Create the splits.txt file as follows:
aaaa
bbbb
cccc
dddd
Then execute:
create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
4. Create pre-partitions using the Java API
//Customize an algorithm that generates a series of hash values and stores them in a two-dimensional byte array
byte[][] splitKeys = ...; // produced by a custom hash function
//Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
//Create an HTableDescriptor instance
HTableDescriptor tableDesc = new HTableDescriptor(tableName);
//Create the pre-partitioned HBase table from the HTableDescriptor instance and the two-dimensional array of split keys
hAdmin.createTable(tableDesc, splitKeys);
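For reference, a minimal end-to-end sketch against the newer HBase 2.x client API; the table name staff4 and the concrete split points below are illustrative assumptions, not values from the original course:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Split points: rows below 1000 go to the first region, 1000-1999 to the second, and so on
            byte[][] splitKeys = {
                Bytes.toBytes("1000"), Bytes.toBytes("2000"),
                Bytes.toBytes("3000"), Bytes.toBytes("4000")
            };
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("staff4"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                .build();
            // Create the table with five pre-split regions
            admin.createTable(desc, splitKeys);
        }
    }
}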
3 RowKey design
A row of data is uniquely identified by its RowKey, and which pre-partition a row is stored in depends on which region's range the RowKey falls into. The main goal of RowKey design is to distribute data evenly across all regions and, to a certain extent, prevent data skew. Below are some common RowKey design schemes.
1. Generate random numbers, hashes, or hash values
For example:
An original rowKey of 1001 becomes dd01903921ea24941c26a48f2cec24e0bb0e8cc7 after SHA1.
An original rowKey of 3001 becomes 49042c54de64a1e9bf0b33e00245660ef92dc7bd after SHA1.
An original rowKey of 5001 becomes 7b61dec07e02c188790670af43e717f0f46e8913 after SHA1.
Before doing this, we usually sample the data set to decide which hashed rowKeys to use as the split points between partitions.
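A minimal sketch of hashing a rowkey, assuming SHA-1 via java.security.MessageDigest (the article does not say which hashing utility is used):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeyHash {
    // Return the SHA-1 digest of the original rowkey as a lowercase hex string
    public static String sha1Hex(String rowKey) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(rowKey.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hash the sample rowKey from the example above
        System.out.println(sha1Hex("1001"));
    }
}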
2. String reversal
20170524000001 becomes 10000042507102
20170524000002 becomes 20000042507102
This also scatters, to a certain extent, rowkeys that are written in incrementally (such as the timestamp-like keys above).
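A minimal sketch of the reversal, assuming plain Java string manipulation:

String rowKey = "20170524000001";
// Reverse the characters so that the fast-changing digits lead the key
String reversed = new StringBuilder(rowKey).reverse().toString();
// reversed is now "10000042507102"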
3. String concatenation
20170524000001_a12e
20170524000001_93i7
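A minimal sketch of concatenating a salt onto the base key; the 4-character random suffix generated here is an illustrative assumption mirroring the examples above:

String base = "20170524000001";
// Append a short random salt so that keys sharing the same timestamp spread across regions
String salt = String.format("%04x", new java.util.Random().nextInt(0x10000));
String rowKey = base + "_" + salt;   // e.g. 20170524000001_a12e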
4 memory optimization
HBase operations require a lot of memory, since tables can be cached in memory. Generally, about 70% of the available memory is allocated to HBase's Java heap. However, it is not recommended to allocate a very large heap, because if a GC pause lasts too long, the RegionServer becomes unavailable for a long time. In general, 16 GB to 48 GB of heap memory is enough (the heap size is typically set in hbase-env.sh, for example via HBASE_HEAPSIZE). If the framework takes up so much memory that system memory runs short, the framework itself will also be dragged down by the system services.
5 basic optimization
1. Allow appending content to HDFS files
hdfs-site.xml, hbase-site.xml
Property: dfs.support.append
Explanation: Enabling append synchronization on HDFS works well with HBase's data synchronization and persistence. The default value is true.
2. Optimize the maximum number of files a DataNode is allowed to open
hdfs-site.xml
Property: dfs.datanode.max.transfer.threads
Explanation: HBase generally operates on a large number of files at the same time. Set this to 4096 or higher, depending on the cluster size and the volume of data operations. Default: 4096
3. Optimize the wait time for high-latency data operations
hdfs-site.xml
Property: dfs.image.transfer.timeout
Explanation: If latency is very high for a data operation, the socket needs to wait longer. It is recommended to set this value higher (the default is 60000 milliseconds) so that the socket is not dropped due to a timeout.
4. Optimize data writing efficiency
mapred-site.xml
Properties: mapreduce.map.output.compress, mapreduce.map.output.compress.codec
Explanation: Enabling these two settings can greatly improve file write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.
5. Set the number of RPC listeners
hbase-site.xml
Property: hbase.regionserver.handler.count
Explanation: The default value is 30. It specifies the number of RPC listeners and can be adjusted according to the number of client requests; increase this value when there are many read and write requests.
6. Optimize HStore file size
hbase-site.xml
Property: hbase.hregion.max.filesize
Explanation: The default value is 10737418240 (10 GB). If you need to run HBase MapReduce tasks, you can reduce this value, because one region corresponds to one map task; if a single region is too large, the map task takes too long to execute. This value means that when an HFile grows to this size, the region is split in two.
7. Optimize HBase client cache
hbase-site.xml
Property: hbase.client.write.buffer
Explanation: Specifies the HBase client write buffer size. Increasing this value reduces the number of RPC calls, but consumes more client memory. In general, set a moderate buffer size in order to reduce the number of RPCs.
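A minimal client-side sketch, assuming the buffer is set programmatically on the client Configuration and writes go through a BufferedMutator; the table staff1 and column family info reuse the earlier examples, the 4 MB value and the sample row are illustrative, and imports are as in the sketch in section 2:

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.client.write.buffer", 4 * 1024 * 1024);   // 4 MB client write buffer
try (Connection conn = ConnectionFactory.createConnection(conf);
     BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("staff1"))) {
    Put put = new Put(Bytes.toBytes("1000_0001"));
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
    mutator.mutate(put);   // buffered locally and flushed in batches, reducing RPC calls
}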
8. Specify the number of rows fetched by each scan.next call
hbase-site.xml
Property: hbase.client.scanner.caching
Explanation: Specifies the default number of rows fetched by the scan.next method. The larger the value, the greater the memory consumption.
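The same setting can also be applied per scan in client code; a minimal sketch, assuming an already opened Connection conn as in the previous sketch and the usual org.apache.hadoop.hbase.client imports:

Scan scan = new Scan();
scan.setCaching(500);   // fetch 500 rows per RPC round trip; larger values trade client memory for fewer RPCs
try (Table table = conn.getTable(TableName.valueOf("staff1"));
     ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
        // process each row here
    }
}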
9. flush, compact and split mechanisms
When a MemStore reaches its threshold, the data in the MemStore is flushed into a StoreFile; the compaction mechanism merges the small flushed files into larger StoreFiles; the split mechanism cuts an oversized Region into two once the Region reaches its threshold.
Properties involved:
hbase.hregion.memstore.flush.size: 134217728
That is, 128 MB is the default MemStore flush threshold.
This parameter causes all MemStores of a single HRegion to be flushed when their total size exceeds the specified value. A RegionServer processes flushes asynchronously by adding requests to a queue, following a producer-consumer model. The problem is that when the queue cannot be consumed fast enough and a large backlog of requests builds up, memory may rise sharply, and in the worst case an OOM is triggered.
hbase.regionserver.global.memstore.upperLimit: 0.4
hbase.regionserver.global.memstore.lowerLimit: 0.38
That is, when the total amount of memory used by all MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files. MemStores are flushed in descending order of size until the memory used by MemStores drops just below the lowerLimit value.
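As an illustrative example with made-up numbers: on a RegionServer with a 16 GB heap, an upperLimit of 0.4 means the global flush kicks in once all MemStores together use about 0.4 × 16 GB = 6.4 GB, and flushing proceeds, largest MemStore first, until total usage falls just below 0.38 × 16 GB ≈ 6.08 GB.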