I. hbase -- basic principles and usage

Keywords: Big Data HBase Apache Hadoop hive

Hotspot issues with hbase data:

The solution is to preprocess the rowkey of the hot data, e.g. add a prefix, so that the hot rows are spread across multiple regions.

Pre-splitting and dynamic partitioning: when the table is first created, the data should be pre-partitioned and stored in different regions so the load is balanced from the start.

Example: if rowkeys are phone numbers, partitioning by the leading digits easily puts all numbers with the same prefix into a single region. Using the reversed phone number as the row key instead spreads the keys out much more randomly.
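As an illustration of the idea (a minimal sketch, not an HBase API; the class and method names here are made up), a client can derive the rowkey either by reversing the phone number or by prepending a small hash-based salt, so that consecutive numbers land in different regions:

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyUtil {
    // Reverse the phone number so that keys no longer share a common prefix
    public static byte[] reversedKey(String phone) {
        return Bytes.toBytes(new StringBuilder(phone).reverse().toString());
    }

    // Alternative: prepend a hash-based salt so rows spread over a fixed number of pre-split regions
    public static byte[] saltedKey(String phone, int buckets) {
        int bucket = (phone.hashCode() & Integer.MAX_VALUE) % buckets;
        return Bytes.toBytes(String.format("%02d_%s", bucket, phone));
    }
}

With salting, the same salt function has to be applied again when reading the row back.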

How load is rebalanced when hbase nodes are added or removed is covered later.

I. hbase overview

1.1 hbase introduction

HBase is a distributed, column-oriented open source database suitable for unstructured data storage. Unlike a traditional RDBMS, it organizes data by column family rather than by row.
Large: hundreds of millions of rows, millions of columns
Column oriented: storage and permission control are per column family, and column families are retrieved independently
Sparse: empty (null) columns occupy no storage space, so tables can be designed to be very sparse

1.2 HBase architecture

Figure 1.1 hbase architecture

As the figure shows, hbase uses hdfs as its data store. On top of hdfs sits hbase itself, with three main components: HMaster, HRegionServer and zookeeper. Let's look at the function of each component.

1.2.1 HMaster

HMaster does not store actual data; it manages the whole cluster. Normally there is only one HMaster in a cluster, plus a standby node if high availability is configured. Its functions are as follows:
1) monitor the RegionServers, check whether they are working normally, and update their status in the zk nodes
2) handle RegionServer failover: reassign the regions of a failed server to other RegionServers
3) handle metadata changes
4) handle the allocation or removal of regions
5) load balance the data during idle time
6) publish its own location to clients through Zookeeper
7) allocate new regions after a Region Split
8) manage the creation, deletion and modification of tables

1.2.2 HRegionServer

There are multiple HRegionServers in a cluster, and each HRegionServer manages several regions. Each region contains several stores; each store contains one MemStore and a set of StoreFiles, and each StoreFile is stored as an HFile. The MemStore holds data in memory; when the amount of data in the MemStore reaches a threshold, it is flushed to a StoreFile and thus written to hdfs.
Because MemStore data is not written to StoreFiles synchronously in real time, but only when certain conditions are met, the data in memory would be lost if the HRegionServer failed at that moment. On the other hand, writing every operation to hdfs synchronously would trigger IO constantly and perform very poorly. Hence the HLog: each HRegionServer has exactly one HLog, which records every data update and is written to hdfs in real time to prevent loss. The mechanism is similar to the binary log of mysql.
The functions of HRegionServer are summarized as follows:
1) store HBase's actual data
2) serve the Regions assigned to it
3) flush the cache to HDFS
4) maintain the HLog
5) perform compactions
6) handle Region splits

Seeing this, you probably have many questions, for example: what exactly is a region? Don't worry, we'll go into the details.

1.3 data storage model of HBase

1.3.1 data model of HBase

Generally speaking, in a normal RDBMS a table is stored like this. Take a student table with the columns id, name, sex and pwd:

id   name        sex     pwd
1    Zhang San   male    123
2    Li Si       female  456

In HBASE, the logical structure of the above table is roughly as follows:

row-key   column family1                             column family2
1         info: {name: "Zhang San", sex: "male"}     password: {pwd: 123}
2         info: {name: "Li Si", sex: "female"}       password: {pwd: 456}

A little confused? Don't worry, take it slowly. First, some terms:
rowkey:

This is the key of a row, and its design matters a great deal. It is not necessarily the id field of the original student table (in production the raw id is rarely used on its own).

column family:

    This is a new concept, sometimes literally translated from Chinese as "column cluster"; the standard term is column family, CF for short. It is a set of columns. A table can have multiple CFs (the CF names must be specified when the table is created, but column names need not be). A CF can contain any number of columns; columns are only specified when data is inserted, along with the CF they belong to. In other words, a column family supports dynamic expansion: there is no need to define the number or type of columns in advance. All values are stored as raw binary, and users convert them to the appropriate types themselves.
    For example, info and password above are CFs. CF info has two columns, name and sex, each with a value; CF password has only the pwd column. In hbase a column is also called a qualifier.
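To make the "columns are only specified at insert time" point concrete, here is a minimal sketch using the Java client covered in section 3.3 (the table name and values are just the example data above): only the column family info exists in the schema, yet the Put can introduce any qualifier it likes, including one nobody declared anywhere.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("student"))) {
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Zhang San"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("sex"), Bytes.toBytes("male"));
            // A brand-new column that was never declared in any schema:
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("nickname"), Bytes.toBytes("xiao san"));
            table.put(put);
        }
    }
}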

What we just saw is only the logical structure of a table in hbase; how does hbase actually store the table? As mentioned above, hbase is a columnar database, and that shows here.
The actual storage structure of hbase looks like this:

rowkey   cf:column      cell value   timestamp
1        info:name      Zhang San    1564393837300
1        info:sex       male         1564393810196
1        password:pwd   123          1564393788068
2        info:name      Li Si        1564393837300
2        info:sex       female       1564393810196
2        password:pwd   456          1564393788068

As you can see, each cell is stored as a separate entry in hbase. If a column has no value for a row, nothing is stored and no space is used. This is typical of hbase as a columnar store.
To locate a cell you need four coordinates: rowkey + cf + column + timestamp. The first three are easy to understand, but why the timestamp? Because a cell in hbase can keep multiple versions of its value, and rowkey + cf + column alone cannot distinguish those versions; the timestamp of the last update is added, which uniquely identifies one version of a cell. By default only one version is kept (the version count is actually configured on the column family, and all of its columns inherit it), and even if multiple values exist, only the latest one is returned by default. The number of versions to keep is configurable; when more values are inserted than the configured number of versions, the oldest version is dropped first.
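A minimal sketch of reading several versions of one cell through the Java client (assuming the student table above with VERSIONS set to 3 on the info family; the class name is just for illustration):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiVersionRead {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("student"))) {
            Get get = new Get(Bytes.toBytes("1"));
            get.setMaxVersions(3);    // ask for up to 3 versions instead of only the latest one
            Result result = table.get(get);
            // Each returned Cell carries its own timestamp, which is what distinguishes the versions
            List<Cell> cells = result.getColumnCells(Bytes.toBytes("info"), Bytes.toBytes("name"));
            for (Cell cell : cells) {
                System.out.println(Bytes.toString(CellUtil.cloneValue(cell)) + " @ " + cell.getTimestamp());
            }
        }
    }
}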

Do you think that's the physical storage structure of hbase? Not exactly. Let's go on

1.3.2 hbase data storage principle

(1)region

As we know, the HRegionServer is responsible for data storage. Internally it manages a series of HRegion objects, and each HRegion corresponds to one region of a Table (HRegion and region are used interchangeably below). A Table has at least one region, and a region is described by its key range [startKey, endKey), with the start key inclusive and the end key exclusive. Generally, based on the characteristics of the data, the table is divided into several partitions in advance and each partition is managed by one region; which RegionServer a region is assigned to is decided by the HMaster. An HRegion consists of multiple HStores, and each HStore corresponds to the storage of one Column Family of the Table. Each Column Family is therefore a separate storage unit, so columns with similar IO characteristics are best placed in the same Column Family for maximum efficiency.
Each HStore is composed of two parts: one MemStore and a set of StoreFiles. Let's look at these two parts.
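Tying this back to the pre-partitioning mentioned at the top of the article: a table can be created already split into several regions by passing split keys to Admin.createTable. A minimal sketch (the table name and split points are chosen arbitrarily for illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("student_presplit"));
            desc.addFamily(new HColumnDescriptor("info"));
            // Four split points produce five regions:
            // (-inf,"2"), ["2","4"), ["4","6"), ["6","8"), ["8",+inf)
            byte[][] splitKeys = {
                    Bytes.toBytes("2"), Bytes.toBytes("4"),
                    Bytes.toBytes("6"), Bytes.toBytes("8")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}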

(2)MemStore & StoreFiles

HStore is the core of HBase storage and consists of two parts: the MemStore and the StoreFiles. The MemStore is a sorted in-memory buffer: data written by users goes into the MemStore first, and when the MemStore is full it is flushed into a StoreFile (the underlying implementation is an HFile). When the number of StoreFiles grows past a threshold, a compact (merge) operation is triggered and multiple StoreFiles are merged into one; version merging and data deletion happen during this merge. So HBase only ever appends data, and all updates and deletes are actually carried out in later compactions, which lets a user's write return as soon as it reaches memory and keeps HBase I/O performance high. As StoreFiles are compacted, ever larger StoreFiles are formed; when a single StoreFile exceeds a certain size threshold, a Split is triggered: the current Region is split into two, the parent Region goes offline, and the two new child Regions are assigned by the HMaster to the appropriate HRegionServers, so the load of the original Region is spread over two Regions.
A point worth thinking about here: what is the compact operation actually for?

As we know, a StoreFile holds data flushed from the MemStore. Suppose a cell's value is 1 at first; the MemStore fills up and is flushed to a StoreFile. Then the value is changed to 2 and the MemStore is flushed again. Now the cell has multiple values across StoreFiles (this is not about cell multi-versioning). On the surface the data in the MemStore was modified, but for the underlying StoreFiles it was only an append, because appending is more efficient than modifying in place. The drawback is that the same cell now occupies storage for several obsolete values, so this is a space-for-time trade-off. When the number of StoreFiles grows to a certain amount, they are merged; the duplicate data is removed (only the latest value is kept, earlier ones are discarded), storage space is reclaimed, and only the latest data remains. So the merge process is actually where updates, modifications and deletions get completed.
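Compactions normally run automatically, but they can also be triggered by hand through the Admin API, which is handy for observing the merge behaviour described above. A minimal sketch (the table name is just the example table from earlier):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Minor compaction: merge some of the table's StoreFiles
            admin.compact(TableName.valueOf("student"));
            // Major compaction: rewrite each store's StoreFiles into one,
            // dropping deleted cells and versions beyond the configured limit
            admin.majorCompact(TableName.valueOf("student"));
        }
    }
}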

As mentioned above, MemStore data is only flushed to a StoreFile once its size reaches the threshold. What if the region server fails suddenly before the flush and the data in memory is lost? Don't worry, there's the hlog.

(3)hlog

Each HRegionServer has exactly one HLog object, shared by all of its regions. HLog implements a Write Ahead Log: every write operation is first recorded in the HLog, and the HLog is synced to disk in real time, so there is no need to worry about losing it on a crash. Only after the HLog write returns is the data written to the MemStore, which guarantees that every operation held in memory has already been recorded in the HLog. HLog files are rolled periodically, and old files whose data has already been persisted to StoreFiles are deleted.
HLog plays an important role in recovering data when a region server fails. When an HRegionServer terminates unexpectedly, the HMaster learns of it through Zookeeper. It first processes the leftover HLog files, splitting the log entries by region and placing them into the directories of the corresponding regions, and then reassigns the affected regions. While loading these regions, the HRegionServers that receive them discover the historical HLog entries that need processing, replay the HLog data into the MemStore, and then flush to StoreFiles to complete the recovery.
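The WAL behaviour can also be controlled per mutation from the client side. A minimal sketch (assuming the student table; skipping the WAL trades safety for speed and is only sensible for data that can be reloaded):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDurabilityExample {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("student"))) {
            Put safePut = new Put(Bytes.toBytes("3"));
            safePut.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Wang Wu"));
            safePut.setDurability(Durability.SYNC_WAL);   // sync the HLog entry before acknowledging the write
            table.put(safePut);

            Put riskyPut = new Put(Bytes.toBytes("4"));
            riskyPut.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Zhao Liu"));
            riskyPut.setDurability(Durability.SKIP_WAL);  // no HLog entry: lost if the RegionServer crashes before the flush
            table.put(riskyPut);
        }
    }
}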

1.3.3 physical storage file of HBase

First, each table has at least one region, and within each region every CF has its own store with its own StoreFiles. So different CFs are kept in physically separate files. In other words, the physical storage of the student table from the data model above actually looks like this:

The format is rowkey:cf:column:value:timestamp
hfile for cf info:
1:info:name:Zhang San:1564393837300
1:info:sex:male:1564393837300
2:info:name:Li Si:1564393837302
2:info:sex:female:1564393837302

hfile for cf password:
1:password:pwd:123:1564393837300
2:password:pwd:456:1564393837300

When we need to query a row of data, we traverse the hfiles of all regions and collect the entries with the same rowkey. hfiles are stored in hdfs directly in binary form, which keeps reads fast.
The hlog, by contrast, is stored underneath as a Sequence File.

1.4 reading and writing process of HBase

hbase has two special tables, -ROOT- and .META.: the former records the region information of the .META. table, and the latter records the region information of user tables. -ROOT- has since been removed because it was redundant; the .META. table can be located directly. The .META. table is the entry point of the whole hbase cluster, and read and write operations must consult it first.

1.4.1 reading process

1) the HRegionServers store the .META. table as well as the table data. To access table data, the Client first accesses zookeeper and finds the location of the .META. table, i.e. which HRegionServer holds it.
2) the Client then connects to that HRegionServer using the address it just obtained, reads .META., and gets the metadata stored in it.
3) with the information from the metadata, the client accesses the corresponding HRegionServer and scans its MemStore and StoreFiles to query the data.
4) finally, the HRegionServer returns the queried data to the Client.

1.4.2 writing process

1) the Client also accesses zookeeper first, finds the RegionServer where the .META. table is located, and reads the .META. information.
2) from that information the client determines which RegionServer and which region the data should be written to (the HMaster is not involved in the normal write path).
3) the client sends a write request to that RegionServer, which receives the request and responds.
4) the RegionServer first writes the data to the HLog to prevent data loss.
5) then the data is written to the MemStore.
6) once both the HLog and the MemStore writes succeed, the write is considered successful. During this process, if the MemStore reaches its threshold, its data is flushed to a StoreFile.
7) as StoreFiles accumulate, a Compact merge is triggered to merge them into a larger StoreFile. As the StoreFiles grow, the Region grows too; when it reaches the threshold, a Split is triggered and the Region is split in two.
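From the client's point of view the whole path above is hidden behind a single put call. For bulk loads the client can also batch mutations with a BufferedMutator; the buffered Puts are simply sent to the responsible RegionServers, where steps 4) to 6) happen as described. A minimal sketch (table and column names reuse the earlier examples):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchWriteExample {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("student"))) {
            for (int i = 0; i < 1000; i++) {
                Put put = new Put(Bytes.toBytes("row" + i));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("user" + i));
                mutator.mutate(put);    // buffered locally and sent to the RegionServers in batches
            }
            mutator.flush();            // push out whatever is still buffered
        }
    }
}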

II. hbase deployment

2.1 environmental preparation

Software               Version   Hosts (192.168.50.x/24)
zookeeper (deployed)   3.4.10    bigdata121(50.121), bigdata122(50.122), bigdata123(50.123)
hadoop (deployed)      2.8.4     bigdata121(50.121) is the namenode; bigdata122(50.122), bigdata123(50.123)
hbase                  1.3.1     bigdata121(50.121) is the HMaster; bigdata122(50.122), bigdata123(50.123)

2.2 start hbase deployment

On bigdata121:

Unzip hbase-1.3.1-bin.tar.gz

tar zxf hbase-1.3.1-bin.tar.gz -C /opt/modules/

Modify /opt/modules/hbase-1.3.1/conf/hbase-env.sh

export JAVA_HOME=/opt/modules/jdk1.8.0_144
# Disable the zookeeper bundled with hbase and use the separately installed zookeeper
export HBASE_MANAGES_ZK=false

Modify /opt/modules/hbase-1.3.1/conf/hbase-site.xml

<configuration>
<!-- Storage directory of hbase in hdfs -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://bigdata121:9000/hbase</value>
</property>

<!-- Run in distributed (cluster) mode -->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

<!-- HMaster port -->
<property>
<name>hbase.master.port</name>
<value>16000</value>
</property>

<!-- Server list of the zk cluster -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>bigdata121:2181,bigdata122:2181,bigdata123:2181</value>
</property>

<!-- Data directory of zk -->
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/opt/modules/zookeeper-3.4.10/zkData</value>
</property>

<!-- Maximum allowed clock skew between the HMaster and the regionservers, in milliseconds -->
<property>
<name>hbase.master.maxclockskew</name>
<value>180000</value>
</property>
</configuration>

Configure environment variables:

vim /etc/profile.d/hbase.sh
#!/bin/bash
export HBASE_HOME=/opt/modules/hbase-1.3.1
export PATH=$PATH:${HBASE_HOME}/bin

Then load it:
source /etc/profile.d/hbase.sh

After configuration, scp the whole hbase directory to the /opt/modules directory of bigdata122 and bigdata123.
Don't forget to configure environment variables.

2.3 start hbase cluster

Start / shut down the entire cluster

start-hbase.sh
stop-hbase.sh

Tips:
If the HMaster has died, you will find that stop-hbase.sh cannot shut down the cluster. In that case, shut down the remaining regionservers manually.

Start and shut down clusters and nodes separately:

hbase-daemon.sh start/stop master
hbase-daemon.sh start/stop regionserver

After startup, the hbase web management page is available at http://<HMaster IP>:16010

2.4 zookeeper cluster nodes

After the hbase cluster starts, a /hbase node is created in zookeeper, with a number of child nodes underneath that maintain the cluster state

[zk: localhost:2181(CONNECTED) 1] ls /hbase
[replication, meta-region-server, rs, splitWAL, backup-masters, table-lock, flush-table-proc, region-in-transition, online-snapshot, switch, master, running, recovering-regions, draining, namespace, hbaseid, table]

Among them:
rs
 This node holds the regionserver information; children are named in the format "hostname,port,id"
[zk: localhost:2181(CONNECTED) 6] ls /hbase/rs
[bigdata122,16020,1564564185736, bigdata121,16020,1564564193102, bigdata123,16020,1564564178848]
A child node here means the corresponding regionserver is running normally; when a regionserver goes offline its node is removed, because it is an ephemeral node. See the zookeeper series of articles for the characteristics of ephemeral nodes.

meta-region-server
 The value of this node holds the information of the region server storing the meta table

master
 Host information of the current HMaster

backup-masters
 The information of the standby master node is empty if it is not configured

namespace
 Each of the following child nodes corresponds to a namespace, which is equivalent to a database in an RDBMS

hbaseid
 value records the unique id of hbase cluster

table
 Each of the following child nodes corresponds to a table

2.5 node management of region server

2.5.1 add nodes

You can start a new node with the following command

hbase-daemon.sh start regionserver 

At first the new node holds no data. If the balancer is enabled at this point, the HMaster will move regions from other nodes to the new node, i.e. rebalance the data.
After starting the node, check the status of the balancer in hbase shell

Balancer "enabled" returns the current status of the balancer, which is false by default

Switch balancer on / off

balance_switch true/false

There is a small pitfall here:

There is a balance_switch 'status' command that I assumed would query the current state of the balancer. After getting burned by it and testing repeatedly, the conclusion is:
after the command runs, the balancer is set to false (switched off), regardless of what its state was before.
And the value the command returns is the previous state of the balancer, not the current one.
Honestly, a very confusing piece of command design.

So don't run this command casually; if you do, the balancer simply gets switched off on you.

2.5.2 offline node

When we want to take a node offline, the general steps are as follows:
Stop balancer first

balance_switch false

Then stop the regionserver on the node

hbase-daemon.sh stop regionserver

After the node is shut down, all regions that were on it become inaccessible and sit in a transition state. The corresponding ephemeral node under /hbase/rs on ZK disappears (see my earlier zk article for how ephemeral nodes behave). When the master notices the change in ZK, it detects that the regionserver is offline, automatically enables the balancer, and migrates the regions of the offline server to other servers.

The biggest drawback of this approach is that after the server shuts down, its regions are unavailable for a while. Because the data is kept in hdfs plus the hlog, migrating a region afterwards requires reading data back from hdfs and replaying the operations in the hlog to rebuild the complete region, and both of those steps are slow. As a result these regions can be inaccessible for quite a long time. For this reason hbase provides another way to take a node offline smoothly.

In the bin directory of hbase, execute

graceful_stop.sh <RegionServer-hostname>

This command first disables the balancer and then moves the regions off the server directly; only after all regions have been migrated is the server shut down. This makes full use of the region data still in memory, reduces the amount of data that has to be read back from hdfs, and avoids replaying the hlog, so it is much faster and the time during which regions are unavailable is much shorter.

III. use of hbase

Enter the command line:

hbase shell

View command help

hbase(main)> help

3.1 basic namespace operation command

See which tables are in the current namespace (the default namespace by default)

hbase(main)> list

See which namespaces exist; a namespace is similar to a database in an RDBMS

hbase(main)> list_namespace

View the tables in the specified namespace

hbase(main)> list_namespace_tables 'namespace_name'

Create a namespace

hbase(main)>create_namespace 'namespace'

To view namespace information:

hbase(main)> describe_namespace 'namespace'

3.2 table basic operation command

Create table

create 'namespace:table_name', 'cf1', 'cf2', ..., {PARAM1 => value, PARAM2 => value, ...}    if no namespace is specified, the default namespace is used
 Example:
hbase(main)> create 'student','info'    creates the student table with column family info
 hbase(main)> create 'student', {NAME => 'info', VERSIONS => 3}    creates the student table with column family info, keeping 3 versions

If different cf need different parameter settings, create the table in the following way:
create 'teacher_2',{NAME=>'info',VERSIONS=>3},{NAME=>'password',VERSIONS=>2}
Creates the table teacher_2 with CFs info and password, keeping 3 and 2 versions respectively

Insert data (update data is the same command, the same operation)

put 'namespace:table', 'rowkey', 'cf:column', 'value', [timestamp]
 If [timestamp] is not given, the current time is used. Only one cell can be inserted at a time.

 Example:
hbase(main)> put 'student','1001','info:name','Thomas'

View table data

scan 'namespace:table',{PARAM1=>value}

Example:
Scan the whole table: scan 'student'
Scan specified columns: scan 'student', {COLUMNS => ['info:name','info:sex']}
Limit the number of rows returned: scan 'student', {LIMIT => 1}    (in practice this returned n+1 rows)
 Return the data in a rowkey range: scan 'student', {STARTROW => '1001', STOPROW => '1002'}; STARTROW and STOPROW can also be used on their own
 Return the data in a timestamp range: scan 'student', {TIMERANGE => [1303668804, 1303668904]}

View table structure

desc 'namespace:table'

Example:
desc 'student'

The printed contents are as follows:
Table student is ENABLED
student
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
You can see the column family information of the table

View the data of the specified row or column (scan can also implement it)

get 'namespace:table','rowkey','cf:column',{PARAM1 => value, ...}

Example:
get 'student','1001','info:name'
get 'student','1001','info:name', {VERSIONS => 2}    view the two most recent versions

Note: this command only queries data within a single row (the same rowkey)

Delete data

delete 'namespace:table','rowkey','cf:column'
Used to delete data for a specified field

deleteall 'namespace:table','rowkey'
Deletes all the data of the given rowkey

Disable / enable / view table status

Check whether a table is enabled: is_enabled 'namespace:table'
Enable a table: enable 'namespace:table'
Disable a table: disable 'namespace:table'    (a disabled table can be neither read nor written)

Clear table data

Disable the table before emptying its data
truncate 'namespace:table'

Delete table

Make sure the table is in the disabled state first; a table cannot be dropped while it is enabled
drop 'namespace:table'

Statistical row number

count  'namespace:table'

Change table information

alter 'namespace:table',{PARAM1 => value, ...}
Example:
alter 'student', {NAME => 'info', VERSIONS => 5}       changes the number of versions kept by column family info to 5
 alter 'student', {NAME => 'info', METHOD => 'delete'}  deletes the column family info (columns are not part of the schema, so only whole column families can be removed)
 alter 'student', {NAME => 'address'}                   adds the column family address

Check whether a table exists

exists 'namespace:table'

View the node status of the current hbase cluster

status
 The output looks like this:
1 active master, 1 backup masters, 3 servers, 0 dead, 17.0000 average load
 That is: the number of active and backup masters, the number of live regionservers, the number of dead ones, and the average load

3.3 using hbase java api

Create a new maven project and add the following dependencies to pom.xml

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.3.1</version>
</dependency>

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.3.1</version>
</dependency>

3.3.1 judge whether the table exists

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class HbaseTest01 {
    public static Configuration conf;
    static{
        //Use the HBaseConfiguration factory to create the configuration and set the zk cluster address, port and parent znode
        conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "bigdata121");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("zookeeper.znode.parent", "/hbase");
    }

    public static boolean isTableExist(String tableName) throws IOException {
        //Create a connection object from conf
        Connection connection = ConnectionFactory.createConnection(conf);
        //Get the admin object through the connection to manage tables
        HBaseAdmin admin = (HBaseAdmin) connection.getAdmin();
        boolean exists = admin.tableExists(tableName);
        //Release resources before returning
        admin.close();
        connection.close();
        return exists;
    }
}
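A quick way to try it (a minimal sketch; "student" is just the table used in the shell examples earlier) is to add a main method to the same class:

    public static void main(String[] args) throws IOException {
        System.out.println("student exists: " + isTableExist("student"));
    }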

3.3.2 create table

public static void createTable(String tableName, String... columnFamily) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        HBaseAdmin admin = (HBaseAdmin)connection.getAdmin();
        //Create a table description object
        HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf(tableName));
        for (String cf: columnFamily) {
            //Each cf creates a field description object and adds it to the table description object
            hTableDescriptor.addFamily(new HColumnDescriptor(cf));
        }
        //Create table
        admin.createTable(hTableDescriptor);

    }

//Note: if no namespace is specified when creating a table, it goes into the default namespace. To use a specific namespace, name the table in the form "namespace:tableName", with a colon separator

3.3.3 delete table

  public static void deleteTable(String tableName) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        HBaseAdmin admin = (HBaseAdmin)connection.getAdmin();
        //Disabled table
        admin.disableTable(tableName);
        //Delete table
        admin.deleteTable(tableName);
    }

3.3.4 inserting data

    public static void putData(String tableName, String rowKey, String columnFamily, String column, String value) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        //Get table management object through connection object
        Table table = connection.getTable(TableName.valueOf(tableName));
        //Create row object
        Put put = new Put(rowKey.getBytes());
        //Add column to row, write value
        put.addColumn(columnFamily.getBytes(), column.getBytes(), value.getBytes());
        //Commit rows to table for changes
        table.put(put);
        table.close();

    }

3.3.5 delete rows

    public static void deleteData(String tableName, String... rowKey) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        //Get table management object through connection object
        Table table = connection.getTable(TableName.valueOf(tableName));
        //Create delete object
        ArrayList<Delete> deleteList = new ArrayList<>();
        for (String row: rowKey) {
            deleteList.add(new Delete(row.getBytes()));
        }
        //Commit row to table for deletion
        table.delete(deleteList);
        table.close();

    }

3.3.6 scan table data, optionally restricted to a CF or "CF:COLUMN"

public static void scanData(String tableName) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        //Get table management object through connection object
        Table table = connection.getTable(TableName.valueOf(tableName));
        //To create a scanner, you can set startRow and stoprow to read the data within the specified key range
        Scan scan = new Scan();
        //Scan table with scanner
        ResultScanner scanner = table.getScanner(scan);
        for (Result result: scanner) {
            Cell[] cells = result.rawCells();
            for (Cell cell:cells) {
                //Get rowkey
                System.out.println("Row key:" + Bytes.toString(CellUtil.cloneRow(cell)));
                //Get column family
                System.out.println("Column family" + Bytes.toString(CellUtil.cloneFamily(cell)));
                System.out.println("column:" + Bytes.toString(CellUtil.cloneQualifier(cell)));
                System.out.println("value:" + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
        table.close();
        connection.close();
    }

//To scan only a specific CF or a specific "CF:COLUMN", add the column or the CF to the scanner:
scan.addColumn(family, column);
scan.addFamily(cf.getBytes());
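Besides restricting the columns, the scanner can also restrict the row range and apply filters, which is what the comment about startRow/stopRow refers to. A minimal sketch (the key range is hypothetical; PageFilter caps the number of rows):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOptions {
    // Build a scan over the key range ["1001", "1003") that returns at most 10 rows
    public static Scan rangeScan() {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("1001"));   // inclusive
        scan.setStopRow(Bytes.toBytes("1003"));    // exclusive
        scan.setFilter(new PageFilter(10));        // applied per region server, so add a client-side cap too if it must be exact
        return scan;
    }
}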

3.3.7 get a row of data

public static void getRow(String tableName, String rowKey) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        //Get table management object through connection object
        Table table = connection.getTable(TableName.valueOf(tableName));
        Get get = new Get(rowKey.getBytes());
        Result result = table.get(get);
        for (Cell cell:result.rawCells()) {
            //Get rowkey
            System.out.println("Row key:" + Bytes.toString(CellUtil.cloneRow(cell)));
            //Get column family
            System.out.println("Column family" + Bytes.toString(CellUtil.cloneFamily(cell)));
            System.out.println("column:" + Bytes.toString(CellUtil.cloneQualifier(cell)));
            System.out.println("value:" + Bytes.toString(CellUtil.cloneValue(cell)));
            System.out.println("time stamp:" + cell.getTimestamp());
        }

        table.close();
        connection.close();
  }

3.3.8 get "CF:COLUMN" specified in a row

public static void getRowCF(String tableName, String rowKey, String family, String column) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        //Get table management object through connection object
        Table table = connection.getTable(TableName.valueOf(tableName));
        Get get = new Get(rowKey.getBytes());
        get.addColumn(family.getBytes(),column.getBytes());
        Result result = table.get(get);
        for (Cell cell:result.rawCells()) {
            //Get rowkey
            System.out.println("Row key:" + Bytes.toString(CellUtil.cloneRow(cell)));
            //Get column family
            System.out.println("Column family" + Bytes.toString(CellUtil.cloneFamily(cell)));
            System.out.println("column:" + Bytes.toString(CellUtil.cloneQualifier(cell)));
            System.out.println("value:" + Bytes.toString(CellUtil.cloneValue(cell)));
            System.out.println("time stamp:" + cell.getTimestamp());
        }
        table.close();
        connection.close();
    }

3.3.9 create a namespace

public static void createNamespace(String namespace) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        Admin admin = connection.getAdmin();
        //Create a namespace description object
        NamespaceDescriptor province = NamespaceDescriptor.create(namespace).build();
        //Create a namespace
        admin.createNamespace(province);
    }

3.4 MapReduce and hbase

3.4.1 environmental preparation

View the dependencies required by hbase to run MapReduce task

hbase mapredcp

Add dependency path to environment variable

export HADOOP_CLASSPATH=`hbase mapredcp`

3.4.2 official MapReduce examples

(1) count how many rows a table has

cd /opt/modules/hbase-1.3.1/lib
 yarn jar  hbase-server-1.3.1.jar  rowcounter student

//The execution result shows that:
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
ROWS=3

(2) use MapReduce to import the data in hdfs into hbase

vim /tmp/fruit_input.txt
1001    apple   red
1002    pear    yellow
1003    orange  orange

//Upload to hdfs
hdfs dfs -mkdir /input_fruit
hdfs dfs -put /tmp/fruit_input.txt /input_fruit/

Create the target table in hbase:
hbase(main)> create 'fruit_input','info'

yarn jar hbase-server-1.3.1.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color fruit_input hdfs://bigdata121:9000/input_fruit
//Explanation: -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color specifies how the imported fields map to the rowkey and the columns, separated by commas

//View data:
hbase(main):002:0> scan 'fruit_input'
ROW        COLUMN+CELL             
1001      column=info:color, timestamp=1564710439420, value=red
1001      column=info:name, timestamp=1564710439420, value=apple
1002      column=info:color, timestamp=1564710439420, value=yellow
1002      column=info:name, timestamp=1564710439420, value=pear
1003      column=info:color, timestamp=1564710439420, value=orange
1003      column=info:name, timestamp=1564710439420, value=orange

3.4.3 read data from hbase, analyze it, and write the results back to hbase

Requirement: use MR to import part of the column family data of the fruit table into the fruit_mr table; extract the name and color columns of the info column family into fruit_mr.
The fruit table is as follows:

ROW        COLUMN+CELL
1001      column=account:sells, timestamp=1564393837300, value=20   
1001      column=info:color, timestamp=1564393810196, value=red       
1001      column=info:name, timestamp=1564393788068, value=apple     
1001      column=info:price, timestamp=1564393864714, value=10       
1002      column=account:sells, timestamp=1564393937058, value=100   
1002      column=info:color, timestamp=1564393908332, value=orange   
1002      column=info:name, timestamp=1564393897787, value=orange     
1002      column=info:price, timestamp=1564393918141, value=8

Create output table in advance:

hbase(main):002:0> create 'fruit_mr','info'

mapper:

package HBaseMR;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/**
 * TableMapper<ImmutableBytesWritable, Put> specifies the output KV types of the map
 * The input KV types are fixed because the input comes from an hbase table, so they do not need to be specified
 *
 * The rowkey serves as the key, and its type is ImmutableBytesWritable
 */
public class HBaseMrMapper extends TableMapper<ImmutableBytesWritable, Put> {
    /**
     * Each cell holds one entry of the row as it is physically stored in hbase
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
        Put put = new Put(key.get());

        //Keep only the name and color columns of column family info and add them to the Put object
        for (Cell cell : value.rawCells()) {
            if ("info".equals(Bytes.toString(CellUtil.cloneFamily(cell)))) {
                if ("name".equals(Bytes.toString(CellUtil.cloneQualifier(cell))) || "color".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                    put.add(cell);
                }
            }
        }

        //Only write the put to the Context if it is not empty; otherwise writing an empty Put to hbase would fail at the end
        if (! put.isEmpty()) {
            context.write(key, put);
        }

    }
}

reducer:

package HBaseMR;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

/**
 * Extends TableReducer<KeyIn, ValueIn, KeyOut>
 * The output value type of reduce does not need to be specified here, because it is always a Put (Mutation)
 */
public class HBaseMrReducer extends TableReducer<ImmutableBytesWritable, Put, NullWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Put> values, Context context) throws IOException, InterruptedException {

        //Write the same key to Context
        for (Put p : values) {
            context.write(NullWritable.get(), p);
        }

    }
}

runner:

package HBaseMR;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HBaseMrRunner extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        //Create job object
        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf, this.getClass().getSimpleName());
        job.setJarByClass(HBaseMrRunner.class);

        //Create a scanner to scan hbase table data
        Scan scan = new Scan();
        scan.setCacheBlocks(false);
        scan.setCaching(500);

        //Set job parameters, including map and reduce
        //Set map input, class, output kv class
        TableMapReduceUtil.initTableMapperJob(
                "fruit",
                scan,
                HBaseMrMapper.class,
                ImmutableBytesWritable.class,
                Put.class,
                job
        );

        //Set the reducer class and output the table
        TableMapReduceUtil.initTableReducerJob(
                "fruit_mr",
                HBaseMrReducer.class,
                job
        );

        job.setNumReduceTasks(1);

        //Submit job
        boolean isSuccess = job.waitForCompletion(true);

        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        //Call the run method in runner
        int status = ToolRunner.run(conf, new HBaseMrRunner(), args);
        System.exit(status);

    }
}

Use maven packaging to run on the cluster:

yarn jar hbasetest-1.0-SNAPSHOT.jar HBaseMR.HBaseMrRunner

3.4.4 import hdfs text data into hbase

Import the data of /input_fruit/fruit_input.txt in hdfs into the hbase table fruit_hdfs_mr
The text format is as follows:

1001    apple   red
1002    pear    yellow
1003    orange  orange
 Use "\ t" to separate fields

Create the table first:

create 'fruit_hdfs_mr','info'

mapper:

package HDFSToHBase;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class ToHBaseMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    ImmutableBytesWritable keyOut = new ImmutableBytesWritable();
    //Put value = new Put();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split("\t");

        keyOut.set(fields[0].getBytes());
        Put put = new Put(fields[0].getBytes());
        put.addColumn("info".getBytes(), "name".getBytes(), fields[1].getBytes());
        put.addColumn("info".getBytes(), "color".getBytes(), fields[2].getBytes());

        context.write(keyOut, put);
    }
}

reducer:

package HDFSToHBase;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

public class ToHBaseReducer extends TableReducer<ImmutableBytesWritable, Put, NullWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Put> values, Context context) throws IOException, InterruptedException {
        for (Put p : values) {
            context.write(NullWritable.get(), p);
        }
    }
}

runner:

package HDFSToHBase;

import HBaseMR.HBaseMrRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToHBaseRunner extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        //Create job object
        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf, this.getClass().getSimpleName());
        job.setJarByClass(ToHBaseRunner.class);

        //Set data input path
        Path inPath = new Path("/input_fruit/fruit_input.txt");
        FileInputFormat.addInputPath(job, inPath);

        //Set map class, KV type of output
        job.setMapperClass(ToHBaseMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        //Set the reduce class and output the table
        TableMapReduceUtil.initTableReducerJob(
                "fruit_hdfs_mr",
                ToHBaseReducer.class,
                job
        );

        job.setNumReduceTasks(1);

        boolean isSuccess = job.waitForCompletion(true);

        return isSuccess?0:1;

    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        int status = ToolRunner.run(configuration, new ToHBaseRunner(), args);
        System.exit(status);

    }
}

Package and run jar package:

yarn jar hbasetest-1.0-SNAPSHOT.jar HDFSToHBase.ToHBaseRunner

3.5 combination of hive and hbase

The hive version used is 1.2. Please refer to the previous hive related articles for deployment of hive.

3.5.1 environment configuration

For hive to operate on hbase, some of the dependency jars under hbase's lib directory must be copied into hive's lib directory. Hive also needs to reach hbase through the zookeeper cluster, so the corresponding zk jar is copied as well.

hbase dependencies:
cp /opt/modules/hbase-1.3.1/lib/hbase-* /opt/modules/hive-1.2.1-bin/lib/
cp /opt/modules/hbase-1.3.1/lib/htrace-core-3.1.0-incubating.jar /opt/modules/hive-1.2.1-bin/lib/

zookeeper dependency:
cp /opt/modules/hbase-1.3.1/lib/zookeeper-3.4.6.jar /opt/modules/hive-1.2.1-bin/lib/

Next, modify the hive configuration file conf/hive-site.xml, and add the following configuration items

<!-- Address and port of the zk cluster -->
<property>
    <name>hive.zookeeper.quorum</name>
    <value>bigdata121,bigdata122,bigdata123</value>
    <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>

<property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
    <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>

3.5.2 correlation and problems of hive and hbase

(1) create an association table in hive:

create table student_hbase_hive(
id int,
name string,
sex string,
score double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:sex,info:score")
TBLPROPERTIES("hbase.table.name"="hbase_hive_student");

//Statement explanation:
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
//The storage handler class used for hbase-backed tables

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:sex,info:score")
//Define the mapping relationship between the fields of tables in hive and the fields of hbase, and map in order

TBLPROPERTIES("hbase.table.name"="hbase_hive_student");
//Properties of the table created in hbase; here the hbase table name is set to hbase_hive_student

Error reporting episode
During creation, the following errors are reported:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V

That message is not very informative. To get a more detailed error, print the debug output by starting hive like this:

hive -hiveconf hive.root.logger=DEBUG,console

Then execute the create statement again; a lot of output appears. Scrolling down, the key message is:

ERROR exec.DDLTask: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily......

This means that the method org.apache.hadoop.hbase.HTableDescriptor.addFamily with that signature cannot be found at runtime.
I then pulled the corresponding hbase dependency with maven in IDEA and confirmed that the addFamily method with that signature does not exist there. The problem is obvious: some of the hbase-related jars used by hive are not compatible with the hbase version we run. The key jar bridging hbase and hive, the one that provides the org.apache.hadoop.hive.hbase.HBaseStorageHandler class used above, is hive-hbase-handler-1.2.1.jar. Guessing that this jar was simply too old for the current HBase, I downloaded the newer hive-hbase-handler-2.3.2.jar from maven, replaced the original jar in hive's lib directory, restarted hive, and ran the create statement again. It executed normally. Perfect.

After the create statement succeeds, you can see in both hive and hbase that the new tables exist. Insert data into this table from either hbase or hive and it is visible on both sides.

(2) import data into association table

When importing data into an associated table in hive, you cannot use the load command directly; data can only be inserted from another table with

insert into table TABLE_NAME select * from ANOTHER_TABLE

Or insert data line by line. There's not much to say here.

(3) existing tables in hbase associated with hive

Because the query capabilities hbase itself provides are limited, it is sometimes hard to run SQL-style statistics on the data directly. You can therefore associate an existing hbase table with hive and do the statistical analysis with the much more complete HQL in hive. The association table is created in the same way as above, so it is not repeated here.

(4) the nature of correlation between hive and hbase

In essence the data is always stored in hbase; hive only operates on the hbase table through the handler interface. But there is a pitfall: columns in hive have types, such as int, while hbase has no types at all, everything is just bytes stored in binary form. If you query such data directly in hbase you may see what looks like garbled output, because hbase has no idea what type the bytes represent. This catches people out from time to time, so watch out for it.
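A small sketch of why the output can look garbled (hypothetical values, purely to illustrate that hbase only sees bytes): the binary encoding of the number 10 and the string "10" are completely different byte sequences, and hbase cannot tell which one a cell contains.

import org.apache.hadoop.hbase.util.Bytes;

public class TypeEncodingDemo {
    public static void main(String[] args) {
        byte[] asInt = Bytes.toBytes(10);        // 4 bytes: \x00\x00\x00\x0A
        byte[] asString = Bytes.toBytes("10");   // 2 bytes: the characters '1' and '0'

        // The hbase shell prints non-printable bytes as escape sequences, which is the "garbled" look
        System.out.println("int encoding:    " + Bytes.toStringBinary(asInt));
        System.out.println("string encoding: " + Bytes.toStringBinary(asString));
        // Reading the int bytes back as a string does not give "10"
        System.out.println("int bytes as string: " + Bytes.toString(asInt));
    }
}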

3.6 sqoop--MysqlToHbase

Sqoop was deployed together with hive earlier, so for deploying sqoop refer to the hive-related articles I wrote before.
Modify the configuration file conf/sqoop-env.sh

export HBASE_HOME=/opt/modules/hbase-1.3.1

Requirement: extract the table data in mysql into hbase.
Create mysql table and import data:

CREATE DATABASE db_library;
CREATE TABLE db_library.book(
id int(4) PRIMARY KEY NOT NULL AUTO_INCREMENT, 
name VARCHAR(255) NOT NULL, 
price VARCHAR(255) NOT NULL);

INSERT INTO db_library.book (name, price) VALUES('Lie Sporting', '30');  
INSERT INTO db_library.book (name, price) VALUES('Pride & Prejudice', '70');  
INSERT INTO db_library.book (name, price) VALUES('Fall of Giants', '50');

Create target table in hbase

create 'hbase_book','info'

Import through sqoop

sqoop import \
--connect jdbc:mysql://bigdata11:3306/db_library \
--username root \
--password 000000 \
--table book \
--columns "id,name,price" \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "hbase_book" \
--num-mappers 1 \
--split-by id

Where --column-family specifies the target column family, --hbase-row-key specifies which field is mapped to the rowkey, and --hbase-table is the destination table name.
