HBase Integration with Hive and Phoenix, and Node Management

Integrating HBase with Hive

HBase vs. Hive

	Hive	HBase
Characteristics	SQL-like data warehouse	NoSQL (key-value)
Use cases	Offline data analysis and cleansing	Online serving workloads
Latency	High	Low
Storage	HDFS	HDFS

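Using HBase and Hive Together

Case 1: Create a Hive table that is linked to an HBase table, so that inserting data into the Hive table also writes it to the HBase table.

Create the Hive table and link it to HBase in the same statement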

CREATE TABLE hive_hbase_emp_table1(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table1");

When the statement completes, check both Hive and HBase: the corresponding table now exists in each.
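The hbase.columns.mapping string is matched to the Hive columns by position: the first entry, :key, takes the first column (empno) as the row key, and each later entry names the column family and qualifier for the next column. The sketch below only illustrates that pairing in plain Python; pair_columns is a hypothetical helper, not part of Hive or HBase.

```python
# Hive column names and the mapping string, copied from the CREATE TABLE statement.
hive_columns = ["empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno"]
mapping = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno"


def pair_columns(columns, mapping):
    """Pair each Hive column with its HBase target by position (illustrative only)."""
    targets = mapping.split(",")
    if len(targets) != len(columns):
        # One mapping entry is needed per Hive column.
        raise ValueError(
            f"{len(columns)} Hive columns but {len(targets)} mapping entries"
        )
    return dict(zip(columns, targets))


pairs = pair_columns(hive_columns, mapping)
for column, target in pairs.items():
    print(f"{column:10s} -> {target}")
```

A mismatch between the number of columns and the number of mapping entries is a common cause of errors when writing this kind of DDL, which is why the helper checks the counts first.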

Create an intermediate staging table in Hive to load the data file into

Note: LOAD DATA cannot write directly into the Hive table that is linked to HBase

CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';

Load the data into the Hive staging table

hive> load data local inpath '/home/vinx/data/emp.txt' into table emp;

Use insert to copy the data from the staging table into the Hive table linked to HBase

hive> insert into table hive_hbase_emp_table1 select * from emp;

Verify that the data now appears in both the Hive table and the linked HBase table

Hive:

hive> select * from hive_hbase_emp_table1;

HBase:

hbase> scan 'hbase_emp_table1'

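Case 2: A table, hbase_emp_table1, already exists in HBase. Create an external table in Hive that is linked to it, so that Hive can be used to analyze the data stored in that HBase table.

Note: complete Case 1 before starting this one, since it creates and populates hbase_emp_table1.

Create the external table in Hive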

CREATE EXTERNAL TABLE relevance_hbase_emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno") 
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table1");

Once the table is linked, Hive functions and queries can be used to analyze the data

hive (default)> select * from relevance_hbase_emp;

Sqoop Integration: MySQL to HBase

Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.

Relevant parameters:

--column-family <family>

Sets the target column family for the import.

--hbase-create-table

If specified, creates missing HBase tables, so the table does not have to be created in HBase by hand beforehand.

--hbase-row-key <col>

Specifies which input column to use as the row key. If the input table has a composite key, give a comma-separated list of the composite key attributes. (Note: choose a column whose values are unique, otherwise rows with the same row key overwrite one another.)

--hbase-table <table-name>

Specifies an HBase table to use as the target instead of HDFS.

--hbase-bulkload

Enables bulk loading.
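As a rough illustration of how these flags fit together, the sketch below assembles a Sqoop command line in Python. The helper sqoop_hbase_import_args is hypothetical, the connection values are the ones used in the case that follows, and the password is deliberately left out of the argument list.

```python
import shlex


def sqoop_hbase_import_args(jdbc_url, username, table, hbase_table,
                            column_family, row_key, create_table=False):
    """Build the argument list for a Sqoop import into HBase (hypothetical helper)."""
    args = [
        "bin/sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--hbase-table", hbase_table,      # target HBase table instead of HDFS
        "--column-family", column_family,  # family that receives the imported columns
        "--hbase-row-key", row_key,        # input column used as the row key
    ]
    if create_table:
        args.append("--hbase-create-table")
    return args


args = sqoop_hbase_import_args(
    jdbc_url="jdbc:mysql://hadoop001:3306/db_library",
    username="root",
    table="book",
    hbase_table="hbase_book",
    column_family="info",
    row_key="id",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(args))
```

Building the list first and quoting it with shlex.join keeps values that contain spaces or special characters intact when the command is pasted into a shell.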

Case: extract data from an RDBMS into HBase.

Configure sqoop-env.sh by adding the following line:

export HBASE_HOME=/home/vinx/app/hbase-1.3.1

In MySQL, create a database db_library with one table, book

CREATE DATABASE db_library;

CREATE TABLE db_library.book(
id int(4) PRIMARY KEY NOT NULL AUTO_INCREMENT, 
name VARCHAR(255) NOT NULL, 
price VARCHAR(255) NOT NULL);

Insert some rows into the table

INSERT INTO db_library.book (name, price) VALUES('Lie Sporting', '30');

INSERT INTO db_library.book (name, price) VALUES('Pride & Prejudice', '70');

INSERT INTO db_library.book (name, price) VALUES('Fall of Giants', '50');

Run the Sqoop import

$ bin/sqoop import \
--connect jdbc:mysql://hadoop001:3306/db_library \
--username root \
--password 123456 \
--table book \
--columns "id,name,price" \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "hbase_book" \
--num-mappers 1 \
--split-by id

Note: Sqoop 1.4.6 can create the HBase table automatically only with HBase versions earlier than 1.0.1. With a newer HBase, such as the 1.3.1 used here, create the table manually before running the import:

hbase> create 'hbase_book','info'

After the import, scan the table in HBase to see the imported rows

hbase> scan 'hbase_book'
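To make the result of the import easier to picture, here is a small sketch of how one MySQL row could end up as HBase cells: the id column becomes the row key and the remaining columns become cells in the info family, stored as UTF-8 strings. It assumes the row-key column is not also written as a cell, which is Sqoop's default behaviour as far as I know; to_hbase_cells is a hypothetical helper, not Sqoop code.

```python
# The three rows inserted into db_library.book (id is AUTO_INCREMENT, starting at 1).
rows = [
    {"id": 1, "name": "Lie Sporting", "price": "30"},
    {"id": 2, "name": "Pride & Prejudice", "price": "70"},
    {"id": 3, "name": "Fall of Giants", "price": "50"},
]


def to_hbase_cells(row, row_key, family):
    """Turn one relational row into (row key, {family:qualifier -> value}) as bytes."""
    key = str(row[row_key]).encode("utf-8")
    cells = {
        f"{family}:{column}": str(value).encode("utf-8")
        for column, value in row.items()
        if column != row_key  # assumption: the key column is not duplicated as a cell
    }
    return key, cells


for row in rows:
    key, cells = to_hbase_cells(row, row_key="id", family="info")
    print(key, cells)
```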

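Integrating HBase with Phoenix

About Phoenix

Phoenix can be thought of as a query engine for HBase. It was open-sourced by salesforce.com and later donated to Apache. It acts as a Java middleware layer that lets developers access the NoSQL database HBase much as they would access a relational database through JDBC.

The tables and data that Phoenix operates on are stored in HBase. Phoenix only needs its tables to be mapped to HBase tables; after that, its tools can be used to read and write the data.

In practice, Phoenix can be seen as an alternative syntax for working with HBase. Although a Java program can connect to Phoenix over JDBC and operate on HBase that way, it is not suited to OLTP in production: online transaction processing needs low latency, and although Phoenix optimizes its HBase queries, its latency is still significant. It is therefore better used for OLAP, with the results written back to storage afterwards.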

Using Phoenix

Start Phoenix

sqlline.py hadoop001:2181

List tables

> !table

Create a table

> create table test(id integer not null primary key,name varchar);

Drop a table

> drop table test;

Insert data (the primary key must always be supplied)

> upsert into test values(1,'andy');

> upsert into test(id,name) values(2,'toms');

Query data (Phoenix folds the unquoted table name to upper case, so the HBase table is TEST)

phoenix > select * from test;

hbase > scan 'TEST'

Exit Phoenix

> !q

Delete data

> delete from test where id=4;

Use the sum function

> select sum(id) from test;

Add a column

> alter table test add address varchar;

Drop a column

> alter table test drop column address;
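Phoenix uses UPSERT rather than separate INSERT and UPDATE statements: a row is created if its primary key is new and updated in place if the key already exists. The toy model below mimics only that behaviour with a Python dictionary; ToyTable is a hypothetical stand-in and has nothing to do with the Phoenix implementation.

```python
class ToyTable:
    """In-memory stand-in for a table with one primary-key column (not Phoenix itself)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, pk, **columns):
        if pk is None:
            # A row cannot be written without its primary key.
            raise ValueError("primary key may not be null")
        # Insert a new row, or merge the new values into the existing one.
        self.rows.setdefault(pk, {}).update(columns)

    def delete(self, pk):
        # Deleting a key that does not exist is a no-op.
        self.rows.pop(pk, None)


test = ToyTable()
test.upsert(1, name="andy")
test.upsert(2, name="toms")
test.upsert(1, name="andrew")  # same key: the existing row is updated, not duplicated
test.delete(4)                 # no row with this key, nothing happens
print(test.rows)
```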

Table Mapping

Create a table in HBase:

create 'teacher','info','contact'

Insert data:

put 'teacher','1001','info:name','Jack'
put 'teacher','1001','info:age','28'
put 'teacher','1001','info:gender','male'
put 'teacher','1001','contact:address','shanghai'
put 'teacher','1001','contact:phone','13458646987'

put 'teacher','1002','info:name','Jim'
put 'teacher','1002','info:age','30'
put 'teacher','1002','info:gender','male'
put 'teacher','1002','contact:address','tianjian'
put 'teacher','1002','contact:phone','13512436987'

Create the mapping view in Phoenix:

create view "teacher"(
"ROW" varchar primary key,
"contact"."address" varchar,
"contact"."phone" varchar,
"info"."age"  varchar,
"info"."gender" varchar,
"info"."name" varchar
);

Query the data in Phoenix:

select * from "teacher";
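Every identifier in the view is wrapped in double quotes because Phoenix folds unquoted names to upper case, while the HBase table, column families and qualifiers here are lower case. As a sketch, the DDL can be generated from a description of the HBase table; create_view_ddl is a hypothetical helper, and it types every column as varchar just as the view above does.

```python
def create_view_ddl(table, families, row_key_alias="ROW"):
    """Generate a Phoenix CREATE VIEW statement for an existing HBase table."""
    lines = [f'"{row_key_alias}" varchar primary key']
    for family in sorted(families):
        for qualifier in sorted(families[family]):
            # Quote both parts so Phoenix keeps the lower-case names.
            lines.append(f'"{family}"."{qualifier}" varchar')
    body = ",\n".join(lines)
    return f'create view "{table}"(\n{body}\n);'


# Column families and qualifiers written by the put commands above.
teacher = {
    "info": ["name", "age", "gender"],
    "contact": ["address", "phone"],
}
ddl = create_view_ddl("teacher", teacher)
print(ddl)
```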

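Node Management

Commissioning

When a RegionServer starts, it registers with the HMaster and begins serving regions locally. A newly added node holds no data at first; if the balancer is enabled, regions are then moved onto the new RegionServer. If the processes are started and stopped over ssh with the HBase scripts, the new node's hostname must also be added to the conf/regionservers file.

$ ./bin/hbase-daemon.sh start regionserver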
hbase(main)>balance_switch true
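The effect of enabling the balancer after a new RegionServer joins can be pictured with a toy model: regions move from the most loaded servers to the empty one until the counts are roughly even. This is only an illustration under simplified assumptions; HBase's real balancer weighs more factors than region count, and the server and region names below are made up.

```python
def rebalance(regions_by_server):
    """Move regions from the busiest to the idlest server until counts differ by at most one."""
    servers = {name: list(regions) for name, regions in regions_by_server.items()}
    moves = []
    while True:
        busiest = max(servers, key=lambda s: len(servers[s]))
        idlest = min(servers, key=lambda s: len(servers[s]))
        if len(servers[busiest]) - len(servers[idlest]) <= 1:
            return servers, moves
        region = servers[busiest].pop()
        servers[idlest].append(region)
        moves.append((region, busiest, idlest))


# rs3 has just been started and holds no regions yet (hypothetical cluster).
cluster = {"rs1": ["r1", "r2", "r3"], "rs2": ["r4", "r5", "r6"], "rs3": []}
balanced, moves = rebalance(cluster)
for region, source, target in moves:
    print(f"move {region}: {source} -> {target}")
print({name: len(regions) for name, regions in balanced.items()})
```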

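Decommissioning

As the name suggests, decommissioning removes a RegionServer from the running HBase cluster.

Before 0.90.2, the only option was to carry out the following steps by hand:

Stop the load balancer

hbase> balance_switch false

Stop the RegionServer on the node being decommissioned

[vinx@hadoop001 hbase-1.3.1]$ hbase-daemon.sh stop regionserver

Once the RegionServer stops, it closes all the regions it was serving
The RegionServer's node in ZooKeeper disappears
The Master detects that the RegionServer has gone offline; turn the balancer back on

hbase> balance_switch true

The regions of the offline RegionServer are reassigned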

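The major drawback of this approach is that the regions on that node stay offline for a long time. Regions are closed sequentially, so on a RegionServer with many regions the first region to close is not back online until the last one has been closed and reassigned. If each assignment takes about 4 s, a server hosting around 1,800 regions, for example, needs roughly 2 hours for assignment alone. This traditional shutdown method is therefore slow, and the affected regions are unavailable while it runs.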
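The time estimate is simple arithmetic: assignments happen one after another, so the total is the region count multiplied by the time per assignment. The sketch below reproduces the 2-hour figure; the count of 1,800 regions is an assumption chosen to match it, and assignment_hours is a hypothetical helper.

```python
def assignment_hours(region_count, seconds_per_region=4.0):
    """Total time, in hours, to assign regions one after another."""
    return region_count * seconds_per_region / 3600.0


# 1,800 regions at 4 s each is 7,200 s.
print(assignment_hours(1800))  # 2.0
```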

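Alternative: since 0.90.2, HBase provides a graceful_stop script, which only needs to be run on the HBase Master node

$ bin/graceful_stop.sh <RegionServer-hostname>

The script turns off the load balancer automatically, moves the regions off the RegionServer, and then shuts the node down. While it runs you can follow its progress: how many regions have been reassigned, how many remain, and how long each assignment took.

Turn the load balancer back on

hbase> balance_switch true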
