Hive实现WordCount程序

Hive实现WordCount

Hive的底层执行引擎是MapReduce，既然MR程序可以实现单词统计，那么Hive也可以，而且相对于MR编程实现来说更容易，只需要编写SQL语句即可。

数据准备

创建数据库：

create database wordcount;

切换至wordcount数据库：

use wordcount;

建表：

create external table lines(
line string
)
row format delimited fields terminated by '\n'
stored as textfile
location '/hive_wordcount';

加载数据：

1	load data local inpath '/home/vinx/data/hive_wordcount/words.txt'
2	into table lines;

查看表中数据：

1	select * from lines;
2	+----------------------------------+--+
3	\| lines.line \|
4	+----------------------------------+--+
5	\| hello world \|
6	\| hello hadoop \|
7	\| hello hive \|
8	\| this is hive word count example \|
9	+----------------------------------+--+
10	4 rows selected (0.162 seconds)

拆分单词

使用Hive内置的UDTF函数explode(array)，实现行转列：

-- 先创建表words以存放拆分后的单词
create table words(
word string
)
row format delimited fields terminated by '\t';

-- 切分句子
insert into table words
select explode(split(line, " ")) as word 
from lines;

查看words中拆分后的数据：

select * from words;
+-------------+--+
| words.word  |
+-------------+--+
| hello       |
| world       |
| hello       |
| hadoop      |
| hello       |
| hive        |
| this        |
| is          |
| hive        |
| word        |
| count       |
| example     |
+-------------+--+
12 rows selected (0.128 seconds)

实现WordCount

使用group by实现Hive的wordcount：

1	select word,count(1) as num from words group by word;
2	+----------+------+--+
3	\| word \| num \|
4	+----------+------+--+
5	\| count \| 1 \|
6	\| example \| 1 \|
7	\| hadoop \| 1 \|
8	\| hello \| 3 \|
9	\| hive \| 2 \|
10	\| is \| 1 \|
11	\| this \| 1 \|
12	\| word \| 1 \|
13	\| world \| 1 \|
14	+----------+------+--+
15	9 rows selected (34.373 seconds)

思路和MapReduce中的WordCount是一样的：先拆分成一个个的单词，然后对单词进行分组，最后累加。

上面拆分单词的过程也可以结合lateral view和explode()使用：

1	select word
2	from lines
3	lateral view explode(split(line, " ")) tmp as word;

lateral view一般与UDTF函数（如explode()）结合使用，将UDTF应用于基表的每一行。

1	create external table lines(
2	line string
3	)
4	row format delimited fields terminated by '\n'
5	stored as textfile
6	location '/hive_wordcount';

1	-- 先创建表words以存放拆分后的单词
2	create table words(
3	word string
4	)
5	row format delimited fields terminated by '\t';
6
7	-- 切分句子
8	insert into table words
9	select explode(split(line, " ")) as word
10	from lines;

1	select * from words;
2	+-------------+--+
3	\| words.word \|
4	+-------------+--+
5	\| hello \|
6	\| world \|
7	\| hello \|
8	\| hadoop \|
9	\| hello \|
10	\| hive \|
11	\| this \|
12	\| is \|
13	\| hive \|
14	\| word \|
15	\| count \|
16	\| example \|
17	+-------------+--+
18	12 rows selected (0.128 seconds)