Hive实现WordCount程序

Hive实现WordCount

Hive的底层执行引擎是MapReduce,既然MR程序可以实现单词统计,那么Hive也可以,而且相对于MR编程实现来说更容易,只需要编写SQL语句即可。

数据准备

创建数据库:

create database wordcount;

切换至wordcount数据库:

use wordcount;

建表:

1
create external table lines(
2
line string
3
)
4
row format delimited fields terminated by '\n'
5
stored as textfile
6
location '/hive_wordcount';

加载数据:

1
load data local inpath '/home/vinx/data/hive_wordcount/words.txt'
2
into table lines;

查看表中数据:

1
select * from lines;
2
+----------------------------------+--+
3
|            lines.line            |
4
+----------------------------------+--+
5
| hello world                      |
6
| hello hadoop                     |
7
| hello hive                       |
8
| this is hive word count example  |
9
+----------------------------------+--+
10
4 rows selected (0.162 seconds)

拆分单词

使用Hive内置的UDTF函数explode(array),实现行转列:

1
-- 先创建表words以存放拆分后的单词
2
create table words(
3
word string
4
)
5
row format delimited fields terminated by '\t';
6
7
-- 切分句子
8
insert into table words
9
select explode(split(line, " ")) as word 
10
from lines;

查看words中拆分后的数据:

1
select * from words;
2
+-------------+--+
3
| words.word  |
4
+-------------+--+
5
| hello       |
6
| world       |
7
| hello       |
8
| hadoop      |
9
| hello       |
10
| hive        |
11
| this        |
12
| is          |
13
| hive        |
14
| word        |
15
| count       |
16
| example     |
17
+-------------+--+
18
12 rows selected (0.128 seconds)

实现WordCount

使用group by实现Hive的wordcount:

1
select word,count(1) as num from words group by word;
2
+----------+------+--+
3
|   word   | num  |
4
+----------+------+--+
5
| count    | 1    |
6
| example  | 1    |
7
| hadoop   | 1    |
8
| hello    | 3    |
9
| hive     | 2    |
10
| is       | 1    |
11
| this     | 1    |
12
| word     | 1    |
13
| world    | 1    |
14
+----------+------+--+
15
9 rows selected (34.373 seconds)

思路和MapReduce中的WordCount是一样的:先拆分成一个个的单词,然后对单词进行分组,最后累加。

上面拆分单词的过程也可以结合lateral viewexplode()使用:

1
select word 
2
from lines
3
lateral view explode(split(line, " ")) tmp as word;

lateral view一般与UDTF函数(如explode())结合使用,将UDTF应用于基表的每一行。

Author: VinxC
Link: https://vinxikk.github.io/2018/06/23/hive/hive-wordcount/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.