Hive实现WordCount
Hive的底层执行引擎是MapReduce,既然MR程序可以实现单词统计,那么Hive也可以,而且相对于MR编程实现来说更容易,只需要编写SQL语句即可。
数据准备
创建数据库:
create database wordcount;
切换至wordcount数据库:
use wordcount;
建表:
1 | create external table lines( |
2 | line string |
3 | ) |
4 | row format delimited fields terminated by '\n' |
5 | stored as textfile |
6 | location '/hive_wordcount'; |
加载数据:
1 | load data local inpath '/home/vinx/data/hive_wordcount/words.txt' |
2 | into table lines; |
查看表中数据:
1 | select * from lines; |
2 | +----------------------------------+--+ |
3 | | lines.line | |
4 | +----------------------------------+--+ |
5 | | hello world | |
6 | | hello hadoop | |
7 | | hello hive | |
8 | | this is hive word count example | |
9 | +----------------------------------+--+ |
10 | 4 rows selected (0.162 seconds) |
拆分单词
使用Hive内置的UDTF函数explode(array)
,实现行转列:
1 | -- 先创建表words以存放拆分后的单词 |
2 | create table words( |
3 | word string |
4 | ) |
5 | row format delimited fields terminated by '\t'; |
6 | |
7 | -- 切分句子 |
8 | insert into table words |
9 | select explode(split(line, " ")) as word |
10 | from lines; |
查看words中拆分后的数据:
1 | select * from words; |
2 | +-------------+--+ |
3 | | words.word | |
4 | +-------------+--+ |
5 | | hello | |
6 | | world | |
7 | | hello | |
8 | | hadoop | |
9 | | hello | |
10 | | hive | |
11 | | this | |
12 | | is | |
13 | | hive | |
14 | | word | |
15 | | count | |
16 | | example | |
17 | +-------------+--+ |
18 | 12 rows selected (0.128 seconds) |
实现WordCount
使用group by
实现Hive的wordcount:
1 | select word,count(1) as num from words group by word; |
2 | +----------+------+--+ |
3 | | word | num | |
4 | +----------+------+--+ |
5 | | count | 1 | |
6 | | example | 1 | |
7 | | hadoop | 1 | |
8 | | hello | 3 | |
9 | | hive | 2 | |
10 | | is | 1 | |
11 | | this | 1 | |
12 | | word | 1 | |
13 | | world | 1 | |
14 | +----------+------+--+ |
15 | 9 rows selected (34.373 seconds) |
思路和MapReduce中的WordCount是一样的:先拆分成一个个的单词,然后对单词进行分组,最后累加。
上面拆分单词的过程也可以结合lateral view
和explode()
使用:
1 | select word |
2 | from lines |
3 | lateral view explode(split(line, " ")) tmp as word; |
lateral view
一般与UDTF函数(如explode()
)结合使用,将UDTF应用于基表的每一行。