思路是这样的:
把nginx日志数据清洗后导入avro表,创建分区表etl_nginx_avro,以month(202005)/day(01)进行分区,创建中间表tmp_etl_nginx_avro不分区,中间表比分区表多了两个字段,month和hour(其实是day创建时没留神);
中间表tmp_etl_nginx_avro:
分区表etl_nginx_avro:
自动分区:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table etl_nginx_avro partition(month,day)
select ip,acctime,locations,status,request_body,referer,useragent,osfamily,osname,uafamily,devicetype,month,hour from tmp_etl_nginx_avro;
然后分区字段:
但是分区文件夹变成这样:
中间表数据正常:
>select ip,acctime,locations,status,request_body,referer,useragent,osfamily,osname,uafamily,devicetype,month,hour from tmp_etl_nginx_avro limit 10;
182.254.52.17 20200503 /topics/809 200 - http://hainiubl.com/topics/809 Browser Windows Windows 7 Firefox unknown 202005 03
182.254.52.17 20200503 /topics/124?page%3D1 200 - http://hainiubl.com/topics/124?page%3D1 Browser Windows Windows 7 Firefox unknown 202005 03
自动分区完后 分区表数据正常:
> select * from etl_nginx_avro limit 10;
OK
123.4.226.100 20200501 /topics/44 200 - http://www.hainiubl.com/ Browser Windows Windows Chrome unknown 202005 01
123.4.226.100 20200501 /topics/181 301 - http://www.hainiubl.com/ Browser Windows Windows Chrome unknown 202005 01
123.4.226.100 20200501 /topics/181 200 - http://www.hainiubl.com/ Browser Windows Windows Chrome unknown 202005 01
123.4.226.100 20200501 /topics/182 301 - http://www.hainiubl.com/ Browser Windows Windows Chrome unknown 202005 01
123.4.226.100 20200501 /topics/182 200 - http://www.hainiubl.com/ Browser Windows Windows Chrome unknown 202005 01
就是这个分区名字有点辣眼睛,百思不得其解