Recently a BA user reported that two statements that look almost identical return different row counts. It seemed odd, and they suspected a Hive bug.

Query 1 returns 6071 rows:

select count(distinct reviewid) as dis_reviewcnt
from
(select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24'
where to_date(a.feedadddate) >= '2013-07-01' and a.hp_statdate='2013-07-24'
) a

Query 2 returns 6443 rows:

select count(distinct reviewid) as dis_reviewcnt
from
(select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24' and a.hp_statdate='2013-07-24'
where to_date(a.feedadddate) >= '2013-07-01'
) a

Query 2 returns 372 more rows than Query 1, and those extra reviewids are not present in the '2013-07-24' partition of the subquery's left table.

The only difference between the two statements is where the partition filter on dpods_dp_reviewreport (hp_statdate is a partition column) sits: in Query 1 it is in the WHERE clause, in Query 2 it is in the ON clause.

At first glance the two should return the same data, but the trick lies precisely in the difference between WHERE and ON.

A WHERE clause holds filter conditions. In Query 1, a.hp_statdate='2013-07-24' is applied by the Partition Pruner before the table scan, so only the data under the '2013-07-24' partition gets joined with dpods_dp_reviewlog.
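
One way to double-check this (an addition of mine, assuming a Hive version that supports EXPLAIN DEPENDENCY, i.e. 0.10 or later) is to ask Hive which partitions the subquery actually reads:

explain dependency
select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24'
where to_date(a.feedadddate) >= '2013-07-01' and a.hp_statdate='2013-07-24'

For Query 1's subquery the input_partitions list should contain only the hp_statdate=2013-07-24 partition of dpods_dp_reviewreport, while the same command on Query 2's subquery should list every partition of that table.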

Query 2, in contrast, reads the data under every partition and then joins it with dpods_dp_reviewlog. Because of the join condition, the join is only really performed when a.hp_statdate='2013-07-24'; in every other case, since this is a left outer join, the row is still kept and the right side is padded with NULLs. In effect Query 2 pulls out every reviewid (subject only to the feedadddate filter), which is why its result differs from Query 1.
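
Put another way: a left outer join never drops rows from the left table, and count(distinct ...) collapses any duplicates produced by multiple matches on the right, so Query 2 should be equivalent to scanning the left table with only the feedadddate filter. A sanity-check rewrite of my own (not run against the original data):

select count(distinct reviewid) as dis_reviewcnt
from bi.dpods_dp_reviewreport
where to_date(feedadddate) >= '2013-07-01'

If this also returns 6443, it confirms that the join in Query 2 contributes nothing to the filtering.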

As an experiment, take Query 2, drop a.hp_statdate='2013-07-24' from the ON clause, leave everything else unchanged, and run it; the distinct reviewcnt it produces is also 6443:

select count(distinct reviewid) as dis_reviewcnt
from
(select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24'
where to_date(a.feedadddate) >= '2013-07-01'
) a

Query 1's query plan. Notice that no predicate on a.hp_statdate appears anywhere in it: the Partition Pruner consumed that condition before the table scan, so only the '2013-07-24' partition of dpods_dp_reviewreport is read.

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (and (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01') (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24'))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))
STAGE DEPENDENCIES:
Stage-5 is a root stage , consists of Stage-1
Stage-1
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-5
Conditional Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Filter Operator
predicate:
expr: (to_date(feedadddate) >= '2013-07-01')
type: boolean
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 0
value expressions:
expr: feedadddate
type: string
expr: reviewid
type: int
expr: hp_statdate
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col5} {VALUE._col8} {VALUE._col17}
1
handleSkewJoin: false
outputColumnNames: _col5, _col8, _col17
Select Operator
expressions:
expr: _col8
type: int
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: int
outputColumnNames: _col0
Group By Operator
aggregations:
expr: count(DISTINCT _col0)
bucketGroup: false
keys:
expr: _col0
type: int
mode: hash
outputColumnNames: _col0, _col1
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col0:0._col0)
bucketGroup: false
mode: mergepartial
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: bigint
outputColumnNames: _col0
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1

Query 2's query plan. Here a.hp_statdate='2013-07-24' shows up only as a residual condition on the Join Operator (the filter predicates: 0 {(VALUE._col17 = '2013-07-24')} entry), not as a partition filter, so every partition of dpods_dp_reviewreport is scanned.

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')) (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01')))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))
STAGE DEPENDENCIES:
Stage-5 is a root stage , consists of Stage-1
Stage-1
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-5
Conditional Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Filter Operator
predicate:
expr: (to_date(feedadddate) >= '2013-07-01')
type: boolean
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 0
value expressions:
expr: feedadddate
type: string
expr: reviewid
type: int
expr: hp_statdate
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col5} {VALUE._col8}
1
filter predicates:
0 {(VALUE._col17 = '2013-07-24')}
1
handleSkewJoin: false
outputColumnNames: _col5, _col8
Select Operator
expressions:
expr: _col8
type: int
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: int
outputColumnNames: _col0
Group By Operator
aggregations:
expr: count(DISTINCT _col0)
bucketGroup: false
keys:
expr: _col0
type: int
mode: hash
outputColumnNames: _col0, _col1
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-13-32_879_3623450294049807419/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col0:0._col0)
bucketGroup: false
mode: mergepartial
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: bigint
outputColumnNames: _col0
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
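
A closing note of my own, not from the original post: with a LEFT OUTER JOIN, a filter meant to restrict the left table belongs in WHERE, while a filter that should only restrict which right-side rows are allowed to match belongs in ON. Moving b.hp_statdate='2013-07-24' into WHERE, for instance, would discard the NULL-padded rows and effectively turn the statement into an inner join:

select count(distinct reviewid) as dis_reviewcnt
from
(select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid
where to_date(a.feedadddate) >= '2013-07-01' and a.hp_statdate='2013-07-24' and b.hp_statdate='2013-07-24'
) a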

Reference:

http://blog.sina.com.cn/s/blog_6ff05a2c01010oxp.html
