Reference: http://shiyanjun.cn/archives/588.html
While running SQL on Hive, one query took over an hour to finish. Inspecting it showed that after the LEFT JOIN, the ON clause held only the join key, while every other filter condition sat in the WHERE clause. After moving those WHERE conditions into the ON clause, the query finished in under half an hour. With the conditions in ON, the partitions are pruned before the join runs, leaving only 12 splits:
INFO : number of splits:12
With the conditions in WHERE, the partition filter is applied only after the join, so the input fans out to 94 splits and the job needs 94 map tasks and 412 reduce tasks, consuming an enormous amount of resources:
INFO : number of splits:94
Original HQL:
Optimized HQL:
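The queries themselves are not reproduced here; below is a minimal sketch of the two shapes, using hypothetical table names (orders, dim_shop) and a hypothetical partition column (dt):

-- Original shape (hypothetical names): partition filter in WHERE,
-- evaluated only after the join has read every partition of dim_shop
SELECT a.order_id, b.shop_name
FROM orders a
LEFT JOIN dim_shop b ON a.shop_id = b.shop_id
WHERE b.dt = '2017-10-25';

-- Optimized shape: partition filter moved into ON,
-- so dim_shop is pruned to a single partition before the join
SELECT a.order_id, b.shop_name
FROM orders a
LEFT JOIN dim_shop b
  ON a.shop_id = b.shop_id AND b.dt = '2017-10-25';

One caveat: for a LEFT JOIN these two shapes are not always row-for-row equivalent. The WHERE form drops left-side rows with no match in the filtered partition, while the ON form keeps them with NULL right-side columns, so confirm which semantics the query actually needs before rewriting.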
The execution plan confirms this: when the partition column is filtered in the ON clause, the plan prunes the partitions already at the table scan; when the filter sits in the WHERE clause, nothing is pruned during the scan, and the input data volume is huge!
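One way to check this, assuming the query text is at hand, is to run EXPLAIN on both variants; EXPLAIN DEPENDENCY in particular lists the exact input partitions a query would read:

-- Hypothetical query; compare the listed input_partitions for the two variants
EXPLAIN DEPENDENCY
SELECT a.order_id, b.shop_name
FROM orders a
LEFT JOIN dim_shop b
  ON a.shop_id = b.shop_id AND b.dt = '2017-10-25';

With the filter in ON, only the dt='2017-10-25' partition of dim_shop should be listed; with the filter in WHERE, every partition appears.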
------------------------------------ Execution log:
INFO : Query ID = hive_20171026031919_9b1b59f2-b867-466a-95a5-06b5be6daafa
INFO : Total jobs = 5
INFO : Starting task [Stage-12:MAPREDLOCAL] in serial mode
INFO : Execution completed successfully
INFO : MapredLocal task succeeded
INFO : Launching Job 1 out of 5
INFO : Starting task [Stage-10:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:12
INFO : Submitting tokens for job: job_1508725411753_0155
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: 10.1.2.94:8020, Ident: (HDFS_DELEGATION_TOKEN token 999 for hive)
INFO : The url to track the job: http://host:8088/proxy/application_1508725411753_0155/
INFO : Starting Job = job_1508725411753_0155, Tracking URL = http://host:8088/proxy/application_1508725411753_0155/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/bin/hadoop job -kill job_1508725411753_0155
INFO : Hadoop job information for Stage-10: number of mappers: 12; number of reducers: 0
.........
INFO : 2017-10-26 03:21:44,169 Stage-10 map = 100%, reduce = 0%, Cumulative CPU 176.25 sec
INFO : MapReduce Total cumulative CPU time: 2 minutes 56 seconds 250 msec
INFO : Ended Job = job_1508725411753_0155
INFO : Starting task [Stage-9:CONDITIONAL] in serial mode
INFO : Stage-11 is filtered out by condition resolver.
INFO : Stage-2 is selected by condition resolver.
INFO : Launching Job 2 out of 5
INFO : Starting task [Stage-2:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 12
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:4
INFO : Submitting tokens for job: job_1508725411753_0156
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: 10.1.2.94:8020, Ident: (HDFS_DELEGATION_TOKEN token 1000 for hive)
INFO : The url to track the job: http://host:8088/proxy/application_1508725411753_0156/
INFO : Starting Job = job_1508725411753_0156, Tracking URL = http://host:8088/proxy/application_1508725411753_0156/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/bin/hadoop job -kill job_1508725411753_0156
INFO : Hadoop job information for Stage-2: number of mappers: 4; number of reducers: 12
INFO : 2017-10-26 03:21:56,128 Stage-2 map = 0%, reduce = 0%
.........
INFO : 2017-10-26 03:27:43,452 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 981.83 sec
INFO : MapReduce Total cumulative CPU time: 16 minutes 21 seconds 830 msec
INFO : Ended Job = job_1508725411753_0156
INFO : Launching Job 3 out of 5
INFO : Starting task [Stage-3:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1508725411753_0158
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: 10.1.2.94:8020, Ident: (HDFS_DELEGATION_TOKEN token 1002 for hive)
INFO : The url to track the job: http://host:8088/proxy/application_1508725411753_0158/
INFO : Starting Job = job_1508725411753_0158, Tracking URL = http://host:8088/proxy/application_1508725411753_0158/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/bin/hadoop job -kill job_1508725411753_0158
INFO : Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
INFO : 2017-10-26 03:28:00,533 Stage-3 map = 0%, reduce = 0%
INFO : 2017-10-26 03:28:13,143 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 6.57 sec
INFO : 2017-10-26 03:28:28,700 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 15.46 sec
INFO : MapReduce Total cumulative CPU time: 15 seconds 460 msec
INFO : Ended Job = job_1508725411753_0158
INFO : Launching Job 4 out of 5
INFO : Starting task [Stage-4:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1508725411753_0159
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: 10.1.2.94:8020, Ident: (HDFS_DELEGATION_TOKEN token 1004 for hive)
INFO : The url to track the job: http://host:8088/proxy/application_1508725411753_0159/
INFO : Starting Job = job_1508725411753_0159, Tracking URL = http://host:8088/proxy/application_1508725411753_0159/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/bin/hadoop job -kill job_1508725411753_0159
INFO : Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 1
INFO : 2017-10-26 03:28:42,569 Stage-4 map = 0%, reduce = 0%
INFO : 2017-10-26 03:28:56,012 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 7.64 sec
INFO : 2017-10-26 03:29:09,498 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 13.13 sec
INFO : MapReduce Total cumulative CPU time: 13 seconds 130 msec
INFO : Ended Job = job_1508725411753_0159
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table otl.oel_test from hdfs://test94.eformax.com:8020/user/hive/warehouse/otl.db/oel_test/.hive-staging_hive_2017-10-26_03-19-40_941_4069260026444954348-5/-ext-10000
INFO : Starting task [Stage-5:STATS] in serial mode
INFO : Table otl.oel_test stats: [numFiles=1, numRows=188992, totalSize=15418476, rawDataSize=15229484]
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-10: Map: 12   Cumulative CPU: 176.25 sec   HDFS Read: 3308304990   HDFS Write: 500780832   SUCCESS
INFO : Stage-Stage-2: Map: 4   Reduce: 12   Cumulative CPU: 981.83 sec   HDFS Read: 802041177   HDFS Write: 15495161   SUCCESS
INFO : Stage-Stage-3: Map: 1   Reduce: 1   Cumulative CPU: 15.46 sec   HDFS Read: 15527532   HDFS Write: 17183487   SUCCESS
INFO : Stage-Stage-4: Map: 1   Reduce: 1   Cumulative CPU: 13.13 sec   HDFS Read: 17193170   HDFS Write: 15418554   SUCCESS
INFO : Total MapReduce CPU Time Spent: 19 minutes 46 seconds 670 msec
INFO : Completed executing command(queryId=hive_20171026031919_9b1b59f2-b867-466a-95a5-06b5be6daafa); Time taken: 565.401 seconds
INFO : OK
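The reducer hints in the log above are standard Hive output. If the estimated reducer count ever needs tuning, the three settings the log names can be set per session; a small sketch with illustrative values only (the right numbers depend on cluster size and data volume):

-- Illustrative values only, set before running the query
SET hive.exec.reducers.bytes.per.reducer=268435456;  -- target ~256 MB of input per reducer
SET hive.exec.reducers.max=99;                       -- cap the reducer count
SET mapreduce.job.reduces=12;                        -- or pin an exact count, overriding the estimate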