I have a Hive table stored in Snappy-compressed Avro format (42 GB):
CREATE EXTERNAL TABLE avday_ind
PARTITIONED BY (CALDAY string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/hive/target/calday'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schema.avsc');
This query takes 37-40 minutes to run:
select tggrp_id, calday, part
from avday_ind
where tggrp_id in (
'3640B59A1A7F1ED3B18F5FC67E8C6E4D',
'3640B59A1A7F1ED3B18F5FC67E8C8E4D',
….
….
'3640B59A1A7F1ED3B18F6F68D8D94Q22');
I created an index on the table:
CREATE INDEX index_avday ON TABLE avday_ind (tggrp_id)
AS 'compact' WITH DEFERRED REBUILD;
ALTER INDEX index_avday
ON avday_ind REBUILD;
But the query still takes 37-40 minutes. I expected the index to speed it up. Why didn't that happen?
Hive 0.14
hive.optimize.index.filter=true
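To rule out a setup problem, the setting and the index itself can be checked in the same session. This is only a sketch: hive.optimize.index.filter.compact.minsize is a standard Hive property, but lowering it to 0 is an assumption for testing and is not something that was applied in the runs described here.

```sql
-- Enable automatic use of indexes for this session
SET hive.optimize.index.filter=true;
-- Assumption: lower the minimum input size for automatic compact-index use
-- (the default is about 5 GB) so the size threshold cannot be the reason
SET hive.optimize.index.filter.compact.minsize=0;
-- Confirm that the index exists and see which index table backs it
SHOW FORMATTED INDEX ON avday_ind;
```

If the index were picked up, the EXPLAIN output would be expected to contain an extra stage that reads the index table before the scan of avday_ind.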
Explain plan:
Stage: Stage-1
Tez
DagName: hive_20150813112929_1dd931ab-1e2f-4cc7-b300-87c944f58e51:100
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: avday_ind
filterExpr: (tggrp_id) IN ('3640B59A1A7F1ED3B18F5FC67E8C6E4D', '3640B59A1A7F1ED3B18F5FC67E8C8E4D', '3640B59A1A7F1ED3B18F655C32B30E5D', '3640B59A1A7F1ED3B18F655C32B32E5D',
….
….
'40F2E92F56741EE4AB8A28426F650A62', '40F2E92F56741EE4AB8A289F63A28A62', '005056BF5C2E1EE4BE936FA9CA519688', '005056BF5C2E1EE4BE936FCE51989688') (type: boolean)
Statistics: Num rows: 211287050 Data size: 42257427484 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (tggrp_id) IN ('3640B59A1A7F1ED3B18F5FC67E8C6E4D', '3640B59A1A7F1ED3B18F5FC67E8C8E4D', '3640B59A1A7F1ED3B18F655C32B30E5D', '3640B59A1A7F1ED3B18F655C32B32E5D',
…
…
'005056BF5C2E1EE4BE936FA9CA519688', '005056BF5C2E1EE4BE936FCE51989688') (type: boolean)
Statistics: Num rows: 105643525 Data size: 19438408600 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: tggrp_id (type: string), calday (type: string), bpartner (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 105643525 Data size: 19438408600 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 105643525 Data size: 19438408600 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink