Channel: Hortonworks » All Replies

Index optimization

Hi. I am trying to speed up a query using indexing. The query maps an IPv4 address to a country using a geolocation table. However, the geolocation table is fairly large and the query takes an impractically long time. I have created an index on the StartIp and EndIp columns and set the binary search property (hive.index.compact.binary.search) to true, which is the default, but the query is no faster. Note that I am using Tez as the execution engine.

Here is how I set up indexing:

set hive.optimize.index.filter=true;

DROP INDEX IF EXISTS ipv4indexes ON ipv4geotable;
CREATE INDEX ipv4indexes
ON TABLE ipv4geotable (StartIp, EndIp)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES ('hive.index.compact.binary.search'='true');
ALTER INDEX ipv4indexes ON ipv4geotable REBUILD;

And here is my query:

DROP TABLE IF EXISTS ipv4table;
CREATE TABLE ipv4table AS
SELECT logon.IP, ipv4.Country
FROM
(SELECT * FROM logontable WHERE isIpv4(IP)) logon
LEFT OUTER JOIN
(SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP)
WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;

The query takes each IP from logontable and finds the range in the geolocation table (which is sorted) that contains it. When a range is found, the corresponding country is returned. I suspect that Hive scans the whole table row by row rather than performing a smarter search (e.g., binary search).
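
To make the kind of lookup I have in mind concrete, here is a minimal sketch written outside Hive, in Python. It is only an illustration of a binary search over sorted, non-overlapping ranges, not how Hive or the compact index works internally. The sample rows, the names RANGES, STARTS and lookup_country, and the assumption that addresses are compared as integers are all hypothetical.

```python
import bisect
import ipaddress


def to_int(ip):
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))


# Hypothetical sample of the geolocation table, sorted by start address.
# Each row is (StartIp, EndIp, Country); the values are made up.
RANGES = [
    (to_int("1.0.0.0"), to_int("1.0.0.255"), "AU"),
    (to_int("1.0.1.0"), to_int("1.0.3.255"), "CN"),
    (to_int("8.8.8.0"), to_int("8.8.8.255"), "US"),
]

# Start addresses only, kept in the same order, for the binary search.
STARTS = [row[0] for row in RANGES]


def lookup_country(ip):
    """Return the country whose range contains ip, or None if none does."""
    value = to_int(ip)
    # Index of the last range whose start address is <= value.
    i = bisect.bisect_right(STARTS, value) - 1
    if i >= 0 and value <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

Each lookup here costs O(log n) comparisons against the sorted ranges, instead of comparing the IP against every row as I suspect the join above does.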

Any suggestions on how to speed things up? Thanks!

