Channel: Hortonworks » All Replies

Simple Hive query becoming slow with a lot of data

Hello,
I'm new to Hive and I'm wondering about the performance of what I would call a simple Hive query. I'm using Ambari 1.6.1 and the 2.1 stack, so not an old Hive.

I have two big data folders (15 GB and 100 GB). Both contain the same type of files, and I use exactly the same tables and queries on each. My query looks like this:

select id, count(time), avg(speed) from noexternal group by id;

So I have a count, an avg calculation and a group by. There are multiple entries for every id, and I count the time entries and average the speed, as you can see. I also implemented this with MapReduce using a Combiner, since the output is really small compared to the input (around 50 MB). When I query the 15 GB folder, Hive runs pretty fast, in about the same time as my MapReduce algorithm with the Combiner.

But with the bigger folder it takes much longer. My MapReduce algorithm needs about 12 minutes, but Hive needs about 45 minutes. I can see that no combiner is used, since the reduce phase takes a long time and Hive spawns a lot of reducers. Now I'm wondering: isn't Hive supposed to use something like a combiner? And how can the small query be so fast compared to MR and the big one so slow?
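One way to check the combiner question might be to look at Hive's map-side aggregation, which as far as I understand plays roughly the role of a combiner for group by queries. The following is only a sketch: it assumes the documented Hive property `hive.map.aggr` is the relevant switch, and I have not verified what it is set to on this cluster.

```sql
-- Print the current value of the map-side aggregation switch.
-- (SET with only a property name shows the value; it does not change it.)
set hive.map.aggr;

-- Show the plan for the same query. With map-side aggregation enabled,
-- the map stage should list a Group By Operator in hash mode before the shuffle.
explain
select id, count(time), avg(speed) from noexternal group by id;
```

If the property prints as false, or the map stage of the plan shows no group by operator, that would fit the long reduce phase I see on the 100 GB folder, but this is a guess and not a confirmed cause.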

I'm also wondering: is there something like an automatic index that works with the small data set but not with the big one?

It would be very nice if someone knows more about the Combiner / Hive thing and the automatic index, or what the reason for this is.

