Since Hive processes the data in parallel, by default the file is split at size boundaries (which depend on the block size in your cluster). Lines with the same value can therefore end up in different map tasks, and your logic will not work in that case.
Secondly, the WHERE clause is usually evaluated before SORT BY or ORDER BY, so the filter sees the rows before they have been sorted.
Would a GROUP BY clause or a windowing function meet your needs?
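To make the first point concrete, here is a rough toy model in Python. It is not how Hive actually computes input splits (those are byte ranges tied to the HDFS block size); `split_lines` and the sample data are made up for illustration. It only shows that size-based splitting can put rows with the same value into different tasks, and that an aggregate merged by key (which is roughly what a GROUP BY gives you) does not depend on where the splits fall.

```python
from collections import Counter

def split_lines(lines, split_size):
    """Cut lines into chunks of roughly split_size bytes.

    Hypothetical stand-in for input splits: the cut depends only on size,
    never on the values in the lines.
    """
    splits, current, used = [], [], 0
    for line in lines:
        current.append(line)
        used += len(line) + 1  # +1 for the newline
        if used >= split_size:
            splits.append(current)
            current, used = [], 0
    if current:
        splits.append(current)
    return splits

# Sample data: the value "a" appears on three consecutive lines.
lines = ["a", "a", "a", "b", "b", "c"]
splits = split_lines(lines, split_size=4)

# Each "map task" only sees its own split, so any per-task state
# (for example, a counter kept inside a UDF) sees a partial picture.
per_split_counts = [Counter(s) for s in splits]

# Merging by key afterwards gives the same totals no matter where
# the split boundaries fell.
merged = sum(per_split_counts, Counter())

for i, s in enumerate(splits):
    print("split", i, s)
print("merged", dict(merged))
```

Here the value "a" is seen by two different splits, so logic that assumes all rows with one value pass through the same task would give a wrong answer, while the merged counts are still correct.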
Reply To: Hive UDF function request problem