Purely out of curiosity: we observed that single-row access is slower by roughly two orders of magnitude when an HBase table is accessed through the Hive SQL interface rather than through the HBase client directly.
We have a simple three-column HBase table whose keys are the integers between one and one million. We wrote two classes:

- Hive: a class that got a connection for each of ten threads
    - each thread reused its own connection to get random key values
    - averages ranged between 200 and 500 milliseconds
- HBase: a class that created an instance of HTable
    - using ten threads, it made get() calls for random keys between one and one million
    - averages ranged between one and five milliseconds
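
For anyone who wants to reproduce the comparison, here is a minimal sketch of the kind of timing harness described above. It is not our original test code: the class name LookupBenchmark, the Result type, and the run() signature are all hypothetical, and the lookup passed in main() is only an in-memory stand-in. In a real run you would replace it with either a Hive JDBC query on a per-thread connection or an HTable get() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.IntFunction;

/** Hypothetical harness: times random single-key lookups across several threads. */
class LookupBenchmark {

    /** Total number of lookups and the mean latency per lookup in milliseconds. */
    public static final class Result {
        public final long lookups;
        public final double meanMillis;

        Result(long lookups, double meanMillis) {
            this.lookups = lookups;
            this.meanMillis = meanMillis;
        }
    }

    /**
     * Runs lookupsPerThread lookups on each of `threads` threads, using random
     * keys in [1, maxKey], and returns the mean latency over all lookups.
     * The lookup function is an assumption: plug in a Hive or HBase call here.
     */
    public static Result run(int threads, int lookupsPerThread, int maxKey,
                             IntFunction<?> lookup) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Each task returns the total nanoseconds it spent inside lookups.
            Callable<Long> task = () -> {
                long totalNanos = 0;
                for (int i = 0; i < lookupsPerThread; i++) {
                    int key = ThreadLocalRandom.current().nextInt(1, maxKey + 1);
                    long start = System.nanoTime();
                    lookup.apply(key);
                    totalNanos += System.nanoTime() - start;
                }
                return totalNanos;
            };

            List<Future<Long>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(task));
            }

            long totalNanos = 0;
            for (Future<Long> f : futures) {
                totalNanos += f.get();
            }
            long lookups = (long) threads * lookupsPerThread;
            return new Result(lookups, totalNanos / 1e6 / lookups);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in lookup; swap in the Hive JDBC or HTable call under test.
        Result r = run(10, 1000, 1_000_000, key -> "row-" + key);
        System.out.printf("%d lookups, mean %.4f ms%n", r.lookups, r.meanMillis);
    }
}
```

Timing only the lookup call itself, and not connection setup, keeps the two variants comparable, since both of our classes created their connection or HTable instance once and then reused it.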
As I said, out of curiosity: is the Hive overhead really that much higher for a single row? We used the :key token when mapping the Hive table to HBase, so I assumed a query on that column would use the row key. However, an explain on a query suggests it does not:
    0: jdbc:hive2://localhost:10000> explain select * from foo where rowkey = 1;

    STAGE DEPENDENCIES:
      Stage-0 is a root stage

    STAGE PLANS:
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            TableScan
              alias: foo
              filterExpr: (rowkey = 1) (type: boolean)
              Statistics: Num rows: 1000001 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Filter Operator
                predicate: (rowkey = 1) (type: boolean)
                Statistics: Num rows: 500000 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator

Is the Hive SerDe not "smart" enough to know this is key-based access?