Hi everyone,
I would like to know whether a problem I am facing with Apache Hive in Azure HDInsight also occurs in Hortonworks. The explanation follows:
The size of a table doubles if I load the data with INSERT OVERWRITE rather than LOAD. More specifically:
I created a table "item" and loaded the data from item.dat (approx. 28MB). The file item.dat is then moved to hive/warehouse and, of course, its size remains the same.
Now if I create another table "item2" with the same definition as item and then load the data from item into item2 with the following command:
INSERT OVERWRITE TABLE item2 SELECT * FROM item
the size of table item2 is double that of item (approx. 55MB).
Why does this happen? And is there any way to avoid it?
The problem gets worse as the size of the data grows.
PS: this is only to illustrate the problem. In practice I am interested in pre-joining tables, but INSERT OVERWRITE increases the size of the joined table drastically (actual problem: 4GB joined with 28MB gives 18GB).
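In case it helps to reproduce, the steps above can be sketched roughly as follows. The column list, the field delimiter and the input path are placeholders made up for illustration, not the actual schema or location, and whether DESCRIBE FORMATTED reports a size depends on table statistics being available.

```sql
-- Hypothetical schema and delimiter: the real definition of item is not shown here.
CREATE TABLE item (
  i_item_sk   BIGINT,
  i_item_desc STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

-- Placeholder path. LOAD DATA INPATH moves the file into the warehouse
-- directory as-is, so its size does not change.
LOAD DATA INPATH '/path/to/item.dat' INTO TABLE item;

-- Second table with the same definition as item.
CREATE TABLE item2 LIKE item;

-- Rewrites the rows through a query instead of moving the file.
INSERT OVERWRITE TABLE item2 SELECT * FROM item;

-- Compare storage format and (if statistics are present) totalSize.
DESCRIBE FORMATTED item;
DESCRIBE FORMATTED item2;
```

Comparing the storage format, SerDe and table parameters in the two DESCRIBE FORMATTED outputs should at least show whether the two tables are really stored the same way.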
Thank you!