Hi
I am seeing a strange behavior on my cluster…
I launched a Hive task that failed with the following error:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hive/XXXXXXXXXXXXXXXXXXX could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.
After investigation, it seems my task created temporary files on HDFS that are not correctly freed.
On the Ambari dashboard and in the dfsadmin report, I see a huge volume of non-DFS space used. For instance:
Name: 192.168.3.37:50010 (hadoop-05.XXXXXXl)
Hostname: hadoop-05.XXXXXX
Decommission Status : Normal
Configured Capacity: 158397865984 (147.52 GB)
DFS Used: 19235680256 (17.91 GB)
Non DFS Used: 58176096493 (54.18 GB)
DFS Remaining: 80986089235 (75.42 GB)
DFS Used%: 12.14%
DFS Remaining%: 51.13%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 3
Last contact: Tue Mar 24 08:47:34 CET 2015
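If I am not mistaken, the DataNode does not measure non-DFS usage directly: the report derives it as Configured Capacity - DFS Used - DFS Remaining, which is exactly what I get here:

158397865984 - 19235680256 - 80986089235 = 58176096493 (54.18 GB)

So any space the DataNode can no longer account for on the volume ends up being reported as non-DFS.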
… but I cannot find this non-DFS space on the system itself (same with du, df, …):
[root@hadoop-05 hadoop]# df -h /data/
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 148G 19G 122G 14% /data
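For completeness, this is the kind of du check I ran on the DataNode data directory (the path below is just the default HDP layout relocated under my /data mount, adjust if yours differs):

[root@hadoop-05 hadoop]# du -sh /data/hadoop/hdfs/data

and it does not show anything close to the 54 GB reported as non-DFS either.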
If I restart the DataNode, the non-DFS space is freed this time… but if I run my task again, the same error and symptoms come back.
This reminds me of a behavior I have seen in the past on some filesystems when an open file was deleted… "df" and "du" were then returning different values.
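In case it helps, this is the kind of check I was planning to run to verify that hypothesis (assuming lsof is available on the node; the "proc_datanode" pattern is just how the daemon usually shows up in ps on my HDP install, adjust if needed):

[root@hadoop-05 hadoop]# lsof +aL1 /data
[root@hadoop-05 hadoop]# lsof -p $(pgrep -f proc_datanode) | grep -i deleted

The first command lists open-but-deleted files (link count 0) on the /data filesystem, the second one looks only at what the DataNode process keeps open. If deleted-but-still-open files show up there, that would explain both the non-DFS figure and why restarting the DataNode frees the space.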
Any idea?