Hi Tarjo,
I suppose that your cluster is made of only 2 datanodes…
If you write files with repfactor = 2, HDFS will always try to place the replicas on distinct nodes. So it is normal that the used space on your 2 datanodes is almost the same…
The same will happen if you did not set a special value (repfactor defaults to 3).
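If you want to verify which replication factor actually applies in your setup, these standard HDFS commands should show it (the exact output format varies a bit by Hadoop version, and the file path is just a placeholder):

```shell
# Show the configured default replication factor (dfs.replication)
hdfs getconf -confKey dfs.replication

# Check the replication factor actually applied to an existing file:
# in the listing, the number right after the permissions column is
# the replication factor of that file
hdfs dfs -ls /path/to/your/file
```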
Even if you rebalance data (using the HDFS balancer), it will not change your situation, as the distribution here is driven by the block placement policy that secures the data…
When node2 is full (there is a threshold… I think a datanode is considered full when around 80% is reached), your cluster will probably continue to work but with under-replicated blocks (1 replica instead of 2, or 3 with the default). I did not test this and you should check this point. Without replication, you will have no safety net if node1 fails (but judging from your config, I don't think it is a production cluster).
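To check this point yourself, fsck reports under-replicated blocks and dfsadmin shows per-datanode usage (both are standard HDFS commands; run them as a user with HDFS access):

```shell
# Filesystem health summary: look at the
# "Under-replicated blocks" counter in the output
hdfs fsck /

# Per-datanode capacity, DFS used and remaining space,
# to see how close node2 is to being full
hdfs dfsadmin -report
```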
I think your only chance of balancing data the way you want is to set repfactor to 1 for all files and then run a rebalance… But it is very risky, as your data will no longer be protected by replication.
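For what it's worth, that would look something like the following (again: risky, and the threshold value is just an example — it is the allowed deviation, in percent, of each datanode's utilization from the cluster average):

```shell
# Set replication to 1 for everything under / (applies to all files
# under the directory; -w waits until replication actually completes)
hdfs dfs -setrep -w 1 /

# Then run the balancer to spread the now single-replica blocks
hdfs balancer -threshold 10
```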
Even with repfactor = 1, your data could end up unbalanced… HDFS prioritizes writing to local disks: if all your data is generated from node1, all blocks will be located on node1 and none on node2 (but balancing after the fact is possible in that case).
Regards