Hypothetical situation: suppose I have a four-node Hadoop cluster – one NameNode and three DataNodes.
I load a 3 GB data file, and my understanding is that HDFS will split it into blocks and spread them across the three DataNodes – say roughly 1 GB per node (assume a replication factor of 1, so no node holds a redundant copy of another node's blocks).
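As a sanity check on that assumption, this is roughly how I'd inspect where the blocks of the file actually landed, using the Hadoop FileSystem API (the path /data/input.dat is made up for the example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the classpath config points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/input.dat"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block, including which DataNodes host it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```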
Some time later, suppose I start getting corruption errors and am able to narrow the corruption down to node 3 – i.e., to that node's share of the file I loaded in the first step.
Suppose also that I don't have another copy of that file anywhere.
How do I recover just that part of the data – the ~1 GB on node 3?
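For what it's worth, this is how I've been identifying which files have corrupt blocks – a minimal sketch, the programmatic equivalent of running `hdfs fsck / -list-corruptfileblocks`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListCorrupt {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Lists paths of files that have corrupt (unrecoverable) blocks under "/".
        RemoteIterator<Path> corrupt = fs.listCorruptFileBlocks(new Path("/"));
        while (corrupt.hasNext()) {
            System.out.println("corrupt: " + corrupt.next());
        }
        fs.close();
    }
}
```

That tells me *which* files are damaged, but not how to repair only the damaged portion – hence the question.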
Second hypothetical situation: same setup, but instead of the 3 GB file, suppose the data file is 300 GB.
The same corruption happens, again only on node 3, which hosts one third of the data, i.e., 100 GB.
This time I DO have the original data file.
How do I load only that 100 GB back into HDFS on node 3, without re-uploading the whole 300 GB file?
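The best I've come up with is a crude sketch: read just the affected byte range out of the local original and stream it into a side file in HDFS (the paths and the 200–300 GB offset range are made up). As far as I know HDFS files are write-once, so I don't see how to splice this back into the original file in place:

```java
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyRange {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical: the corrupt blocks cover bytes [200 GB, 300 GB) of the file.
        long offset = 200L * 1024 * 1024 * 1024;
        long length = 100L * 1024 * 1024 * 1024;

        try (FileInputStream in = new FileInputStream("/local/original.dat");
             FSDataOutputStream out = fs.create(new Path("/data/input.dat.part3"))) {
            // Skip to the start of the damaged range in the local original...
            long skipped = 0;
            while (skipped < offset) {
                skipped += in.skip(offset - skipped);
            }
            // ...then stream only that range into a new HDFS file.
            byte[] buf = new byte[8192];
            long remaining = length;
            int n;
            while (remaining > 0
                    && (n = in.read(buf, 0, (int) Math.min(buf.length, remaining))) > 0) {
                out.write(buf, 0, n);
                remaining -= n;
            }
        }
        fs.close();
    }
}
```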
In short, I am looking for ways to recover from partial DataNode corruption.
Appreciate the insights.