I went through a NameNode HA migration with Ambari 1.6.1 yesterday and wanted to share my experience. I saw that Ambari has a feature to migrate an HA NN, and that several bugs had been fixed in prior Ambari versions, so I thought “easy; I’m only moving my standby NN, no user downtime”.
tl;dr: HA migration is buggy; plan for downtime!
Started with: NameNodes in HA, standby NN on server A1, active NN on server A2. The goal was to end up with NNs on A2 and A3 (removing the one on A1).
1. Went through the Ambari wizard to migrate NNs. Set new NNs as A2 and A3. Everything worked up until “restarting services”, at which point it failed because it couldn’t “bind to address” when starting the NN on A2.
– I debugged it and the conf on A2 was showing nn1=A3 and nn2=A3 (both A3).
– Changing hdfs-site.xml by hand doesn’t work (Ambari redeploys its own version).
– Edited the config in the database. There aren’t a lot of docs (or any docs) on this, but what I did was add a new row to the clusterconfig table with a new version number and a type of hdfs-site, containing the corrected config. I then also inserted a new entry into clusterconfigmapping to make that version live. After making these changes, the A2 NN was able to start up.
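The "nn1=A3 and nn2=A3" problem above is the kind of thing that can be caught mechanically before restarting services. Here is a minimal sketch, assuming the standard HDFS HA property names (dfs.nameservices, dfs.ha.namenodes.<nameservice>, dfs.namenode.rpc-address.<nameservice>.<id>). The function names, the nameservice "mycluster" and the hostnames in SAMPLE are made up for illustration; this is not something Ambari ships.

```python
import xml.etree.ElementTree as ET


def load_props(xml_text):
    """Parse Hadoop-style <configuration><property> XML into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def split_csv(value):
    """Split a comma-separated config value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def duplicate_nn_addresses(props):
    """Return {nameservice: {rpc address: [nn ids]}} for every RPC address
    that more than one NameNode id points at. NameNode ids with no
    rpc-address property are ignored here."""
    problems = {}
    for ns in split_csv(props.get("dfs.nameservices", "")):
        seen = {}
        for nn in split_csv(props.get("dfs.ha.namenodes." + ns, "")):
            addr = props.get("dfs.namenode.rpc-address.%s.%s" % (ns, nn), "")
            if addr:
                seen.setdefault(addr, []).append(nn)
        dups = {addr: ids for addr, ids in seen.items() if len(ids) > 1}
        if dups:
            problems[ns] = dups
    return problems


# Hypothetical config reproducing the broken state: both ids point at A3.
SAMPLE = """<configuration>
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>a3.example.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>a3.example.com:8020</value></property>
</configuration>"""

if __name__ == "__main__":
    print(duplicate_nn_addresses(load_props(SAMPLE)))
```

Running the same check against the hdfs-site.xml that Ambari actually deployed on each NN host would show whether the wizard wrote the same host for both NameNode ids.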
2. Next steps were to formatZK / bootstrapStandby on the NNs.
– formatZK worked fine, but -bootstrapStandby resulted in “invalid last txid in stream” errors. I failed over the QJM but that didn’t help (all the QJMs were internally consistent; good). I didn’t find any useful information on how to fix this error (if anyone knows how, please let me know).
At this point, the NN on A3 wasn’t operational, though I could access HDFS w/datanodes running.
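For reference, the two manual steps in this section correspond to the stock HDFS commands `hdfs zkfc -formatZK` and `hdfs namenode -bootstrapStandby`. The sketch below just lists them and, by default, only prints them. The helper names and the dry-run wrapper are my own; which host and which user to run each command as depends on your setup and is not shown here.

```python
import subprocess


def ha_init_commands():
    """The two HDFS CLI invocations discussed above, as argv lists."""
    return [
        ["hdfs", "zkfc", "-formatZK"],              # initialise the HA znode in ZooKeeper
        ["hdfs", "namenode", "-bootstrapStandby"],  # copy namespace metadata to the new standby
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for argv in commands:
        line = " ".join(argv)
        print(line)
        if not dry_run:
            # Raises CalledProcessError on a non-zero exit status.
            subprocess.check_call(argv)
        printed.append(line)
    return printed


if __name__ == "__main__":
    run_all(ha_init_commands())  # dry run: nothing is executed
```

Keeping the dry run as the default seemed the safer choice for commands that touch NameNode metadata.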
3. I decided to “turn off NN HA” (to tear down and re-initialize the QJMs). I went through Ambari’s “disable HA”, which worked reasonably well. I skipped the step of enabling a secondary NN since I didn’t actually want a secondary NN (I was about to turn NN HA back on).
– The NN didn’t start back up: it complained that it wanted to format itself (ack!), but there were files in HDFS and it didn’t want to kill them (thank goodness).
– I edited the Ambari Python scripts to disable the automatic format (in ambari-agent on the NN, edited hdfs_namenode.py to disable the calls to format_namenode). This got the NN up and running.
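Rather than removing the format calls outright, a gentler variant would be to guard them. A rough sketch of such a guard follows, assuming the usual NameNode storage layout where a formatted dfs.namenode.name.dir contains current/VERSION. The function names are hypothetical and this is not the logic in Ambari’s hdfs_namenode.py.

```python
import os


def has_existing_metadata(name_dirs):
    """True if any NameNode storage directory already holds current/VERSION,
    i.e. it looks like it has been formatted before."""
    return any(
        os.path.isfile(os.path.join(d, "current", "VERSION"))
        for d in name_dirs
    )


def should_format(name_dirs):
    """Only allow a format when no directory shows signs of an existing namespace."""
    return not has_existing_metadata(name_dirs)
```

A caller would pass the directories listed in dfs.namenode.name.dir and skip the format step whenever should_format returns False.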
4. Ambari decided “you have HA already enabled” and wouldn’t let me re-enable it through the HA wizard.
– I ended up going through the Ambari source to figure this one out. I was looking for a DB flag, but it’s not in the database; the rule is actually “if the secondary namenode exists in the config then it is not namenode-HA, otherwise it is”.
– Fixed by adding a secondary namenode (but didn’t configure it).
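To make the rule from this step concrete, here is my paraphrase of it as a tiny Python function. It is an illustration of the behaviour described above, not code from the Ambari source. The function name is made up, and I’m assuming the component is identified as SECONDARY_NAMENODE, as in Ambari’s HDFS service definition.

```python
def is_nn_ha_enabled(component_names):
    """Paraphrase of the rule: if a secondary namenode is present, the
    cluster is treated as non-HA; if it is absent, HA is assumed."""
    return "SECONDARY_NAMENODE" not in set(component_names)


if __name__ == "__main__":
    # After "disable HA" with the secondary NN step skipped: still looks like HA.
    print(is_nn_ha_enabled(["NAMENODE", "DATANODE"]))
    # After adding an (unconfigured) secondary NN: the HA wizard is offered again.
    print(is_nn_ha_enabled(["NAMENODE", "SECONDARY_NAMENODE", "DATANODE"]))
```

This is why skipping the secondary NN step in 3 left Ambari believing HA was still on.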
5. Ambari then allowed the NN HA wizard to run. This worked for the most part (it failed starting Nagios; there was a missing config option, and I had to edit the Ambari scripts to temporarily work around it until I could add the option in Ambari).