It's nice to set this up on a sandbox VM, but let's get real: there's a huge gap when you want to install and run on a real cluster.
Looking at the Apache Spark docs on running on YARN:
https://spark.apache.org/docs/latest/running-on-yarn.html
there are gaps.
The page seems to imply that we add the Spark properties to yarn-site.xml. (Is this the case?)
If we decide to push the jar to HDFS, then we have to set the shell variable:
export SPARK_JAR=hdfs:///some/path
That would involve updating the YARN template, no?
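Just so we're talking about the same thing, here's roughly what I have in mind, sketched against a Spark 1.x assembly; the /apps/spark path and the jar version are placeholders I made up:

# Push the Spark assembly jar to HDFS (path and version are placeholders)
hdfs dfs -mkdir -p /apps/spark
hdfs dfs -put lib/spark-assembly-1.1.0-hadoop2.4.0.jar /apps/spark/

# Point YARN containers at the jar on HDFS instead of a local copy
export SPARK_JAR=hdfs:///apps/spark/spark-assembly-1.1.0-hadoop2.4.0.jar

# Then submit a job against the cluster, e.g. the stock SparkPi example
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    lib/spark-examples-*.jar 10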
On your site, you write:
<property>
  <name>yarn.application.classpath</name>
  <value>/etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*</value>
</property>
Silly me, but where's the Spark library? (Is this a typo?)
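I'd have guessed something more like the following; the /usr/lib/spark path is just my assumption about where the libs would live:

<property>
  <name>yarn.application.classpath</name>
  <value>/etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/spark/*,/usr/lib/spark/lib/*</value>
</property>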
Does Hortonworks require the Spark libs to come from the HDP site, or will the release from the Apache site work?
And if we update that property, does that mean we have to push the jars to every node rather than using HDFS?
Thx
-Mike