I have a file in HDFS inside my Hortonworks HDP 2.3_1 VirtualBox VM.
If I go into the guest spark-shell and refer to the file thus, it works fine:
val words = sc.textFile("hdfs:///tmp/people.txt")
words.count
However, if I try to access it from a local Spark app on my Windows host, it doesn't work:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
val words = sc.textFile("hdfs://localhost:8020/tmp/people.txt")
words.count
Emits
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-452094660-10.0.2.15-1437494483194:blk_1073742905_2098 file=/tmp/people.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:526)
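The block ID embeds the VM's internal NAT address (10.0.2.15), so my working theory is that the NameNode answers fine on the forwarded port but then hands my Windows client a DataNode address it can't actually reach. A sketch of the client-side setting I'm experimenting with, assuming the sandbox hostname resolves to 127.0.0.1 in my Windows hosts file and the DataNode port is forwarded too (both assumptions on my part):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
// Ask the HDFS client to contact DataNodes by hostname instead of the
// NAT-internal IP address the NameNode reports back.
sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")
val words = sc.textFile("hdfs://localhost:8020/tmp/people.txt")
words.count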
Port 8020 is open, and if I give a wrong file name, it tells me:
Input path does not exist: hdfs://localhost:8020/tmp/people.txt!!
localhost:8020 should be correct, as the guest HDP VM has NAT port forwarding to my Windows host.
I also opened up some of the extra ports listed here:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.1/bk_reference/content/reference_chap2_1.html
And the fact that a wrong name yields the appropriate exception tells me the NameNode connection itself is working.
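As a further sanity check that forwarding covers more than the NameNode RPC port, I can probe the DataNode data-transfer port from the Windows side (50010 is the HDFS default; that it applies to my sandbox is an assumption):

import java.net.Socket
// Throws java.net.ConnectException if the DataNode port is not forwarded from the VM.
val probe = new Socket("localhost", 50010)
probe.close()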
My pom has:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>1.4.1</version>
    <scope>provided</scope>
</dependency>
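In case a client/cluster mismatch contributes, I'm also considering pinning hadoop-client to the sandbox's Hadoop version (2.7.1 for HDP 2.3 is my assumption; spark-core otherwise pulls in an older Hadoop client by default, if I read the Spark pom right):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
    <scope>provided</scope>
</dependency>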
Am I doing something wrong? What is the BlockMissingException trying to tell me, and how do I fix it, if at all?
I just want to be able to do my Spark dev on my Windows box in Scala IDE whilst talking to my HDP sandbox.
Thanks.