Overview

Apache Livy provides a REST interface for interacting with Apache Spark. When a Spark job needs to read from or write to Apache Hadoop HDFS that is secured with Kerberos, it must first obtain a Kerberos token, and this tends to cause problems because the token has to be delegated to the processes that actually run the job.

spark-submit solves this by obtaining a delegation token on your behalf when the job is submitted. For this to work, the Hadoop configuration and JAR files must be on the spark-submit classpath. Exactly which configurations and JAR files are needed is explained in the references here.
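As an illustration of what spark-submit does outside of Livy, a direct submission on a Kerberized YARN cluster might look like the sketch below. The principal, keytab, and paths are placeholders; the point is that with the Hadoop client configuration visible to spark-submit (here via HADOOP_CONF_DIR) and either an existing kinit ticket or a --principal/--keytab pair, spark-submit fetches the HDFS delegation token before the job starts.

# Hypothetical direct spark-submit; principal, keytab, and paths are placeholders.
export HADOOP_CONF_DIR=/etc/hadoop/conf   # makes the Hadoop client configuration visible to spark-submit

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  --class SparkHDFSKerberos \
  SparkHDFSKerberos.jar \
  hdfs://PATH_TO_FILE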

When using Livy with HDP, the Hadoop JAR files and configurations are already on the classpath for spark-submit. This means there is nothing special required to read/write to HDFS with a Spark job submitted through Livy.

If you are looking to do something similar with Apache HBase, see this post.

Assumptions
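Roughly, the examples below assume a Kerberized HDP cluster with Livy exposed through Knox (they use https://localhost:8443/gateway/default/livy/v1 as the endpoint), a user who can authenticate to Knox and whom Livy is allowed to impersonate via proxyUser, and an application JAR that has already been uploaded to HDFS under that user's home directory.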

run.sh
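The script below submits the job as a Livy batch through the Knox gateway: proxyUser asks Livy to run the job as the authenticated user, file points at the application JAR already sitting in HDFS, className names the main class, and args carries the HDFS path the job will read.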

curl \
  -u ${USER} \
  --location-trusted \
  -H 'X-Requested-by: livy' \
  -H 'Content-Type: application/json' \
  -X POST \
  https://localhost:8443/gateway/default/livy/v1/batches \
  --data "{
    \"proxyUser\": \"${USER}\",
    \"file\": \"hdfs:///user/${USER}/SparkHDFSKerberos.jar\",
    \"className\": \"SparkHDFSKerberos\",
    \"args\": [
      \"hdfs://PATH_TO_FILE\"
    ]
  }"

SparkHDFSKerberos
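The job itself is deliberately minimal: it reads the HDFS path passed as the first argument and prints the number of lines, which is enough to confirm that the delegation token obtained at submit time actually works.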

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHDFSKerberos {
  public static void main(String[] args) {
    // Use the class name as the Spark application name; master and other
    // settings come from the Livy/spark-submit environment.
    SparkConf sparkConf = new SparkConf().setAppName(SparkHDFSKerberos.class.getCanonicalName());
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    // Read the Kerberized HDFS path passed as the first argument and print its line count.
    JavaRDD<String> textFile = jsc.textFile(args[0]);
    System.out.println(textFile.count());

    jsc.stop();
  }
}
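
Before run.sh can reference the JAR, it has to be built and copied into HDFS. A minimal sketch, assuming the class above is packaged into SparkHDFSKerberos.jar with your own build tooling and that you hold a Kerberos ticket for the HDFS commands:

# Hypothetical upload steps; build SparkHDFSKerberos.jar first.
kinit ${USER}                      # or rely on an existing ticket / keytab
hdfs dfs -mkdir -p /user/${USER}
hdfs dfs -put -f SparkHDFSKerberos.jar /user/${USER}/SparkHDFSKerberos.jar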