Apache Livy - Apache Spark, HDFS, and Kerberos
Overview
Apache Livy provides a REST interface for interacting with Apache Spark. When using Apache Spark to interact with Apache Hadoop HDFS that is secured with Kerberos, a Kerberos token needs to be obtained. This tends to pose issues, because the token must then be delegated to the processes that actually run the job on the cluster.
spark-submit provides a solution to this by obtaining a delegation token on your behalf when the job is submitted. For this to work, the Hadoop configurations and JAR files must be on the spark-submit classpath. Exactly which configurations and JAR files are required is explained in the references here.
When using Livy with HDP, the Hadoop JAR files and configurations are already on the classpath for spark-submit. This means there is nothing special required to read/write to HDFS with a Spark job submitted through Livy.
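For comparison, here is a rough sketch of submitting the same job directly with spark-submit; the keytab path, principal, and HADOOP_CONF_DIR location below are assumptions that will vary per cluster:
# Assumed keytab, principal, and config locations for illustration only.
export HADOOP_CONF_DIR=/etc/hadoop/conf
kinit -kt /etc/security/keytabs/myuser.keytab myuser@EXAMPLE.COM
# With the Hadoop configuration on the classpath, spark-submit obtains
# the HDFS delegation token on your behalf at submission time.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class SparkHDFSKerberos \
  hdfs:///user/myuser/SparkHDFSKerberos.jar \
  hdfs://PATH_TO_FILE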
If you are looking to do something similar with Apache HBase, see this post.
Assumptions
- Apache Ambari for managing Apache Spark and Apache Livy
- Apache Knox in front of Apache Livy secured with Kerberos
- Apache Hadoop HDFS secured with Kerberos
run.sh
curl \
  -u ${USER} \
  --location-trusted \
  -H 'X-Requested-by: livy' \
  -H 'Content-Type: application/json' \
  -X POST \
  https://localhost:8443/gateway/default/livy/v1/batches \
  --data "{
    \"proxyUser\": \"${USER}\",
    \"file\": \"hdfs:///user/${USER}/SparkHDFSKerberos.jar\",
    \"className\": \"SparkHDFSKerberos\",
    \"args\": [
      \"hdfs://PATH_TO_FILE\"
    ]
  }"
SparkHDFSKerberos
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHDFSKerberos {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName(SparkHDFSKerberos.class.getCanonicalName());
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Read the HDFS path passed in via "args" in run.sh and print the
        // line count to show that the Kerberos-secured read succeeded.
        JavaRDD<String> textFile = jsc.textFile(args[0]);
        System.out.println(textFile.count());

        jsc.stop();
        jsc.close();
    }
}
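The jar referenced by the "file" field in run.sh has to be uploaded to HDFS before the batch is submitted. A minimal sketch of building and uploading it, where the Spark client jar directory is an assumption based on a typical HDP layout:
# The Spark jars path is an assumed HDP location; adjust for your install.
javac -cp "/usr/hdp/current/spark2-client/jars/*" SparkHDFSKerberos.java
jar cf SparkHDFSKerberos.jar SparkHDFSKerberos*.class
# Upload to the path referenced by "file" in run.sh.
hdfs dfs -put -f SparkHDFSKerberos.jar /user/${USER}/SparkHDFSKerberos.jar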