Apache Livy - Apache Spark, HDFS, and Kerberos
Overview
Apache Livy provides a REST interface for interacting with Apache Spark. When using Apache Spark to interact with Apache Hadoop HDFS that is secured with Kerberos, a Kerberos token needs to be obtained. This tends to pose issues, because the token must then be delegated to the processes that actually run the job on the cluster.
spark-submit provides a solution to this by obtaining a delegation token on your behalf when the job is submitted. For this to work, the Hadoop configurations and JAR files must be on the spark-submit classpath. Exactly which configurations and JAR files are required is explained in the references here.
When using Livy with HDP, the Hadoop JAR files and configurations are already on the classpath for spark-submit. This means there is nothing special required to read/write to HDFS with a Spark job submitted through Livy.
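For comparison, here is a rough sketch of submitting the same job directly with spark-submit; the keytab path, principal, and HADOOP_CONF_DIR location below are assumptions that will vary per cluster:
# Assumed keytab, principal, and config locations for illustration only.
export HADOOP_CONF_DIR=/etc/hadoop/conf
kinit -kt /etc/security/keytabs/myuser.keytab myuser@EXAMPLE.COM
# With the Hadoop configuration on the classpath, spark-submit obtains
# the HDFS delegation token on your behalf at submission time.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class SparkHDFSKerberos \
  hdfs:///user/myuser/SparkHDFSKerberos.jar \
  hdfs://PATH_TO_FILE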
If you are looking to do something similar with Apache HBase, see this post.
Assumptions
- Apache Ambari for managing Apache Spark and Apache Livy
- Apache Knox in front of Apache Livy secured with Kerberos
- Apache Hadoop HDFS secured with Kerberos
run.sh
curl \
  -u ${USER} \
  --location-trusted \
  -H 'X-Requested-by: livy' \
  -H 'Content-Type: application/json' \
  -X POST \
  https://localhost:8443/gateway/default/livy/v1/batches \
  --data "{
    \"proxyUser\": \"${USER}\",
    \"file\": \"hdfs:///user/${USER}/SparkHDFSKerberos.jar\",
    \"className\": \"SparkHDFSKerberos\",
    \"args\": [
      \"hdfs://PATH_TO_FILE\"
    ]
  }"
SparkHDFSKerberos
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHDFSKerberos {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName(SparkHDFSKerberos.class.getCanonicalName());
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Read the HDFS path passed in via "args" in run.sh and print the
        // line count to show that the Kerberos-secured read succeeded.
        JavaRDD<String> textFile = jsc.textFile(args[0]);
        System.out.println(textFile.count());

        jsc.stop();
        jsc.close();
    }
}
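The jar referenced by the "file" field in run.sh has to be uploaded to HDFS before the batch is submitted. A minimal sketch of building and uploading it, where the Spark client jar directory is an assumption based on a typical HDP layout:
# The Spark jars path is an assumed HDP location; adjust for your install.
javac -cp "/usr/hdp/current/spark2-client/jars/*" SparkHDFSKerberos.java
jar cf SparkHDFSKerberos.jar SparkHDFSKerberos*.class
# Upload to the path referenced by "file" in run.sh.
hdfs dfs -put -f SparkHDFSKerberos.jar /user/${USER}/SparkHDFSKerberos.jar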