Apache Ranger - Hive over HDFS Audit Logs

Overview

Apache Ranger allows for centralized authorization and auditing for Apache Hadoop and related technologies. Ranger auditing can be stored in multiple locations including Apache Solr and HDFS. With HDFS storing audit logs for compliance purposes, we needed a way to query these logs. Apache Hive provides the ability to query HDFS data without a lot of effort. This has been written about before here as well. The post below outlines a similar approach with slightly different details.

Ranger Audit Log Format on HDFS

Currently, Apache Ranger stores audit logs in HDFS in a standard JSON format. The audit schema is detailed here. Previously I wrote about how these logs could be compressed here. The format could change to make the process below easier with RANGER-1837.

Creating the Hive table over Ranger Audit Logs on HDFS

Since the data is in JSON format on HDFS, there are a few options for what Hive SerDe to use. The built in Hive JSON SerDe has issues when the data is malformed, which from experience can happen with the Ranger audit data. @rcongiu has a Hive JSON SerDe - https://github.com/rcongiu/Hive-JSON-Serde that can handle the malformed JSON data easily.

The Ranger audit log format on HDFS dictates that the folder structure include component and evtDate. This folder structure makes it possible to define a partitioned Hive table which improves query performance when looking at a specific date range or component.

ADD JAR hdfs:///INSERT_PATH_TO/json-serde-VERSION-jar-with-dependencies.jar;

DROP TABLE IF EXISTS ranger_audit;
CREATE EXTERNAL TABLE ranger_audit (
  resource string,
  resType string,
  reqUser string,
  evtTime TIMESTAMP,
  policy int,
  access string,
  result int,
  reason string,
  enforcer string,
  repoType int,
  repo string,
  cliIP string,
  action string,
  agentHost string,
  logType string,
  id string
)
PARTITIONED BY (component String, evtDate String)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true");

For each partition (component and evtDate), you will need to alter the table to add the partition. I scripted this out to automatically add new partitions as necessary. Example of doing this for one partition:

ALTER TABLE ranger_audit ADD IF NOT EXISTS PARTITION (component='COMPONENT_NAME', evtDate='DATE') LOCATION 'DATE_FOLDER';

Using Hive to Query Ranger Audit Logs on HDFS

After the Hive table has been created, it is possible to issue Hive SQL queries across the Ranger audit logs. Some query examples are below:

-- Get total number of events
SELECT count(1) FROM ranger_audit;

-- Get total number of events for HDFS
SELECT count(1) FROM ranger_audit WHERE component='HDFS';

-- Get total number of events on 2018-03-23
SELECT count(1) FROM ranger_audit WHERE evtDate='2018-03-23';

-- Get total number of events for HDFS on 2018-03-23
SELECT count(1) FROM ranger_audit WHERE component='HDFS' and evtDate='2018-03-23';

These queries can be run in multiple different tools including Apache Zeppelin which can generate charts and graphs. Additionally, after the Hive table has been created it can be easily exposed over JDBC/ODBC.