Apache Knox - Audit Logging - Duplicate Correlation IDs

Overview

Apache Knox is a reverse proxy that simplifies security in front of a Kerberos-secured Apache Hadoop cluster. Knox can audit the actions of users, and its audit logging uses a correlation id to track a single request through the system. In the audit log, the REQUEST_ID field contains the value of the correlation id. While debugging Knox interactions, we found that multiple requests had the same correlation id, which should never happen. I emailed the Apache Knox mailing list to determine if this was expected. The Knox community agreed this behavior was not correct and created KNOX-1091.

Determining how Knox `Correlation IDs` are generated

Apache Knox generates correlation ids using a Jetty handler. The CorrelationHandler sets the RequestId using UUID.randomUUID(), which uses SecureRandom to generate a random (type 4) UUID. For auditing, the UUID doesn’t have to be cryptographically secure, but it shouldn’t produce duplicates easily.
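To check whether UUID.randomUUID() itself could plausibly be the source of duplicates, one option is to hammer it from many threads and count collisions. The following is a minimal sketch of that idea; the class name UuidCollisionCheck and its parameters are hypothetical and not part of Knox.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness (not Knox code): looks for duplicate UUIDs under concurrency.
class UuidCollisionCheck {

    // Generates `total` UUIDs across `threads` workers and returns how many were already seen.
    static int countDuplicates(int threads, int total) {
        Set<UUID> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger duplicates = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < total; i++) {
            pool.execute(() -> {
                // Set.add returns false when the value is already present.
                if (!seen.add(UUID.randomUUID())) {
                    duplicates.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return duplicates.get();
    }

    public static void main(String[] args) {
        System.out.println("duplicates: " + countDuplicates(100, 100_000));
    }
}
```

If a run like this reports no duplicates, the generator is an unlikely culprit and attention can shift to how the id is stored and read back per request.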

Theories for duplicate `Correlation IDs`

I led my team through an exercise of brainstorming how Knox could be generating duplicate correlation ids.

Some ideas we came up with:

Java UUID.randomUUID() returning the same UUID under high load
Jetty not being thread safe under high load
HTTP 1.1 pipelined request handling
LDAP integration causing 401 errors and duplicating requests
Shiro not being thread safe with session management
Knox CorrelationHandler not being thread safe

Reproducing Duplicate `Correlation IDs`

I was able to reproduce duplicate correlation ids using Apache Bench when Knox didn’t use authentication. I also determined that SSL did not change the behavior. Apache Knox 1.0.0 generated the following output for 1000 requests, 100 of them concurrent.

[knox-1.0.0]$ grep -F access logs/gateway-audit.log | cut -d'|' -f3 | sort | uniq -c | sort -n | tail -n5
a781630c-93b8-48c2-a6fd-aaa428a6bf14
f0ab4c10-0dc7-41ef-a27a-9aebe9c8ce58
985df820-a82c-4704-b773-016769413cc2
b46b3f23-5514-401c-bfee-440790e54b31
73e77c01-054b-4063-8fd2-c1a3cabdfe4c
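The same tally can be done in Java when the shell pipeline is not convenient. This is a rough sketch that mirrors the command above: it keeps lines containing "access", takes the third pipe-delimited field as the correlation id, and counts occurrences. The class name and the threshold parameter are my own assumptions; how many audit lines a single healthy request produces is not stated here, so the caller decides what count is suspicious.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not Knox code): tallies correlation ids in audit log lines.
class AuditLogDuplicates {

    // Returns id -> line count for ids that appear on more than `threshold` "access" lines.
    static Map<String, Integer> idsSeenMoreThan(List<String> lines, int threshold) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (!line.contains("access")) {
                continue; // same filter as `grep -F access`
            }
            String[] fields = line.split("\\|", -1);
            if (fields.length < 3) {
                continue; // no third field to read
            }
            counts.merge(fields[2], 1, Integer::sum); // third field, as in `cut -d'|' -f3`
        }
        counts.values().removeIf(count -> count <= threshold);
        return counts;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        int threshold = Integer.parseInt(args[1]);
        idsSeenMoreThan(lines, threshold)
                .forEach((id, count) -> System.out.println(count + " " + id));
    }
}
```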

Workaround for Duplicate `Correlation IDs`

Setting gateway.threadpool.max to 6 in gateway-site.xml prevented duplicate correlation ids from being generated. With a thread pool this small, requests are no longer handled in parallel, which points to a thread safety problem.

Determining the Root Cause of Duplicate `Correlation IDs`

It looks like the Log4j Mapped Diagnostic Context (MDC) is the middle piece that holds the correlation id. I’m not convinced that the MDC is being handled correctly with the Jetty thread pool. From what I gather, reused threads can cause issues with the MDC if it is not cleaned out between uses. I don’t see any places where MDC.remove or MDC.clear is called except in tests.

Apache Knox does not appear to clear the MDC between requests. The distribution of duplicate correlation ids correlates with how Jetty prefers recently busy threads.
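The general hazard is easy to show outside of Knox. The sketch below uses a plain ThreadLocal as a stand-in for the MDC (an assumption made to keep the example free of logging dependencies) and a single pooled thread as a stand-in for a reused Jetty worker. It illustrates the suspected mechanism only; it does not establish that Knox follows this exact path.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo (not Knox code): per-thread context surviving across pooled tasks.
class StaleContextDemo {

    // Stand-in for the logging MDC: one slot per thread.
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Runs two tasks on the same pooled thread. The first sets an id, the second
    // only reads. Returns whatever the second task finds in the context.
    static String secondRequestSees(boolean clearAfterFirst) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable first = () -> {
                CONTEXT.set("request-1");
                if (clearAfterFirst) {
                    CONTEXT.remove(); // cleanup that the suspected bug would be missing
                }
            };
            Callable<String> second = CONTEXT::get;
            pool.submit(first).get();
            return pool.submit(second).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + secondRequestSees(false));
        System.out.println("with cleanup:    " + secondRequestSees(true));
    }
}
```

Without the cleanup, the second task observes the id left behind by the first, because both ran on the same thread. With the cleanup, it observes nothing.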

Where to go from here?

I have been looking at fixing the MDC handling in Apache Knox. The MDC needs to be cleared after each request and before a new request is handled. It will take some time to write a test that guards against a regression. Follow KNOX-1091 for continued updates.
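The shape of fix I have in mind is the usual set, run, clear pattern wrapped in try/finally. The following is a sketch under assumptions, not the actual Knox change: a ThreadLocal again stands in for the MDC, and ScopedCorrelation and withFreshId are hypothetical names.

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical wrapper (not Knox code): scopes a correlation id to one unit of work.
class ScopedCorrelation {

    // Stand-in for the MDC entry that would hold the correlation id.
    static final ThreadLocal<String> CORRELATION_ID = new ThreadLocal<>();

    // Assigns a fresh id, runs the handler, and always clears the id afterwards.
    static <T> T withFreshId(Supplier<T> handler) {
        CORRELATION_ID.remove(); // drop anything a previous request left on this thread
        CORRELATION_ID.set(UUID.randomUUID().toString());
        try {
            return handler.get();
        } finally {
            CORRELATION_ID.remove(); // runs even if the handler throws
        }
    }

    public static void main(String[] args) {
        String id = withFreshId(CORRELATION_ID::get);
        System.out.println("id during request: " + id);
        System.out.println("id after request:  " + CORRELATION_ID.get());
    }
}
```

Clearing both before and after is deliberately redundant: the leading remove protects against code paths that skipped cleanup, and the finally block keeps the thread clean for whatever runs on it next.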
