
Apache Hadoop’s security was designed and implemented around 2009, and has been stabilizing since then. However, due to a lack of documentation around this area, it is hard to understand or debug problems when they arise. Delegation tokens were designed, and are widely used, in the Hadoop ecosystem as an authentication method. This blog post introduces the concept of Hadoop Delegation Tokens in the context of the Hadoop Distributed File System (HDFS) and the Hadoop Key Management Server (KMS), and provides some basic code and troubleshooting examples. It is noteworthy that many other services in the Hadoop ecosystem also utilize delegation tokens, but for brevity we will only discuss HDFS and KMS.

This blog post assumes the reader understands the basic concepts of authentication and Kerberos (to follow the authentication flow), as well as HDFS Architecture and HDFS Transparent Encryption (to understand what HDFS and KMS are). Readers who are not interested in HDFS transparent encryption can ignore the KMS portions of this post. A previous blog post about general Authorization and Authentication in Hadoop can be found here.

Hadoop was initially implemented without real authentication, which meant data stored in Hadoop could be easily compromised. The security feature was added later via HADOOP-4487 in 2010 with two fundamental goals:

1. Preventing the data stored in HDFS from unauthorized access.
2. Not adding significant cost while achieving goal #1.

To achieve the first goal, we need to ensure that:

1. Any clients accessing the cluster are authenticated, to ensure they are who they claim to be.
2. Any servers of the cluster are authenticated to be part of the cluster.

For this goal, Kerberos was chosen as the underlying authentication service. Other mechanisms, such as the Delegation Token, the Block Access Token, and Trust, were introduced to complement Kerberos. In particular, the Delegation Token was introduced to achieve the second goal (see the next section for how). Below is a simplified diagram that illustrates where Kerberos and Delegation Tokens are used in the context of HDFS (other services are similar):

Figure 1: Simplified Diagram of HDFS Authentication Mechanisms

In the simple HDFS example above, several authentication mechanisms are in play:

- The end user (joe) can talk to the HDFS NameNode using Kerberos.
- The distributed tasks the end user (joe) submits can access the HDFS NameNode using joe’s Delegation Tokens. This will be the focus for the rest of this blog post.
- The HDFS DataNodes talk to the HDFS NameNode using Kerberos.
- The end user and the distributed tasks can access HDFS DataNodes using Block Access Tokens.

We will give a brief introduction to some of the above mechanisms in the Other Ways Tokens Are Used section at the end of this blog post. To read more about the details of Hadoop’s security design, please refer to the design doc in HADOOP-4487 and the Hadoop Security Architecture Presentation.

While it is theoretically possible to use Kerberos alone for authentication, it has its own problems when used in a distributed system like Hadoop. Imagine if, for each MapReduce job, all the worker tasks had to authenticate via Kerberos using a delegated TGT (Ticket Granting Ticket): the Kerberos Key Distribution Center (KDC) would quickly become the bottleneck.
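To make this concrete, below is a minimal sketch of how a job client might authenticate to the KDC once and then fetch HDFS delegation tokens for its tasks, using Hadoop’s UserGroupInformation and FileSystem APIs. The principal, keytab path, and renewer name are placeholders, and a Kerberos-enabled cluster configuration plus the Hadoop client libraries on the classpath are assumed; this is an illustration of the idea, not the exact code used by any particular job framework.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class FetchDelegationTokens {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();

    // Authenticate to the KDC once, as the submitting user.
    // The principal and keytab path below are placeholders.
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "joe@EXAMPLE.COM", "/path/to/joe.keytab");

    // Ask the NameNode for delegation tokens on joe's behalf and collect
    // them into a Credentials object. The renewer (here "yarn") is the
    // principal that will be allowed to renew these tokens later.
    Credentials creds = new Credentials();
    FileSystem fs = FileSystem.get(conf);
    Token<?>[] tokens = fs.addDelegationTokens("yarn", creds);

    for (Token<?> token : tokens) {
      System.out.println("Fetched token: " + token);
    }

    // In a real job submission, the Credentials object would be shipped
    // with the job so that worker tasks authenticate to HDFS with these
    // tokens instead of contacting the KDC themselves.
  }
}
```

Because only the client contacts the KDC and the tokens travel with the job, the per-task Kerberos round trips described above are avoided; this is the scaling problem delegation tokens were designed to solve.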