Sunday, April 6, 2014

HDFS Security: Authentication and Authorization

Hadoop Distributed File System (HDFS) security features can be confusing at first, but in fact they follow simple rules. We can examine the topic in two parts: authentication and authorization.

I am testing with Hadoop 1.0.3 on a CentOS 5.8 server; my client machine runs CentOS 6.5.



1. Authentication

Authentication is the process of determining whether someone really is who they claim to be.

Hadoop supports two different authentication mechanisms, specified by the hadoop.security.authentication property, which is defined in core-default.xml and can be overridden in its site-specific version, core-site.xml.

  • simple (no authentication)
  • kerberos

By default, simple is selected.


1.1. simple (no authentication)

If you have installed Hadoop with its defaults, no authentication is performed. This means that any Hadoop client can claim to be any user on HDFS.

The identity of the client process is determined by the host operating system. On Unix-like systems, this is equivalent to running "whoami": it is the username the client process runs under. Hadoop itself does not provide user creation or management.


Let's say that on your client machine you have typed:

> useradd hadoopuser
> su hadoopuser
> hadoop dfs -ls /

hadoopuser will be sent to the NameNode as the user identity. After that, permission checks (which are part of the authorization process) will be performed.
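The identity lookup in simple mode can be sketched in Python roughly like this (an illustration of the idea, not Hadoop's actual code):

```python
import os
import pwd

def simple_auth_identity():
    # Equivalent of `whoami`: look up the username for the current UID.
    # In simple mode the Hadoop client sends this name to the NameNode
    # unverified; nothing stops a user from creating an arbitrary local
    # account to impersonate any HDFS user.
    return pwd.getpwuid(os.getuid()).pw_name
```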



Implications

This mode of operation poses a great risk.

Think about this scenario:

1. In your Hadoop cluster, you have started the NameNode daemon as user "hdfsuser". In Hadoop, the user running the NameNode is the super-user and can do anything on HDFS. Permission checks never fail for the super-user.
2. There is a client machine with Hadoop binaries installed, and this machine has network access to the Hadoop cluster.
3. The owner of the client machine knows the NameNode address and port. He also knows that the super-user is "hdfsuser", but does not know its password.
4. To access HDFS (and do much more), all he needs to do is create a new user "hdfsuser" on the client machine. Then he can run Hadoop shell commands as this user.
> useradd hdfsuser
> su hdfsuser
> hadoop dfs -rmr /
When he runs these commands, Hadoop believes the client's claim that "he is hdfsuser" and executes the requested command. As a result, all data on HDFS is deleted.


1.2. kerberos

I have no hands-on experience with this mode.


2. Authorization

Authorization is the function of specifying access rights to resources.
First we can examine the HDFS permissions model, and then see how permission checks are done.


2.1. HDFS Permissions Model


HDFS implements a permissions model for files and directories that shares much of the POSIX model.


We can continue with an example. If you run:
> hadoop dfs -ls /

You will see something like:

Found 2 items
drwxr-xr-x   - someuser   somegroup          0 2014-03-05 15:01 /tmp
drwxr-xr-x   - someuser   somegroup          0 2014-03-03 11:01 /data

1. Each file and directory is associated with an owner and a group. In the example, someuser is the owner of the /tmp and /data directories, and somegroup is their group.

2. A file or directory has separate permissions for the user that is its owner, for other users that are members of its group, and for all other users.

In the example, drwxr-xr-x encodes this behaviour. The first letter, d, specifies whether the entry is a directory. The first rwx shows the owner's permissions, the next r-x shows the permissions for group members, and the last r-x shows the permissions for all other users.

3. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory.


4. When a new file/directory is created, its owner is the user of the client process and its group is the group of its parent directory.


One note: in HDFS, owner and group values are just strings; you can assign a non-existent username or group to a file/directory.
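As an illustration (plain Python, not Hadoop code), a mode string like drwxr-xr-x from the listing above can be decomposed like this:

```python
def parse_mode(mode):
    # mode is an ls-style string: 1 type character + 3 permission triads,
    # e.g. "drwxr-xr-x" -> directory, owner rwx, group r-x, other r-x.
    is_dir = mode[0] == "d"
    owner = {c for c in mode[1:4] if c != "-"}
    group = {c for c in mode[4:7] if c != "-"}
    other = {c for c in mode[7:10] if c != "-"}
    return is_dir, owner, group, other
```

For the /tmp entry above, parse_mode("drwxr-xr-x") yields a directory flag of True, owner permissions {r, w, x}, and {r, x} for both group and others.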


2.2. Permission Checking

Authorization takes place after the user is authenticated. At this point the NameNode knows the username (let's say hdfsuser), and it then tries to get the groups of hdfsuser.

This group mapping is configured by the hadoop.security.group.mapping property in core-default.xml and its site-specific version, core-site.xml. The default implementation achieves this by simply running the groups command for the user, then mapping the username to the returned groups.
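The idea behind the default shell-based mapping can be sketched in Python like this (an illustration, not Hadoop's actual Java implementation; the `id -Gn` call is one way to list a user's Unix groups):

```python
import subprocess

def get_groups(user):
    # Rough sketch of shell-based group mapping: ask the OS on the
    # NameNode host for the user's groups and split the output.
    result = subprocess.run(["id", "-Gn", user],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return []  # user unknown on the NameNode host: no groups
    return result.stdout.split()
```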


Group mapping is done on the NameNode machine, NOT on the client machine. This is an important point to realize: hdfsuser can have different groups on the client and NameNode machines, but only the groups on the NameNode go into the mapping.


Once the NameNode has the username and its list of groups, permission checks can be done.

  • If the username matches the owner of the file/directory, the owner permissions are tested;
  • Else if the group of the file/directory matches any member of the groups list, the group permissions are tested;
  • Otherwise the other permissions are tested.
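The check order above can be sketched as follows (an illustration of the rule, not the NameNode's actual code; the inode layout is a made-up stand-in):

```python
def check_permission(user, user_groups, inode, action):
    # inode is a hypothetical dict such as:
    #   {"owner": "someuser", "group": "somegroup",
    #    "perms": {"owner": {"r","w","x"},
    #              "group": {"r","x"},
    #              "other": {"r","x"}}}
    # action is "r", "w", or "x".
    # Exactly one permission class is tested: if the caller is the owner,
    # only the owner bits matter, even when the group bits would allow it.
    if user == inode["owner"]:
        cls = "owner"
    elif inode["group"] in user_groups:
        cls = "group"
    else:
        cls = "other"
    return action in inode["perms"][cls]
```

For the /tmp listing above (drwxr-xr-x someuser somegroup), someuser can write, while a member of somegroup can only read and traverse.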


If a permissions check fails, the client operation fails.


2.3. Super-User

The user running the NameNode daemon is the super-user, and permission checks for the super-user never fail.


2.4. Super Group

If you set dfs.permissions.supergroup in hdfs-site.xml, members of the given group also become super-users. By default, it is set to supergroup in hdfs-default.xml.


<property>
    <name>dfs.permissions.supergroup</name>
    <value>admingroup</value>
</property>

In later Hadoop versions, this property is named dfs.permissions.superusergroup.
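Combining sections 2.3 and 2.4, the super-user bypass can be sketched like this ("hdfsuser" and "admingroup" just follow the examples above; this is not Hadoop code):

```python
def is_super_user(user, user_groups,
                  namenode_user="hdfsuser", supergroup="admingroup"):
    # The user the NameNode runs as, and any member of the group named by
    # dfs.permissions.supergroup, skip all permission checks.
    return user == namenode_user or supergroup in user_groups
```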


2.5. Disable HDFS Permissions

If you set dfs.permissions to false, permission checking is disabled. However, this does not change the mode, owner, or group of files/directories.
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

In later Hadoop versions, this property is named dfs.permissions.enabled.


