
Exploring Data Lake Security Personas

By Shamir Charania

A couple of years ago I was working on a project that involved data stored in a data lake. We were close to go-live of our production analytics system, and the pressure on the development teams to meet deadlines was climbing by the day. The speed at which the development teams were moving was causing a huge backlog for the operational teams. We took a security-first approach to data access, but this caused a lot of friction because we were ill-equipped (with tooling) to react quickly to access requests and permission changes. This story has a sad, but predictable, ending. One day, while trying to use the built-in cloud provider tooling, one of our operational team members clicked the wrong button and ended up deleting a huge amount of data from the data lake. The rest of the team, already stretched thin and exhausted, had to rally to attempt a recovery. As the dust settled, questions were asked. What was that person doing in the data lake? Why did they have the permissions to make such a drastic change? Long story short, operational resources with privileged access to the data lake were attempting to make access control changes to support the business.

Securing data within an organization is a team effort. Key data security roles, or personas, are usually filled by different individuals with different areas of focus and expertise. The goal is for each persona to have the correct level of access to the data lake, and for no single individual (or team) to fill the responsibilities of more than one persona. Security issues arise when this goal is not met, as is evident in my story above. An administrator, trying to make access control changes, ended up deleting data because they had the permission to do so.
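The "one persona per individual" goal can be checked mechanically. The following is a minimal sketch, assuming persona assignments are available as simple (person, persona) pairs; the names and the `find_duty_conflicts` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical persona assignments: (person, persona) pairs.
ASSIGNMENTS = [
    ("alice", "data_governance"),
    ("bob", "iam_admin"),
    ("carol", "infrastructure_admin"),
    ("carol", "iam_admin"),  # carol fills two personas
]


def find_duty_conflicts(assignments):
    """Return {person: sorted personas} for anyone holding more than one persona."""
    personas_by_person = defaultdict(set)
    for person, persona in assignments:
        personas_by_person[person].add(persona)
    return {
        person: sorted(personas)
        for person, personas in personas_by_person.items()
        if len(personas) > 1
    }


if __name__ == "__main__":
    # Anyone listed here is a separation-of-duties concern.
    print(find_duty_conflicts(ASSIGNMENTS))
```

In the story above, the operational team member would show up in this report: they held infrastructure permissions and were also acting as the access administrator.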

Data lake security personas


So let's explore what these personas on the data lake are, and what permissions they actually need to perform their main functions.

Data Governance

Data governance has always been an important role within an organization. Organizations need governance to ensure that they are meeting their obligations to outside regulators, governing laws, and stakeholders/investors in an effective manner. As organizations centralize their data, governance is required to ensure all the data can live "side-by-side" without breaking any of the rules that apply to that data. For example, just because an organization has access to a data set doesn't mean it necessarily has the right to process it (or the right to process it in certain ways). Think about data that could be subject to copyright, or personal data that may be protected under privacy legislation such as the General Data Protection Regulation (GDPR).

From a technical perspective, data in the data lake can only be organized in one hierarchy (unless you plan on doing data duplication). Because of the way permission inheritance works in data lakes, due care needs to be given to the initial structure of the data lake. While most companies solve this by breaking data up by source system, these divisions don't always work in the long term (after a project has become operational, for example).
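To see why the initial structure matters, here is a deliberately simplified sketch of hierarchical permissions, assuming a model where a grant on a folder applies to everything beneath it. The folder names, principals, and the `effective_permissions` helper are hypothetical; real data lakes differ in the details (some, for instance, copy default permissions onto new children at creation time rather than evaluating ancestors at access time).

```python
from pathlib import PurePosixPath

# Hypothetical grants: folder -> {principal: set of permissions}.
GRANTS = {
    "/raw": {"ops-team": {"read", "write"}},
    "/raw/crm": {"marketing-analysts": {"read"}},
}


def effective_permissions(path, principal, grants=GRANTS):
    """Union of permissions granted on `path` and on each of its ancestors."""
    node = PurePosixPath(path)
    perms = set()
    for folder in [node, *node.parents]:
        perms |= grants.get(str(folder), {}).get(principal, set())
    return perms


if __name__ == "__main__":
    # A grant made at /raw/crm reaches everything stored below it.
    print(effective_permissions("/raw/crm/contacts/pii", "marketing-analysts"))
    # A sibling source system is unaffected.
    print(effective_permissions("/raw/erp/ledger", "marketing-analysts"))
```

In this model, a read grant made on a source-system folder for one project silently covers any sensitive subfolder added later, which is one reason a hierarchy organized purely by source system can age badly.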

So, if we were to break down the tasks that the data governance persona needs to perform, it would look like the following:

  • Contribute to the initial design of the hierarchy (of folders) in the data lake
  • Create a metadata taxonomy to keep track of key attributes of data (as it is moved out of source systems)
  • Audit the metadata attributes
  • Audit who/what is using the data

Reading the above, you'd think that the data governance person shouldn't need direct access to the data to perform these operations, but the truth is, in the major cloud provider implementations, they need fairly privileged access. Even in the basic scenario, reading metadata about data, they would require read access to the storage blobs to perform the relevant API calls.
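The taxonomy and audit tasks in the list above can be sketched without touching the data itself, assuming the metadata has already been collected into a separate catalog. The attribute names, allowed values, and the `audit_metadata` helper below are hypothetical and not tied to any provider's API.

```python
# Hypothetical taxonomy: attribute -> allowed values (None means free text).
TAXONOMY = {
    "source_system": None,
    "classification": {"public", "internal", "personal"},
    "owner": None,
}


def audit_metadata(catalog, taxonomy=TAXONOMY):
    """Return {path: [problems]} for catalog entries that violate the taxonomy."""
    findings = {}
    for path, tags in catalog.items():
        problems = []
        for attribute, allowed in taxonomy.items():
            value = tags.get(attribute)
            if value is None:
                problems.append(f"missing {attribute}")
            elif allowed is not None and value not in allowed:
                problems.append(f"invalid {attribute}: {value}")
        if problems:
            findings[path] = problems
    return findings


if __name__ == "__main__":
    catalog = {
        "/raw/crm/contacts": {
            "source_system": "crm",
            "classification": "personal",
            "owner": "sales-ops",
        },
        "/raw/erp/ledger": {"source_system": "erp", "classification": "secret"},
    }
    print(audit_metadata(catalog))
```

The audit only needs the catalog. The friction described above comes from the fact that, in practice, building that catalog often requires read access to the underlying storage.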

Identity and Access (IAM) Administrator

Traditionally, identity and access administrators execute required permission changes on various targets. This role has typically been centralized so that onboarding/offboarding/auditing functions that surround access controls (a key regulatory requirement) are easier to accomplish. As data systems become more specialized, the responsibility for identity and access administration has typically fallen to super users in the target systems. The most common example of this would be database administrators (DBAs) that are responsible for running all grant statements on a SQL server.

From a technical perspective, performing these operations requires very privileged access to the data itself. In Azure, for example, one would need fairly direct access to the data (either using Azure RBAC or using POSIX-style access control lists) to perform access control operations. To achieve the granularity one would need, POSIX-style access control lists are likely the preferred approach, requiring full read/write access to the data in order to change them. In AWS the story is a little cleaner, requiring IAM administrators to create/assign appropriate policies.

If we break down the tasks that the IAM administrator needs to perform it would be:

  • Granting/Revoking access to data
  • Performing access reviews

As with the Data Governance role described above, the IAM Administrator doesn't technically need access to the data to perform their tasks, but because of the technical implementation they often have (or need) direct access.

Infrastructure Administrator

The infrastructure administrator, a role commonly called cloud operations (CloudOps), is typically responsible for the overall security of the resources deployed in the target cloud providers. They need access to the resources in order to securely configure them. Examples of tasks they may perform:

  • Set encryption options
  • Set backup options
  • Enforce network security

In the cloud platforms, there is a close tie between administering resources (provisioning and configuring them) and administering access to those resources. Most CloudOps teams that I have encountered have fairly privileged access to both the management plane and the data plane of any given resource.

So what happens in the real world?


Every organization is different, and has invested differently in how their cloud resources are governed and managed. While it's hard to make generalizations, I'll describe two models that I have seen widely employed when securing data lakes.

Method 1: Analytics Teams play all roles

In this method, the journey to establishing a data lake is generally just starting. Project teams tasked with performing data analytics use data lakes as a hub for data processing. Because of this, they are usually handed the keys to governance and manage the resources end-to-end, including security and access control. Typically, because they play all roles, permissions are set with a focus on data access for project teams. Little thought is given to the data governance aspect and to establishing good metadata practices. Data analytics teams have context only for the project scope they have been assigned, and nothing else. Access control decisions are driven from this viewpoint and typically don't scale well.

Method 2: Infrastructure Admins play all roles

In this method, the infrastructure admins are tasked with enforcing data security on the lake. They use their extensive permissions to perform access control operations, usually under the direction of someone else. The infrastructure admins typically have no context on what the data is (or how it can be used) and simply respond to project team requests as tickets pile up in queues. This leads very quickly to ad-hoc permissions being created (administrators do not always follow the same processes), which in turn leads to a system that is difficult to audit.

How does Daturic bring balance?


One of the core design goals behind Daturic was to simplify how access is granted to data. In simplifying, we want to ensure that overall security goals are still met while allowing the various personas to play their respective parts in the data security process. So, how does Daturic achieve that?

Data Governance teams are focused on building policies

Policies are the main mechanism for granting access to data in Daturic. Through our policy and tagging engine, data governance teams can create sets of policies that are fit for business use. These policies can be applied to any data elements in any registered data lake. The key part here is that we've removed the need for Data Governance teams to access the data directly to perform these operations. Using our system they can browse the data lake (structure only), associate relevant metadata tags (and audit those), and group them into policies.
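As a rough illustration of tag-based policies in general, the sketch below groups data elements under a policy by matching metadata tags. It is not Daturic's engine or API; the `Policy` class, the `elements_covered` helper, and the tag format are all hypothetical, and the only input is the structure of the lake plus its tags, never the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A named policy that covers every element carrying all required tags."""
    name: str
    required_tags: frozenset


def elements_covered(policy, tagged_elements):
    """Return the sorted paths whose tags include every tag the policy requires."""
    return sorted(
        path
        for path, tags in tagged_elements.items()
        if policy.required_tags <= set(tags)
    )


if __name__ == "__main__":
    # Hypothetical structure-only view of a lake: path -> metadata tags.
    tagged = {
        "/raw/crm/contacts": {"domain:sales", "classification:personal"},
        "/raw/crm/orders": {"domain:sales", "classification:internal"},
        "/raw/erp/ledger": {"domain:finance", "classification:internal"},
    }
    policy = Policy(
        "sales-internal",
        frozenset({"domain:sales", "classification:internal"}),
    )
    print(elements_covered(policy, tagged))
```

Because the policy is expressed in terms of tags rather than folders, it keeps working when data is reorganized, as long as the tags move with it.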

Identity and Access (IAM) Administrators make use of existing/known toolsets

Policies are crystallized into existing IAM systems, making it easy for IAM Administrators to continue to execute access control requests. They can ensure appropriate approval for access requests, track that access in third-party tools such as ServiceNow, and perform routine operations. In the case of Azure, for example, all access control operations are done through Azure Active Directory group membership functions. IAM administrators don't need access to the data, and they don't control policy creation, which keeps an appropriate separation of duties among the data security personas.
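The idea of expressing access purely as group membership can be sketched as follows. This is a simplified in-memory model, not a call into Azure Active Directory or any other directory service; the group names and the `grant`, `revoke`, and `access_review` helpers are hypothetical.

```python
# Hypothetical mapping from policy name to the directory group that carries it.
POLICY_GROUPS = {"sales-internal": "grp-dl-sales-internal"}


def grant(memberships, user, policy, policy_groups=POLICY_GROUPS):
    """Return a new membership map with `user` added to the policy's group."""
    group = policy_groups[policy]
    updated = {g: set(members) for g, members in memberships.items()}
    updated.setdefault(group, set()).add(user)
    return updated


def revoke(memberships, user, policy, policy_groups=POLICY_GROUPS):
    """Return a new membership map with `user` removed from the policy's group."""
    group = policy_groups[policy]
    updated = {g: set(members) for g, members in memberships.items()}
    updated.get(group, set()).discard(user)
    return updated


def access_review(memberships, policy, policy_groups=POLICY_GROUPS):
    """List who currently holds access through the policy's group."""
    return sorted(memberships.get(policy_groups[policy], set()))


if __name__ == "__main__":
    state = grant({}, "dana", "sales-internal")
    state = grant(state, "erin", "sales-internal")
    state = revoke(state, "dana", "sales-internal")
    print(access_review(state, "sales-internal"))
```

Granting, revoking, and reviewing access all become operations on group membership, so none of them require permissions on the data lake itself.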

Infrastructure administrators are off the hook

Infrastructure administrators help to set up Daturic as an application and configure it with appropriate access levels to the data lakes that have been approved for onboarding to the system. In Azure, they do this via standard Azure Role-Based Access Control (RBAC) tools. Once configured, infrastructure administrators can turn the tool over to the other groups. Infrastructure administrators don't need access to the data in order to ensure that data access is being granted in an approved, repeatable way.

In conclusion

A Daturic data lake brings balance, from a security perspective, to the process of governing data. In a Daturic data lake, automation is used to ensure that appropriate separation of duties is enforced across the data lifecycle, with each persona using toolsets they are already familiar with to accomplish their goals.

Contact us to learn how Daturic can help you.