The Utopia of Data Lake Permissions

By Shamir Charania on May 24, 2021

It is a telling sign of the times that one of the new and emerging metrics in the cyber security space is the number of records breached by calendar quarter. According to the Risk Based Report, there were a total of 36 billion records exposed by the end of Q3 2020. That is an astronomical number of records, and far exceeds the number of records breached in 2019. The report goes on to suggest that the largest source of exposed records is misconfigured databases, and that this is typically attributed to human error.

While the report specifically calls out databases, it is important to take a step back and look at the broader picture of the data lifecycle. Companies looking to take advantage of their data are starting with a data lake approach. The combination of cheap storage prices, optimized storage formats (such as Delta), and the ability to perform distributed processing (read: Spark) makes data lakes an ideal component to use. The first step, of course, is to make full extracts of source data (many of which might come from these misconfigured databases) available in a consumable format. Many follow the medallion pattern, continuously refining data as it becomes more end-user consumable.

The other consideration here is how one actually comes to deliver some type of insight on the data that they do have access to. As an outsider looking in on the practices of data scientists, it seems that results are more “stumbled upon” than pre-planned. This type of exploration work is, by its very nature, unknowable up front. That is the sole reason why you are crunching the data to begin with. To further complicate the matter, the expertise to understand whether results are valuable or not is typically located in the business, and not in IT. This means that traditional approaches to governance, from an IT perspective anyway, are not going to work here. Business users want to move fast, and want access to the data they need when they need it. But we need to balance this with security and governance. An insecure data lake no longer exposes one database, but several, often updated in real time and reachable at the network speeds of the cloud.

So, with these added personas to contend with, what does the utopia of data lake permissions look like? Here are some thoughts:

Data lake permissions must be granular

The structure of the data lake should support granting permissions to only the data required, when it is required. Consideration should also be given to how much data can actually be accessed. Not all exploratory work requires access to all the historical data, for example.
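
As a rough illustration of this idea, the sketch below scopes a read grant to a single path prefix and a limited window of history. The `Grant` class, the `can_read` function, and the example paths are all hypothetical names chosen for illustration; a real data lake would enforce this through its own access control mechanism.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Grant:
    """Hypothetical read grant: one principal, one path prefix, limited history."""
    principal: str
    path_prefix: str       # assumed layout, e.g. "silver/sales/"
    max_history_days: int  # how far back the principal may read


def can_read(grant: Grant, principal: str, path: str,
             partition_date: date, today: date) -> bool:
    """Allow a read only for the granted principal, prefix, and date window."""
    if principal != grant.principal:
        return False
    if not path.startswith(grant.path_prefix):
        return False
    oldest_allowed = today - timedelta(days=grant.max_history_days)
    return oldest_allowed <= partition_date <= today


# Example: an analyst may read the last 90 days of one dataset, nothing else.
grant = Grant("analyst", "silver/sales/", max_history_days=90)
today = date(2021, 5, 24)
recent_ok = can_read(grant, "analyst", "silver/sales/orders", date(2021, 5, 1), today)
old_denied = can_read(grant, "analyst", "silver/sales/orders", date(2020, 1, 1), today)
```

Tying the history window to the grant itself, rather than to the dataset, means two analysts can hold different windows over the same data.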

Data lake permissions must be phased in

Exploratory work ultimately requires access to the raw data, but the first step is to provide access to metadata, so that analysts and data scientists can assess whether they need the underlying data at all.
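
One way to picture a phased approach is as ordered access levels, where the metadata phase reveals only the shape of a dataset. This is a minimal sketch under that assumption; `AccessPhase` and `describe_dataset` are hypothetical names, and an in-memory list of rows stands in for real lake storage.

```python
from enum import IntEnum


class AccessPhase(IntEnum):
    """Hypothetical ordered access phases for a single dataset."""
    NONE = 0
    METADATA = 1  # schema and row count only
    RAW = 2       # full records


def describe_dataset(phase: AccessPhase, schema: dict, rows: list) -> dict:
    """Return only what the caller's current phase entitles them to see."""
    if phase >= AccessPhase.RAW:
        return {"schema": schema, "row_count": len(rows), "rows": rows}
    if phase >= AccessPhase.METADATA:
        return {"schema": schema, "row_count": len(rows)}
    raise PermissionError("no access to this dataset")


schema = {"customer_id": "int", "email": "string"}
rows = [{"customer_id": 1, "email": "a@example.com"}]

# At the metadata phase the analyst can judge relevance without seeing records.
preview = describe_dataset(AccessPhase.METADATA, schema, rows)
```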

Data lake permissions must be just-in-time

Many personas that access data only need that access for a small amount of time. This even holds true for service accounts that perform the actual data moves.
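
A just-in-time grant can be thought of as a permission with an expiry attached at creation. The sketch below assumes that model; `TimedGrant`, `grant_access`, and `is_active` are hypothetical names, and the service account and path are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimedGrant:
    """Hypothetical grant that stops being valid at a fixed time."""
    principal: str
    path_prefix: str
    expires_at: datetime


def grant_access(principal: str, path_prefix: str,
                 ttl: timedelta, now: datetime) -> TimedGrant:
    """Create a grant that expires `ttl` after `now`."""
    return TimedGrant(principal, path_prefix, now + ttl)


def is_active(grant: TimedGrant, now: datetime) -> bool:
    """A grant is usable only strictly before its expiry."""
    return now < grant.expires_at


# Example: a service account gets one hour to run a data move.
start = datetime(2021, 5, 24, 9, 0, tzinfo=timezone.utc)
job_grant = grant_access("svc-ingest", "bronze/crm/", timedelta(hours=1), start)
during_job = is_active(job_grant, start + timedelta(minutes=30))
after_job = is_active(job_grant, start + timedelta(hours=2))
```

Because the expiry is set when the grant is issued, nobody has to remember to revoke it later.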

Data lake permissions must be provisioned quickly

When adopting a just-in-time strategy, one needs a way of quickly granting that access.