Why we built Daturic
It's hard to believe that it has been over 15 years since Clive Humby coined the phrase "Data is the new oil". The analogy was later expanded to illustrate that, by itself, oil doesn't do much: it needs to be refined into something useful, such as gas or plastics. In much the same way, data needs to undergo a similar transformation, the ultimate goal being that it can be used to derive business value.
It turns out that we've learned quite a bit about using data since that quote was coined. In 2019, Samuel Flender suggested that, unlike oil, "it is actually far from clear how exactly to turn that data into profits". Because of this, companies first need to handle data (i.e., make it available to data scientists), and then they need to "data science" it. Typically, data scientists want to make use of the latest tools and techniques to hopefully derive value from the data at hand. Since the nature of this work is exploratory, most of the tools being used are not COTS, but rather hand-crafted code. Lastly, data scientists almost always insist on having access to production data. As security practitioners, we recognize that many of the processes data scientists follow carry a significant amount of risk. In the article quoted above, the author goes on to say "The process of working with data is messy, requires careful planning, engineering, and research, and contains a lot of unknowns and pitfalls." No kidding.
Fast forward to the present day, and it seems like most advanced analytics teams follow a common set of patterns. These patterns almost always centre on the use of a data lake to store large amounts of data. Connectors are then used from tooling (such as Azure Databricks) to allow both user-driven and automated processes to crunch the data into, hopefully, something meaningful. From a security perspective, then, it makes the most sense to start with security on the data lake itself.
The anatomy of a data lake
Under the hood, data lakes are pretty simple structures. Typically, from a cloud provider perspective, they are built on top of blob storage with the addition of a virtualization layer that adds a couple of core capabilities. The first is compatibility with existing APIs, such as HDFS, to support the ecosystem of tools that are already built for that use case. The second is the addition of a layer that makes directory operations first-party. This is to support the big data use cases where an entire directory of files is operated on to support the required processing. When big data engines reference the use of tables, they are typically just referring to a directory on a data lake.
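To make the "a table is just a directory" point concrete, here is a minimal sketch (the table name and file layout are illustrative, not tied to any particular engine). A "table" is a folder, its rows live in part files, and "reading the table" is a directory listing:

```python
from pathlib import Path
import tempfile

# Illustrative layout: a data lake "table" is simply a directory of part files.
root = Path(tempfile.mkdtemp())
table = root / "sales"  # the "table" is just this folder
table.mkdir()
for i in range(3):
    # Placeholder part files standing in for the actual columnar data.
    (table / f"part-{i:05d}.parquet").write_bytes(b"")

# A directory-level operation ("read the table") is a listing of the folder.
parts = sorted(p.name for p in table.iterdir())
print(parts)  # ['part-00000.parquet', 'part-00001.parquet', 'part-00002.parquet']
```

This is why directory operations matter so much: renaming, deleting, or securing a table means renaming, deleting, or securing a folder.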
From a big data processing standpoint, most workflows adopt the multi-hop (or medallion) approach. That is to say, tables are organized in layers that correspond to the different quality levels of the data contained in them. Raw (or Bronze) is typically where the results of data ingestion processes are stored. Refined (or Silver) represents the transformed raw data, which is generally standardized into a common format for further processing. The last layer, Feature (or Gold), houses data that is ready for consumption in other reporting or analytics tools.
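A toy sketch of the three hops, assuming a hypothetical `events` table and JSON files in place of a real columnar format, might look like this: raw records land in Bronze as-is, Silver standardizes names and types, and Gold holds an aggregate ready for reporting.

```python
from pathlib import Path
import json, tempfile

lake = Path(tempfile.mkdtemp())
bronze = lake / "bronze" / "events"       # raw ingestion output
silver = lake / "silver" / "events"       # standardized records
gold = lake / "gold" / "daily_totals"     # consumption-ready aggregate
for d in (bronze, silver, gold):
    d.mkdir(parents=True)

# Bronze: stored exactly as ingested (inconsistent names, everything a string).
(bronze / "batch-001.json").write_text(json.dumps(
    [{"TS": "2024-01-01", "Amt": "10"}, {"TS": "2024-01-01", "Amt": "5"}]))

# Silver: a common schema with proper types.
raw = json.loads((bronze / "batch-001.json").read_text())
clean = [{"date": r["TS"], "amount": int(r["Amt"])} for r in raw]
(silver / "batch-001.json").write_text(json.dumps(clean))

# Gold: aggregated for downstream reporting tools.
totals = {}
for r in clean:
    totals[r["date"]] = totals.get(r["date"], 0) + r["amount"]
(gold / "summary.json").write_text(json.dumps(totals))

print(totals)  # {'2024-01-01': 15}
```

Note that each layer is again just a folder, which is exactly what makes the security discussion below so important.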
What about security?
There are likely many security controls one could suggest to help deal with data lakes across the lines of people/process/technology. In the case of data lakes, however, they would ultimately all boil down to the fact that tables are represented as folders in the data lake itself, and those folders carry some type of access control list that governs who can access the data and who cannot.
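To illustrate, here is a deliberately simplified, hypothetical ACL model (the group names and the rule that access requires a matching entry on every folder along the path are assumptions for the sketch, loosely echoing how POSIX-style folder ACLs behave):

```python
# Hypothetical ACLs: each folder lists the groups allowed to read it.
acls = {
    "/": {"everyone"},
    "/silver": {"data-engineers", "data-scientists"},
    "/silver/events": {"data-engineers"},
}

def can_read(groups, path):
    """Walk every folder from the root down; deny if any folder lacks a match."""
    parts = [p for p in path.strip("/").split("/") if p]
    prefixes = ["/"] + ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
    for prefix in prefixes:
        allowed = acls.get(prefix, set())
        if not (allowed & (groups | {"everyone"})):
            return False
    return True

print(can_read({"data-engineers"}, "/silver/events"))   # True
print(can_read({"data-scientists"}, "/silver/events"))  # False
```

Even this tiny model shows where the pain comes from: every table means another folder, another ACL entry, and another chance for the lists to drift out of sync with who should actually have access.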
If you are an IT professional reading this, you'll quickly realize that you've seen this pattern before. This is the basis for almost all "shared drive" implementations at any large company, anywhere. When you think fondly about your current shared drive implementation, it is likely that the terms "manageable", "understood" and "secure" don't really come to mind.
When you read about processes involving data and data lakes, they all talk about the data "flowing like water" to its destination. This is ultimately the goal, but the question becomes: how do we actually do that, in a secured manner, and have it not turn into the next shared drive?
This is the reason why we built Daturic. A Daturic data lake:
- Puts security first
- Is cloud native
- Leverages automation
- Helps your administrators
The features and functionality that went into Daturic are all there to increase the overall security of your data lake. This includes the implementation of technical controls, such as managing permissions, but also other controls, such as audits, to assist with your overall security health.
All of our processes and tools are designed using cloud native tools. There are no proxies in our solution. As such, it scales automatically with how your data scales.
From a lifecycle perspective, building/designing your data lake is only one part of the puzzle. Daturic helps to codify "Day 2" operational processes which allow for the secure use of your data lake from an operational standpoint.
Your administrators are your line of defence when securing your data lake. They are the ones that ultimately implement your data security policies. Our intent is to equip them with the information they need to make appropriate access control decisions, minimize mistakes by enforcing best practices, and increase the auditability of your data usage.