With the rise of Big Data, many believe that the traditional approach to storage, the structured data warehouse, is no longer a good fit. This has paved the way for a shift toward the data lake architecture. A data lake is a large storage repository that can hold vast quantities of data (much of it unstructured) in its raw format for future analysis.
The most commonly cited description of a data lake comes from James Dixon, CTO of Pentaho, in 2010:
“If you think of a data mart as a store of bottled water - cleansed and packaged and structured for easy consumption - the data lake is a large body of water in a more natural state. The contents of a data lake stream in from a source to fill the lake and various users can come to examine, dive in, or take samples.”
Data lakes vs Data warehouses
What data lakes and data warehouses have in common is that both are data storage mechanisms. The key difference between the two is that a data lake stores all kinds of data, while a data warehouse stores structured data. A data lake is not a replacement for a data warehouse; rather, the two are complementary. To grasp the concept of a data lake, it is easiest to compare the two storage mechanisms along several dimensions.
The nature of data
As mentioned previously, a warehouse only stores data that has been modeled, structured, or aggregated, whereas a data lake lets you store all kinds of data (structured, semi-structured, and unstructured) in its native, raw format.
Before data can be loaded into the warehouse, you first need to structure it, often by modeling it into a star or snowflake schema. This approach is known as schema-on-write and is typical of relational (SQL) databases. With a data lake, you don’t have to process the data beforehand; it can be loaded as-is. You give the data shape or structure only when you are ready to use it. This approach is known as schema-on-read and is often associated with NoSQL and Hadoop-style systems.
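To make the contrast concrete, the sketch below stores one record by enforcing its structure before loading, and stores others as raw text that is only given a shape when it is read. It is a minimal illustration in plain Python, not a description of how any particular warehouse or lake product works. The names ORDERS_SCHEMA, load_into_warehouse, load_into_lake, and read_from_lake are hypothetical and chosen only for this example.

```python
import json

# Hypothetical warehouse table definition: column name -> type.
ORDERS_SCHEMA = {"order_id": int, "amount": float}


def load_into_warehouse(table, record):
    """Enforce the structure before the record is stored."""
    row = {}
    for column, column_type in ORDERS_SCHEMA.items():
        if column not in record:
            raise ValueError(f"missing column: {column}")
        row[column] = column_type(record[column])
    table.append(row)


def load_into_lake(lake, raw_line):
    """Store the raw text as-is; nothing is checked at load time."""
    lake.append(raw_line)


def read_from_lake(lake, fields):
    """Apply a structure only at the moment the data is read."""
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # this reader skips lines it cannot interpret
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


warehouse_table = []
load_into_warehouse(warehouse_table, {"order_id": "1", "amount": "9.50"})

lake = []
load_into_lake(lake, '{"order_id": 2, "amount": 3.0, "note": "gift"}')
load_into_lake(lake, "free-text log line, not JSON")

# The reader decides which fields matter, long after the data was stored.
rows = list(read_from_lake(lake, ["order_id", "note"]))
```

Note that the lake accepted both lines without complaint, including the one that is not JSON. The cost of that freedom is paid at read time, when each consumer has to decide how to interpret what is there.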
However, a challenge with a data lake is that you can lose oversight of its contents. To prevent this, you need defined mechanisms to catalog all of your data; otherwise, the lake can turn into a “data swamp”.
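A cataloging mechanism of the kind described above could look something like the following. This is a minimal in-memory sketch, assuming each object in the lake is identified by a path; real deployments would normally rely on a dedicated catalog service. The names CatalogEntry and DataCatalog, the metadata fields, and the example paths are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Descriptive metadata for one object stored in the lake."""
    path: str
    source: str
    data_format: str
    owner: str
    ingested_on: date
    tags: set = field(default_factory=set)


class DataCatalog:
    """In-memory index of what the lake contains (illustrative only)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.path] = entry

    def find_by_tag(self, tag):
        return sorted(e.path for e in self._entries.values() if tag in e.tags)

    def uncataloged(self, lake_paths):
        """Paths present in the lake that nobody has described yet."""
        return sorted(set(lake_paths) - set(self._entries))


catalog = DataCatalog()
catalog.register(CatalogEntry(
    path="raw/clicks/2024-01-01.json",
    source="web-frontend",
    data_format="json",
    owner="analytics-team",
    ingested_on=date(2024, 1, 2),
    tags={"clickstream", "raw"},
))

lake_paths = ["raw/clicks/2024-01-01.json", "raw/tmp/dump.bin"]
orphans = catalog.uncataloged(lake_paths)
```

Here orphans contains raw/tmp/dump.bin, the one object with no catalog entry. Undescribed objects like this are the ones that accumulate into a swamp, so checking for them regularly is one simple way to keep oversight.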
Because a data warehouse holds structured data, and many techniques have been developed over the years to make retrieval easy, retrieving data from a warehouse is very fast. With an enterprise data lake, however, retrieval is a more time-consuming process, because the raw data has to be given structure when it is read.
Cost of storage
Data lakes are generally much cheaper to build than data warehouses. This is because Big Data technologies such as Hadoop are designed to run on low-cost commodity hardware, and cloud platforms such as Amazon Web Services offer relatively inexpensive storage. A warehouse can take a long time to build from scratch, which makes it very expensive in the end.
A data warehouse is a highly structured repository. While it is not technically hard to change the structure of a data warehouse, doing so can be very time-consuming because of all the business processes that are tied to it. A data lake, on the other hand, lacks this rigid structure and therefore lets users easily configure and reconfigure their models, queries, and apps on the fly.
The technologies underlying data warehouses have been around for 20 to 30 years and are therefore very mature. The first data lakes, however, started to emerge only around 10 years ago. They are still in the innovation phase and need to mature considerably before they can become the new mainstream data storage technology.
Because data warehouses are mature, their security technologies are well developed. Since data lakes first emerged, there has been an emphasis on securing them. As large and growing volumes of diverse data are channeled into the data lake, it will come to store vital and often highly sensitive business data.
In addition, multiple business units can access the data lake freely and refine, explore, and enrich its data using methods of their choosing, which further increases the risk of a breach.
So the question is: can you secure your Big Data in a lake? Securing a data lake takes some steps, but the important point is that, with recent technologies, it is possible to have a big data lake that is secure.
Today, data lakes are used mainly by data scientists. They are often difficult to navigate for those unfamiliar with unprocessed data. Raw, unstructured data usually requires a data scientist and specialized tools to understand it and translate it for any specific business use.
A data warehouse, by contrast, holds structured data that is easy for business professionals to navigate. Processed data, like that stored in data warehouses, only requires that the user be familiar with the topic represented.
In the end, you do not have to choose between a data warehouse and a data lake. The two are complementary and will both offer value to your business, in different ways. However, if you have not started using a data lake yet, it is highly recommended that you do so. With recent innovations that allow you to capture more unstructured data, a central place where you can store your data in any form is a high-value asset for your company.
Overall, data warehouses will retain their value, but when used correctly, a data lake can deliver new and fresh insights into data that you already have.