The “data lake” has become a major buzzword in the world of big data. Data advocates initially pitched the idea of a data lake as a solution for all unstructured data, an alternative to the restrictions of a data warehouse. But recently, the concept of the data lake has begun to earn more detractors than supporters, leaving many questioning the validity and benefits of the data lake.
The fact is that a data lake can be very helpful, so long as it is used correctly. So we’re taking a look at the benefits and dangers of using data lakes to store our large datasets, and how businesses can intelligently utilize them.
What exactly is a data lake?
A data lake is an enormous, easily accessible, centralized repository that stores structured and unstructured data from source systems. Data is not classified when it enters the data lake, so up-front data preparation is skipped; only when data is accessed is it classified, organized or analyzed.
Data has traditionally been stored in data warehouses, where it is pulled from multiple sources, transformed and structured, and defined by very specific parameters. Data warehouses are useful because the data within them is regulated and trustworthy. However, 80% of data doesn’t fit this model, because it is semi-structured or completely unstructured: it doesn’t have a pre-defined data model, or is not organized in a pre-defined way. So instead of requiring all of this data to be integrated into one model before it can be stored, data lakes allow it to stay as is, to be dealt with at another time.
Why are data lakes useful?
Data lakes are able to store data in its native format, unintegrated at the point of entry. The data requirements and schema are not defined until the data is queried for analysis, so they can be prepared for specific analytical uses as needed, which is more flexible and efficient. In the data warehouse format, by contrast, the schema is established when the data is entered, which limits data analysis to one particular use.
Another benefit of data lakes is that they eliminate information silos. Instead of storing multitudes of independently managed datasets, the data lake collects all sources of data into one place. This consolidation encourages data sharing and increases the information available. Not only does this cut down on costs, but it also allows for increased insights due to the information sharing.
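To make the idea of defining a schema only at query time concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of any real data lake product: the sample records, the field names, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
# Nothing checks or reshapes them on the way in.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "country": "PT"}',
    '{"user": "bo", "amount": "5", "device": "mobile"}',
    '{"user": "cy", "note": "free-text comment, no amount"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    `schema` maps each wanted field to a casting function. Fields that a
    record does not contain come back as None; fields not named in the
    schema are ignored for this particular read.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field_name, cast in schema.items():
            value = record.get(field_name)
            row[field_name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two different analyses read the same raw data with two different schemas.
revenue_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
device_view = read_with_schema(RAW_EVENTS, {"user": str, "device": str})
```

The same stored records serve both views because the choice of fields and types is made by the reader, not at the point of entry. A warehouse-style load would instead have to settle on one set of columns before storing anything.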
The issue with data lakes
Data research group Gartner made waves by publishing a press release claiming we should ‘Beware of the Data Lake Fallacy’, warning analysts to steer clear of the hype surrounding the data lake. Gartner challenged the assumption that enterprises are highly skilled at data manipulation and analysis, skills the data lake demands because its data lacks semantic consistency and governed metadata and needs appropriate processing. Without oversight, IT professionals have no need to spend time understanding the data they possess; they can simply dump it in the data lake. The danger is that the lake eventually becomes a collection of disconnected data pools, unorganized, in one place.
Another major risk in data lakes is the inability of analysts to determine data quality, because no thorough check has taken place. There is also no way to use insights from others who have worked with the data, as there is no record of the lineage of findings by previous analysts. Finally, one of the biggest risks of data lakes is security and access control. Data can be placed into a lake without any oversight, and some of that data may be subject to privacy and regulatory requirements that other data is not.
The bottom line is that a data lake can be very useful, and can make your data analysis more efficient and specialized. On the other hand, if your data lake is unregulated and unsupervised by trusted IT professionals, you run the risk of creating a mess: a lake filled with data from unrecognized sources and of differing levels of data quality, a swamp where once there was a lake. So if your company is choosing the data lake route, keep those waters clean.
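One way to picture the oversight these risks call for is a small catalog that every dataset must pass through on its way into the lake. The sketch below is only an illustration under stated assumptions: the `Catalog` class, its method names, and the two sensitivity labels are hypothetical, not a real tool or a standard.

```python
from dataclasses import dataclass, field

# Hypothetical sensitivity labels, assumed for this sketch only.
SENSITIVITY_LEVELS = ("public", "restricted")


@dataclass
class CatalogEntry:
    name: str
    source: str
    sensitivity: str
    lineage: list = field(default_factory=list)


class Catalog:
    """Tracks where each dataset came from, who worked on it, and who may read it."""

    def __init__(self):
        self._entries = {}

    def register(self, name, source, sensitivity):
        # Refuse data that arrives without a known source or sensitivity label.
        if not source:
            raise ValueError("a dataset needs a known source")
        if sensitivity not in SENSITIVITY_LEVELS:
            raise ValueError(f"unknown sensitivity: {sensitivity!r}")
        self._entries[name] = CatalogEntry(name, source, sensitivity)

    def record_finding(self, name, analyst, note):
        # Keep an account of earlier work so later analysts can build on it.
        self._entries[name].lineage.append((analyst, note))

    def lineage(self, name):
        return list(self._entries[name].lineage)

    def can_read(self, name, clearance):
        # Restricted data is readable only with restricted clearance.
        entry = self._entries[name]
        return entry.sensitivity == "public" or clearance == "restricted"


catalog = Catalog()
catalog.register("web_logs", source="cdn-export", sensitivity="public")
catalog.register("patient_notes", source="clinic-db", sensitivity="restricted")
catalog.record_finding("web_logs", "dana", "removed bot traffic before counting visits")
```

With entries like these, an analyst can look up a dataset's source and previous findings before trusting it, and a request for `patient_notes` from someone with only public clearance is turned away.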