What is a data lake?
A data lake is a central data repository where structured, semistructured, and unstructured data can be stored at any scale, usually as blobs or files. Unlike a data warehouse, the schema is not known at the time of storage. Because data lakes are built on cheap storage, they are an excellent choice for high-volume data such as server logs, clickstream output, and IoT event logs.
These are typical properties of a data lake:
- Data lakes store everything; nothing is turned down on arrival.
- Schema-on-read: all data is stored in raw form and transformed only when it is needed.
- Data lakes are extremely adaptable and change-tolerant.
- Tooling flexibility: the data can be accessed with a wide range of tools.
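The schema-on-read property above can be sketched in a few lines of Python. The event records and the `read_clickstream` helper are hypothetical; the point is that the lake keeps every raw line as written, and a schema is imposed only when a consumer reads the data.

```python
import json

# Raw events land in the lake exactly as produced: no schema is enforced on write.
# Note that the records do not all share the same shape.
raw_events = [
    '{"ts": "2024-01-01T00:00:00Z", "user": "a1", "page": "/home"}',
    '{"ts": "2024-01-01T00:00:05Z", "user": "a1", "page": "/pricing", "referrer": "/home"}',
    '{"ts": "2024-01-01T00:00:09Z", "user": "b2", "event": "heartbeat"}',
]

def read_clickstream(lines):
    """Schema-on-read: keep only records that match the page-view shape."""
    for line in lines:
        record = json.loads(line)
        if "page" in record:
            yield {"ts": record["ts"], "user": record["user"], "page": record["page"]}

page_views = list(read_clickstream(raw_events))
```

A different consumer could read the same raw lines with a different schema (say, extracting heartbeats instead of page views) without any change to how the data was stored.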
While structured data is well suited to operational analytics and dashboarding, schemaless storage such as a data lake lets data scientists and ML engineers explore data from a variety of sources when developing predictive models.
Data lake infrastructure
Popular open-source frameworks such as Apache Hadoop come with an established ecosystem that serves many use cases. However, many organizations have moved their data infrastructure to the cloud and use object storage services such as Google Cloud Storage or Amazon S3 as a data lake.
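When object storage is used as a lake, raw data is typically organized by source and date so that downstream jobs can scan only the partitions they need. The layout below is a minimal sketch of one common convention (Hive-style `key=value` path segments); the `raw/` prefix and source name are hypothetical, not a fixed standard.

```python
from datetime import datetime, timezone

def lake_key(source: str, event_time: datetime, filename: str) -> str:
    """Build a date-partitioned object key for raw data in a cloud storage lake."""
    return (
        f"raw/{source}/"
        f"year={event_time.year}/month={event_time:%m}/day={event_time:%d}/"
        f"{filename}"
    )

key = lake_key("clickstream", datetime(2024, 3, 7, tzinfo=timezone.utc), "events-0001.json.gz")
```

With this layout, a query engine asked for a single day of clickstream data can list one prefix instead of scanning the whole bucket.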
In most setups, a data lake works in tandem with a data warehouse: the lake is the intermediate stage that receives and stores all the data before it is cleaned, transformed, and loaded into the warehouse.
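The clean-transform-load step can be sketched as follows. The raw event lines are hypothetical, and an in-memory SQLite database stands in for the warehouse; a real pipeline would target a system like BigQuery or Redshift, but the shape of the work is the same: parse raw lake records, drop those that do not fit the target schema, and bulk-insert the rest.

```python
import json
import sqlite3

# Hypothetical raw records pulled from the lake; one does not fit the schema.
raw_lines = [
    '{"ts": "2024-01-01T00:00:00Z", "user": "a1", "page": "/home"}',
    '{"ts": "2024-01-01T00:00:05Z", "user": "a1", "page": "/pricing"}',
    '{"ts": "2024-01-01T00:00:09Z", "user": "b2", "event": "heartbeat"}',
]

# In-memory SQLite stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (ts TEXT, user TEXT, page TEXT)")

# Clean: parse each raw line and keep only records matching the table schema.
rows = []
for line in raw_lines:
    record = json.loads(line)
    if "page" in record:
        rows.append((record["ts"], record["user"], record["page"]))

# Load: bulk-insert the transformed rows into the warehouse table.
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
count = conn.execute("SELECT COUNT(*) FROM page_views").fetchone()[0]
```

The raw lines stay untouched in the lake, so the transformation can be rerun with a revised schema at any time.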