AWS Partner Network (APN) Blog
Data Lake House: An Open Alternative to a Data Lake and Data Warehouse for Big Data Applications
By Dan Marks, VP Business – Mactores Cognition
By Kiran Randhi, Sr. Partner Management Solutions Architect – AWS
Driven by big data and cost-effective cloud storage, data lakes emerged as an alternative to traditional data warehouse systems, promising to capture the value of large and complex data assets at scale.
In practice, however, data lakes quickly fill with large volumes of data in a variety of formats and structures. Depending on the data orchestration and data movement strategies in place, applying consistent data governance and quality controls to a data lake becomes challenging.
Similarly, the schema-on-write construct used to build rigid database models with traditional data warehouse systems also fails to meet the needs of modern big data applications. The vast volume, velocity, and variety of big data make it challenging to model all of the information assets in a unified and structured format.
Without an adequate data governance framework, data quality remains elusive, especially when data is managed and retained in silos and organizations struggle to achieve a holistic, enterprise-wide view of their big data assets.
In this post, we will discuss how data lake house technology helps overcome the limitations of data lake and data warehouse systems. We'll also look at the architectural characteristics of the data lake house and how they help users optimize data orchestration workflows to maximize performance efficiency and enforce strict governance controls.
Mactores Cognition is an AWS Advanced Tier Services Partner with the Data and Analytics Competency that is focused on delivering results-oriented business outcomes.
What is a Data Lake House?
The term “data lake house” refers to a new architecture pattern that emerged as an alternative to traditional data warehouse and data lake technologies, promising an optimal tradeoff between the two approaches to store big data.
A data lake house is primarily based on open and direct-access file formats such as Apache Parquet, supports advanced artificial intelligence (AI) and machine learning (ML) applications, and is designed to overcome the challenges associated with traditional big data storage platforms.
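To make "open and direct-access file formats" concrete, here is a minimal Python sketch using PyArrow to write and read an Apache Parquet file. The file name and column names are illustrative, not from any specific deployment; the point is that any Parquet-aware engine (Apache Spark, Trino, Amazon Athena, pandas, and so on) can read the same files directly, without a proprietary warehouse layer in between.

```python
# A minimal sketch of the open, direct-access storage model. File name and
# column names are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to an open columnar format (Parquet).
events = pa.table({
    "event_id": [1, 2, 3],
    "user_id": ["a17", "b42", "a17"],
    "amount": [19.99, 5.00, 42.50],
})
pq.write_table(events, "events.parquet")

# Any Parquet-aware engine can now read the same file directly, including
# reading only the columns an analytics query needs.
amounts_only = pq.read_table("events.parquet", columns=["amount"])
print(amounts_only.to_pandas())
```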
The following limitations of data warehouses and data lakes have driven the need for an open architectural pattern that takes the data structure, management, and quality features of data warehouse systems and brings them to the low-cost cloud storage model employed by data lakes.
Lack of Consistency
Data lake and data warehouse systems require significant cost and effort to maintain data consistency, especially when sourcing large streams of data from a variety of sources. For instance, a failure at a single extract, transform, load (ETL) step is likely to introduce data quality issues that cascade across the big data pipeline.
Slow Updates and Data Staleness
Storage platforms built on data warehouse architecture suffer from data staleness when frequently updated data takes days to load. As a result, IT is forced to use (relatively) outdated data to make decisions in real time, in a market landscape where proactive decision-making and the agility to act are key competitive differentiators.
Data Warehouse Information Loss
Data warehouse systems maintain a repository of structured databases, but the strong focus on data homogenization negatively affects data quality. Converging all data sources into a single unified format often results in the loss of valuable data due to incompatibility, lack of integration, and the dynamic nature of big data.
Data Lake Governance and Compliance Challenges
Data lake technology aims to address these challenges by transitioning to the schema-on-read model, which allows you to maintain a repository of raw, heterogeneous data in multiple formats without a strict structure. While it can be argued that the stored information is mostly static in nature, any external demand to update or delete specific data records is a challenge for data lake environments.
There's no easy way to change reference data, or to index and update a data record within the data lake, without first scanning the entire repository, even though such updates and deletions may be required by compliance regulations such as the CCPA and GDPR.
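As an illustration of that limitation, the following PySpark sketch shows what a right-to-erasure request can look like on a plain Parquet data lake without a transactional table layer. The S3 paths, the user_id column, and the user identifier are hypothetical.

```python
# A minimal sketch, assuming a plain Parquet data lake on Amazon S3 with no
# transactional table layer. Erasing one user's records means scanning,
# filtering, and rewriting the affected data. Paths and columns are
# illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("erasure-request-sketch").getOrCreate()

# Scan the raw data to find every record belonging to the user.
events = spark.read.parquet("s3://example-data-lake/events/")

# There is no row-level delete, so "deleting" means rewriting everything
# except the records to be erased.
retained = events.filter(events.user_id != "user-123")
retained.write.mode("overwrite").parquet("s3://example-data-lake/events_rewrite/")

# Swapping the rewritten data back into place atomically, without breaking
# concurrent readers, is left to the operator. That manual bookkeeping is
# exactly what a data lake house table format removes.
```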
Data lake house architecture offers key capabilities to address the limitations associated with data lakes and traditional data warehouse systems:
- Schema enforcement and evolution: Allows users to control how table structures evolve and to prevent data quality issues as schemas change.
- Ready for structured and unstructured data: Enables a wider choice of data management strategies, allowing users to choose the schema and model that best fits each application.
- Open-source support and standardization: Ensures compatibility and integration with multiple platforms.
- Decoupled from the underlying infrastructure: Allows IT to build a flexible, composable cloud infrastructure and provision resources for dynamic workloads without breaking the applications running on it.
- Support for analytics and advanced AI applications: Runs workloads directly against the data lake instead of maintaining reshaped copies of the data in a data warehouse.
- Optimal data quality: Atomicity, consistency, isolation, and durability (ACID) transactions, previously available only in data warehouse systems, ensure data quality as new data is ingested in real time (see the sketch after this list).
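As an example of those ACID upsert and delete capabilities, here is a minimal PySpark sketch assuming a Delta Lake table stored on Amazon S3; Apache Hudi and Apache Iceberg expose equivalent row-level operations. The paths, table layout, and column names are illustrative rather than taken from any specific deployment.

```python
# A minimal sketch, assuming Delta Lake on Amazon S3 (Apache Hudi and Apache
# Iceberg offer equivalent operations). Paths and columns are illustrative.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("lake-house-acid-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = DeltaTable.forPath(spark, "s3://example-lake-house/events/")
updates = spark.read.parquet("s3://example-lake-house/staging/new_events/")

# Upsert: update existing events and insert new ones in a single ACID commit,
# so downstream readers never see a half-applied batch.
(events.alias("t")
    .merge(updates.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Row-level delete for a compliance request, also a single ACID transaction.
events.delete("user_id = 'user-123'")
```

The same table can still be read as plain open-format files by other engines, which is the tradeoff the data lake house pattern is designed to preserve.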
Data Lake House Use Case Example
Mactores recently worked with a large biotech company that was struggling to manage an on-premises traditional Oracle-based data platform. The company could not scale new use cases for machine learning teams as new information sources were added, leading to siloed information assets and inadequate data quality.
Following an in-depth, eight-week assessment and multiple workshops to understand the company’s true business requirements and big data use cases, Mactores worked with their business and IT teams to transform the existing data pipeline and incorporate an advanced data lake house solution.
Post-implementation, the result was a 10x improvement in the agility of their DataOps process and a 15x improvement for ML applications, as teams spent less time fixing data quality issues and more time building business-specific ML use cases.
Additionally, the customer’s business teams now have access to self-service platforms to build custom workflows and process new datasets critical to their decision making, all independent of the IT team.
The customer's improvements were largely attributable to the data lake house architectural pattern, which unifies robust data management features with advanced analytics capabilities and makes critical data accessible to business users from all departments within the organization.
The open format further allowed the company to overcome issues such as poor performance, inadequate availability, high cost, and the vendor lock-in that comes with increased dependence on proprietary technologies. Stringent regulatory requirements were also met, as the data lake house allows users to upsert, delete, update, and change data records at short notice.
Conclusion
Over the long term, the transition to open-source technologies and the low cost of cloud storage will contribute to the rising popularity of big data storage architectural patterns that offer an optimal mix of features from data warehouse and data lake systems.
As this trend continues, the choice between a data warehouse, a data lake, and a data lake house will largely depend on how well users can maximize data quality and establish a standardized data governance model to achieve the desired return on investment (ROI) from their big data applications.
Mactores Cognition – AWS Partner Spotlight
Mactores is an AWS Advanced Tier Services Partner and a trusted leader in providing modern data platform solutions to businesses.