Data lakes have multiple phases of maturity. The first phase can be described as a data reservoir: applications and their data stream into the data lake, as is, from operational stores and reporting platforms. In this way, data structures carry all history at both the detail and summary levels. The datasets are more readily available, centralized in one platform, and can expand the depth and breadth of the data warehouse with a variety of available datasets. The benefit of these new data sources within a data lake is the large number of distributed compute nodes, which can perform basic transformation tasks where proper ETL processes could not. This process follows an Extract, Load, Discover, and Transform (ELDT) methodology, which addresses SLA gaps by offloading data preparation from data warehouses and mainframes to another parallel processing platform. Moving transformation cycles to the parallel processing platform of Hadoop reduces cost and leads to greater levels of discovery.
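The ELDT ordering can be sketched in plain Python: raw records are landed exactly as received (extract and load), the landed copy is profiled (discover), and only then is it parsed and cleaned (transform). The feed format and cleanup rules here are hypothetical, chosen only to illustrate the ordering.

```python
# Hypothetical ELDT sketch: land raw data first, transform later.
raw_feed = [
    "1001,ACME Corp, 250.00",
    "1002,Globex,1000.50",
    "1003, Initech ,75.25",
]

# Extract + Load: persist records exactly as received, no cleansing yet.
landing_zone = list(raw_feed)

# Discover: profile the landed data to learn its shape.
field_counts = {len(row.split(",")) for row in landing_zone}
assert field_counts == {3}  # every record carries three fields

# Transform: only now parse and clean, on the platform's landed copy.
def transform(row):
    account, name, amount = (part.strip() for part in row.split(","))
    return {"account": int(account), "name": name, "amount": float(amount)}

curated = [transform(row) for row in landing_zone]
```

The point of the ordering is that the landing zone keeps the untouched original, so discovery and transformation can be re-run as understanding of the data improves.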
The second phase is an exploratory data lake. In this phase, the data lake expands to enrich and enhance datasets by combining multiple sources, so that imputation methods for data cleansing can be applied. New Hadoop-based discovery and enrichment tools have arisen to address this opportunity to integrate and converge analytics and enrich data. These tools analyze data where it resides, bringing statistical tools such as SAS, SPSS, and R together with common SQL tools to address extended use cases such as voice, image, spatial, and graph-based scenarios. This stops the practice of moving data to bespoke and constrained platforms that kept redundant datasets on laptops or on servers in the closet.
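The combine-then-impute pattern described above can be sketched in a few lines of plain Python. The two sources, their shared account key, and the mean-imputation rule are all hypothetical; in practice this would run where the data resides, with whatever imputation method fits the dataset.

```python
# Hypothetical enrichment sketch: combine two sources, then impute gaps.
crm = {"A1": {"region": "East"}, "A2": {"region": "West"}, "A3": {"region": "East"}}
billing = {"A1": {"spend": 120.0}, "A2": {"spend": None}, "A3": {"spend": 80.0}}

# Combine the sources on the shared account key.
combined = {key: {**crm[key], **billing[key]} for key in crm}

# Impute missing spend values with the mean of the observed values.
observed = [r["spend"] for r in combined.values() if r["spend"] is not None]
mean_spend = sum(observed) / len(observed)
for record in combined.values():
    if record["spend"] is None:
        record["spend"] = mean_spend
```

Imputation only becomes possible once the sources are combined: the gap in one feed is filled using evidence from the merged whole.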
The third and final phase of the data lake is the analytical lake, which combines exploratory analytics with the existing data warehouse, its data structures, and the data migration into and out of it. This type of self-service analytics means the data's schema can be read at processing time, allowing you to bring multiple data structures onto a single analysis platform and quickly create a single dataset. Analytics are operationalized by combining the new exploratory Hadoop-based analytics tools with the historical tools that commonly connected to data warehouses. All of this processing removes the ETL middleman and enables you to define your own data context, enrichments, and transforms. Moreover, transformation and modeling are done in place, allowing the data, not the data model, to drive the discovery process and assembling the basic analytics steps within that lifecycle. Other datasets can then go into the archive, to be prepared and combined for further discovery and analysis.
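Reading the schema at processing time ("schema-on-read") can be illustrated with a minimal sketch: raw events are stored untyped, and each reader applies its own schema and defaults only when the data is consumed. The event fields, schema, and defaults here are assumptions for illustration, not a specific product's API.

```python
import json

# Hypothetical schema-on-read sketch: raw events are stored untyped,
# and a schema is applied only when the data is read for analysis.
raw_events = [
    '{"user": "u1", "clicks": "3"}',
    '{"user": "u2", "clicks": "7", "country": "DE"}',
]

# Each reader declares its own schema; fields absent in a record get defaults.
read_schema = {"user": str, "clicks": int, "country": str}
defaults = {"country": "unknown"}

def read_with_schema(line):
    record = json.loads(line)
    return {
        field: caster(record.get(field, defaults.get(field)))
        for field, caster in read_schema.items()
    }

events = [read_with_schema(line) for line in raw_events]
```

Because the schema lives with the reader rather than the store, differently shaped records can sit side by side, and a new analysis can impose a new schema without rewriting the data.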
Two Rules for Data and Four Key Considerations
There are two rules in today's big data world: either you move the compute to the real-time data, whether for aggregation processing that drives immediate action or for statistical processing of data being archived, or you move the data to a platform where it can scale. These scenarios lead to environments where work progresses from exploratory analytical use cases, to curating and enriching the data to find its value, to building structured reporting that addresses everyday questions.
As data lakes evolve to accommodate the changing requirements of the business, there are four key considerations.
- MDM integration around the core tables should be bidirectional, with tagging and linking tools such as graph databases (GraphDBs) that highlight the relationships in data.
- Data quality efforts must discover contradictions, inconsistencies, and redundancies in incoming data; this will become a more and more significant effort across all of the ingest frameworks.
- Security policy should be automated across authentication, authorization, encryption, and monitoring.
- Proper data masking and security to protect access to sensitive data carries regulatory requirements and additional auditing obligations.
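The masking and auditing considerations above can be sketched together: sensitive fields are tokenized before analysts see them, and every read of a sensitive field leaves an audit entry. The field names, hash-based tokenization, and audit format are assumptions for illustration only, not a prescribed policy.

```python
import hashlib

# Hypothetical masking sketch: sensitive fields are tokenized before
# analysts can query them, and an audit trail records each access.
SENSITIVE_FIELDS = {"ssn", "email"}
audit_log = []

def mask(value):
    # One-way hash token: consistent enough for joins, unreadable to analysts.
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def read_record(record, user):
    audit_log.append((user, sorted(SENSITIVE_FIELDS & record.keys())))
    return {
        key: (mask(value) if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }

row = {"name": "Ada", "ssn": "123-45-6789"}
safe = read_record(row, user="analyst1")
```

Hashing rather than deleting the sensitive field keeps the masked column usable for joins and deduplication, while the audit log supports the regulatory review the consideration calls for.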