While few technology sectors really sit still, the data segment is especially dynamic. Open source innovators and vendors are churning out a dizzying set of architectural options for today’s architects and CIOs. Placing strategic bets has rarely been more difficult.
At Attunity, we advise some of the world’s largest enterprises as they whiteboard their data environments and data integration plans. We have found that our customers often change plans midstream, based on new business requirements or lessons learned. As they go through this trial-and-error process, they find that data ingest is a repeated requirement rather than a one-time event. Sources and targets may change over time. In addition, data now often flows through pipelines or multiple processing zones, frequently coming to rest in familiar analytics destinations: data warehouses or data marts.
Let’s explore common data integration requirements we help enterprises address. Most projects entail at least two of four primary use cases. Here is a summary of the use cases and the motivation for each:
- Data lake ingestion. Data lakes based on HDFS or AWS S3 have become a powerful complement to traditional data warehouses because they hold and process higher volumes of data across more data types.
- Database transaction streaming. Businesses need to capture perishable data value, and streaming new and changed business records in real time to analytics engines makes this possible. Sending incremental updates in this fashion also eliminates the need for batch loads that disrupt production.
- Data extraction and loading from production sources such as mainframe, SAP and Oracle. Revenue, supply chain and other operational data from these core enterprise systems hold a mountain of potential analytics value, especially when analyzed on external platforms and mixed with external data such as clickstream or social media trends.
- Cloud migration. Resource elasticity, cost savings and reduced security concerns have made the cloud a common platform for analytics.
Managing Change at a Managed Health Services Provider
As an example, we are working with the CIO of a managed health services provider to publish records in real time from a DB2 iSeries database to Kafka, which in turn flows through Flume to feed their Cloudera Kudu columnar database for high-performance reporting and analytics. Our change data capture (CDC) technology starts the whole data flow by non-disruptively copying live records (millions each day) as they are inserted or updated on the production iSeries system. This is a great example of an initiative involving multiple use cases: data lake ingestion, database streaming and production database extraction. Many variables are changing at the same time.
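The capture step in this flow — copying each insert or update as a change event and publishing it downstream — can be sketched roughly as follows. This is a minimal, self-contained illustration, not Attunity's implementation: the record fields, the `ChangeEvent` shape and the in-memory `ChangeStream` are hypothetical, and a real deployment would publish to an actual Kafka producer rather than the dictionary of topics shown here.

```python
from dataclasses import dataclass, field
from collections import defaultdict
from datetime import datetime, timezone

@dataclass
class ChangeEvent:
    # One captured insert or update from the source database (hypothetical shape).
    table: str
    op: str    # "insert" or "update"
    row: dict
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class ChangeStream:
    """In-memory stand-in for a Kafka producer: one topic per source table."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, event: ChangeEvent):
        # Route each change to the topic named after its source table.
        self.topics[f"cdc.{event.table}"].append(event)

stream = ChangeStream()
stream.publish(ChangeEvent("claims", "insert", {"claim_id": 1, "status": "open"}))
stream.publish(ChangeEvent("claims", "update", {"claim_id": 1, "status": "paid"}))
stream.publish(ChangeEvent("members", "insert", {"member_id": 7}))

print(sorted(stream.topics))             # ['cdc.claims', 'cdc.members']
print(len(stream.topics["cdc.claims"]))  # 2
```

The key property is non-disruption: the capture process only reads the change log, so the production system never waits on the analytics pipeline.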
The change does not stop there, because this customer has not fully settled on Kudu. They might instead (or also) run analytics on Hive, which effectively serves as a SQL data warehouse within their data lake, depending on what they learn about the analytics workload behavior.
Changing the Ingredients at an International Food Manufacturer
Another example of complexity and change is a Fortune 500 food manufacturer that is using our CDC technology to feed a new Hadoop data lake based on the Hortonworks Data Platform. They efficiently copy an average of 40 SAP record changes every five seconds, decoding that data from complex source SAP pool and cluster tables. Attunity Replicate publishes this data stream, along with periodic metadata and DDL changes, into a Kafka message queue that feeds HDFS and HBase consumers subscribing to the relevant message topics (one topic per source table).
Once the data arrives in HDFS and HBase, Spark in-memory processing helps match orders to production on a real-time basis and maintain referential integrity for purchase order tables within HBase and Hive. As a result, they have accelerated sales and product delivery with accurate real-time operational reporting. They have replaced batch loads with change data capture to operate more efficiently and more profitably.
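The matching step described above — joining order changes against production records as both streams arrive — might look like the following sketch. The topic names, table layout and `order_id` key are illustrative assumptions, not the manufacturer's actual schema, and the in-memory lists stand in for what would be Spark DataFrames over HBase and Hive tables.

```python
# Hypothetical per-table topics, following the "one topic per source table" pattern.
topics = {
    "sap.orders":     [{"order_id": 1, "sku": "A"}, {"order_id": 2, "sku": "B"}],
    "sap.production": [{"order_id": 1, "produced": 100}],
}

def match_orders_to_production(topics):
    """Join order changes with production records by order_id, flagging
    orders whose production rows have not yet arrived -- a simple
    referential-integrity check between the two streams."""
    produced = {p["order_id"]: p for p in topics["sap.production"]}
    matched, pending = [], []
    for order in topics["sap.orders"]:
        if order["order_id"] in produced:
            matched.append({**order, **produced[order["order_id"]]})
        else:
            pending.append(order)
    return matched, pending

matched, pending = match_orders_to_production(topics)
print(len(matched), len(pending))  # 1 1
```

In a streaming context this join runs continuously: "pending" orders are simply those whose matching production rows have not arrived yet, which is why maintaining referential integrity across the tables matters.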
But once again this is not the end of the story. They are now moving their data lake to an Azure cloud environment to improve efficiency and reduce cost.
We believe these and other companies are making the right strategic choices based on the best information available at each point in time. As our customers evolve, they are adopting a few guiding principles to navigate the complexity.
- Trial and error is the norm. You cannot know how well various technologies will support your queries until you run them. Once you do, you might decide to move to an alternative.
- Platform flexibility is critical. The trial and error process leads naturally to the need for flexibility in data integration processes. Our customers benefit from having a single console and single automated process that can add, remove or change any major source or target on the fly.
- Developer dependency must be reduced. Rising data integration demands, and the frequent need for change, risk overburdening ETL programmers and making them a bottleneck. The more enterprises automate, the more they empower architects and DBAs to integrate data quickly without programmers.
- Multiple zones/pipelines will become common. The most effective data lakes are really becoming canals, with locks that control data flow between them. Put differently, a sequence of zones in the data lake can run data through a series of transformations prior to analytics, with the ability to rewind at any point in the event of error.
- Data warehouse structures matter. Many organizations are starting to treat their data lakes as transformation areas that siphon off prepared data sets for structured analysis on traditional data warehouses. SQL, data warehouse and ACID-compliant structures often yield the most effective analytics results.
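The zones-and-locks idea above — running data through a sequence of transformations with the ability to rewind at any point — can be sketched as a pipeline that checkpoints its output after each zone. The zone names and transformations here are illustrative, and real pipelines would checkpoint to durable storage rather than memory.

```python
class ZonedPipeline:
    """Run data through a sequence of transformation zones, snapshotting
    after each zone so processing can rewind to any point on error."""
    def __init__(self, stages):
        self.stages = stages   # list of (zone_name, transform_fn)
        self.checkpoints = []  # snapshot taken after each zone

    def run(self, data):
        self.checkpoints = [("raw", data)]
        for name, fn in self.stages:
            data = fn(data)
            self.checkpoints.append((name, data))
        return data

    def rewind(self, zone):
        # Recover the snapshot taken after the named zone.
        for name, snapshot in self.checkpoints:
            if name == zone:
                return snapshot
        raise KeyError(zone)

pipeline = ZonedPipeline([
    ("cleanse", lambda rows: [r for r in rows if r is not None]),
    ("conform", lambda rows: [str(r).strip().lower() for r in rows]),
])
result = pipeline.run([" Widget", None, "GADGET "])
print(result)                      # ['widget', 'gadget']
print(pipeline.rewind("cleanse"))  # [' Widget', 'GADGET ']
```

Each checkpoint acts like a canal lock: if a downstream transformation fails or a rule changes, reprocessing restarts from the last good zone rather than from the original source systems.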
There is no decoder ring for architecting your data environment. Our customers are finding that these five guiding principles provide consistent guardrails to improve the odds of success.
To learn more: