As enterprises centralize applications, systems, and data into Hadoop data lakes, core systems such as ERP and CRM can be joined with large sensor datasets, opening up integrations, opportunities, and business models that were not available before.
The foundation of a data lake has one core requirement: real-time streaming data integration from every source. Gartner reports that the value of data changes as new datasets and opportunities are introduced into the data lake, so capturing streaming data from sensors and systems requires the right tools. With 90% of data abandoned or orphaned on ingestion, and 60% of sensor data losing its value within milliseconds, the ability to stream data from every source is key.
Enterprises must be prepared to: 1) Analyze data from hundreds to thousands of sources of business and machine data; 2) Analyze data anywhere, whether on-premises or in the cloud, from databases, data warehouses, Hadoop systems, in-memory structures, and more; and 3) Analyze data in real time, capturing new and changing data and processing it in motion.
As an example, a Fortune 10 Attunity customer in the auto industry has taken this real-time integration approach for over half of its 4,500 applications and systems, with over 100,000 tables streaming into the data lake. Using Attunity Replicate's change data capture (CDC) process, the company maintains true real-time analytics with less overhead.
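To make the CDC idea concrete, here is a minimal sketch of the kind of change record a log-based CDC process emits: one event per row change, carrying before and after images of the row. The field names (`op`, `before`, `after`, `tx_id`) are illustrative assumptions, not Attunity Replicate's actual wire format.

```python
import json

def make_change_event(op, table, key, before=None, after=None, tx_id=None):
    """Build a generic CDC change record: one event per row change,
    tagged with the source table and the transaction that produced it."""
    return {
        "op": op,          # "insert", "update", or "delete"
        "table": table,
        "key": key,
        "before": before,  # row image before the change (None for inserts)
        "after": after,    # row image after the change (None for deletes)
        "tx_id": tx_id,    # groups events from the same source transaction
    }

# An UPDATE captured from the source database's transaction log:
event = make_change_event(
    op="update",
    table="crm.accounts",
    key={"account_id": 42},
    before={"account_id": 42, "status": "trial"},
    after={"account_id": 42, "status": "active"},
    tx_id="tx-1001",
)

payload = json.dumps(event)  # serialized form streamed to the data lake
```

Because each event carries the full before/after images, downstream consumers can rebuild current state or audit history without re-querying the source system.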
Applications Stream Data
Streaming platforms, whether Kafka alongside on-premises Hadoop infrastructure or cloud services such as AWS Kinesis and Azure Event Hubs, help absorb data arriving from many sources at once, while dedicated compute nodes avoid contention between ingest and reporting. These platforms also manage the complexity of integrating different systems by handling multiple dynamic data types and schemas, and they reduce data loss.
As an example, one of the largest mortgage processors in the Southeastern United States used Attunity's solutions to centralize data on Kafka as the universal transaction log for all of its sources. Processing data this way gives the company a single means of managing stream publishers across documents, e-mails, web logs, clickstreams, social networks, machine-generated data, sensor data, and geo-location data.
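The "universal transaction log" pattern can be sketched in a few lines: every source, whatever its format, publishes records to a named topic on one shared append-only log. The sketch below models topics as in-memory lists purely for illustration; a real deployment would use Kafka producer clients, and the topic names and record shapes here are hypothetical.

```python
from collections import defaultdict

class UniversalLog:
    """In-memory stand-in for a Kafka cluster: one append-only
    list per topic, so every source publishes to a common log."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1  # offset of the appended record

log = UniversalLog()

# Heterogeneous sources, each routed to its own topic on the same log:
off = log.publish("weblogs", {"path": "/rates", "status": 200})
log.publish("clickstream", {"session": "s1", "event": "apply_click"})
log.publish("sensors", {"device": "gps-7", "lat": 33.7, "lon": -84.4})
```

Centralizing on one log means downstream analytics subscribe to topics rather than integrating point-to-point with each source system.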
Other firms have found that relational systems such as OLTP, ERP, and CRM databases can use Attunity Replicate to feed data as it is published per transaction, pulled in increments, or batched in bulk loads at regular intervals. Kafka also solves a long-standing problem of older messaging systems, where persistence was costly, optional, and detrimental to the performance of delivering information in real time. Unlike established messaging platforms such as Apache ActiveMQ and RabbitMQ, Apache Kafka persists messages automatically and unconditionally, and delivers all acknowledged messages in order. Kafka concentrates on this high-performance messaging layer rather than on data transformations or task scheduling.
To learn more:
- Watch the Real-time Data Pipelines with SAP and Apache Kafka on-demand webinar
- Watch the Real-time Data Integration for IoT Analytics on-demand webinar
- Read the Leveraging Mainframe Data for Modern Analytics knowledge brief