This article written by Mark Kromer, Senior Product Manager, Microsoft and Clive Bearman, Director of Product Marketing, Attunity explains why adding change data capture (CDC) to your data integration processes just makes good sense.
Lately, Microsoft Azure Data Factory seems to have caused a stir in the Attunity Universe with customers, prospects, account reps and SE’s emailing me the same question.
It’s a fair question and one that’s easily answered. But, first let’s review some fundamentals and see what caused such a disturbance in the force.
What is Microsoft Azure Data Factory?
Azure Data Factory (ADF) is the Azure data integration service in the cloud that enables building, scheduling and monitoring of hybrid data pipelines at scale with a code-free user interface. Additionally, you can process and transform the data along the way by using compute services such as Azure HDInsight, Spark, Azure Data Lake Analytics, Azure SQL Database, Azure SQL Data Warehouse, SQL Server, Azure Databricks and Azure Machine Learning.
How Does ADF Differ from other ETL Tools?
There’s some nice features that distinguish ADF from the rest of the cloud pack. ADF has a clean user interface to define dataflows, also .NET, REST and Python API’s if you feel the need to code your own ETL tasks. It’s also got the ability to run SSIS packages too. That’s handy if you’ve got some legacy SSIS integrations hanging around and are looking for a path to something more modern. Finally, there’s a windows-based integration runtime (aka data gateway) that helps you retrieve data from “on-premises” systems.
You can use ADF in the same way that you’d use any traditional ETL tool. Modernizing a data warehouse, aggregating data for analytics and reporting, or acting as a collection hub for transactional data. However, the primary goal in this release is to migrate your data to Azure Data Services for further processing or visualization. Not surprising that Power BI is a good complement to ADF as we can see from the diagram below. In this example ADF aggregates blog comments from a blob store with data from an on-premises SQL server. It then transforms the data and delivers the result to an Azure SQL Database. Finally, a Power BI dashboard visualizes the data and insights emerge! Perhaps learning that avid blog readers also purchase the most widgets on Tuesdays.
A Real-World Retailer
In this scenario, a retailer has launched a data warehouse modernization initiative and has decided to implement a cloud data lake. They’ve decided not to use an on-premises Hadoop solution, opting instead for Azure. The retailer is using Azure Data Factory to populate Azure Data Lake Store with Power BI for visualizations and analysis. The high-level architecture looks something like the diagram below:
The obvious solution to keeping data fresh is to schedule Azure Data Factory pipelines to execute every few minutes. The ADF copy activity is primarily built for copying whole tables of data and not just the rows that have changed or copy time-partitioned buckets of data files. So, as size of the data source gets bigger, the more data you need to copy and the longer the tasks take to manage and execute.
One solution is to use the ADF “incremental copy template” and only copy new rows from the source. However, this process relies on the data source having a “watermark” table to keep track of database activity over time. In addition, extra pipeline logic is required to calculate the time stamps, retrieve the data for the relevant time period, and update the watermark table ready for the next ADF pipeline run. See below:
Like ETL, Change Data Capture (CDC) is not a new concept. The technology emerged two decades ago to help replication software vendors deliver real-time transactions to data warehouses. It works via standard change-log mechanisms to continuously identify and capture incremental changes to data and data schemas from sources such as enterprise-grade databases. These small incremental changes are then automatically propagated to target systems via a low-latency data transfer. The impact on operational systems is also extremely low since the source systems don’t require integration agents or additional database queries. As a result, CDC has the benefit of being easier to administer and manage than other types of data integration. Furthermore, there’s no “batch window” and the data in target systems are always up-to-date.
Adding Attunity Replicate to Azure Data Factory
It’s easy to see how our retail scenario could benefit by adding Attunity Replicate to our architecture now that we understand the concept of CDC. A possible configuration might look something like the diagram below:
We’ve ultimately arrived at the best solution. We can use Azure Data Factory for our ETL-based data pipeline integrations, like in our first scenario. Then we can use Attunity Replicate to ingest transaction data into our Azure Data Lake Store in-real time. The combined solution provides the best, most accurate data for our business analysts and users.
Comparing Azure Data Factory and Attunity Replicate
By now you should have gotten a sense that although you can use both solutions to migrate data to Microsoft Azure, the two solutions are quite different. ADF is a cloud-based ETL service, and Attunity Replicate is a high-speed data replication and change data capture solution. Use ADF when you want to create dataflows that copy complete tables of information or incrementally load delta data in batch workflows. Use Attunity Replicate when you want to ingest large volumes of data into Azure Data Services and keep those data targets up-to-date with incremental data. The two solutions are extremely complimentary and work well together. The illustration below is a handy comparison that you can print and stick to your wall as a reminder.
Attunity and Microsoft have been business partners for over twenty years. From the early days of providing change data capture for Microsoft SQL Server Integration Services, to a renewed OEM agreement announced just last month.
In November 2017, Attunity working with Microsoft launched a dedicated product called “Attunity Replicate for Microsoft Migrations” which was designed to accelerate the migration of enterprise data at scale from various database systems to the Microsoft data platform. Both companies sponsor this no-fee migration offering that you can download here.
Proving the strength of a combined solution can be easily demonstrated by customers who use both Microsoft Azure and Attunity. A recent example is a global telecommunications company that announced a strategic agreement to leverage Attunity as the key technology for it’s Microsoft Azure Data Lake Initiative. Details can be seen here.
Finally, follow this link for all the information about Attunity and Microsoft partnership.
Microsoft Azure Data Factory is the Azure data integration service in the cloud that enables building, scheduling and monitoring of hybrid data pipelines at scale with a code-free user interface. The ETL-based nature of the service does not natively support a change data capture integration pattern that is required for many real-time integration scenarios. Adding Attunity Replicate to an ADF installation is good practice and ensures that the combined solution delivers the most relevant and accurate data for business analysis.
Next Steps – Download Attunity Replicate for Microsoft Migrations
Register to download a free version of the award-winning Attunity Replicate that’s specifically tailored to help you migrate commercial data sources to the Microsoft Data Platform. In just a few minutes you’ll be able to easily ingest data into your Microsoft data products from diverse sources such as Oracle databases and enterprise data warehouses like Teradata. Try it for yourself today.