Announcing Attunity Compose for Hive – A new way to accelerate data loading and transformation for Hadoop Data Lakes.
Attunity Compose for Hive automates the creation, loading, and transformation of data into Hadoop Hive structures. It fully automates the pipeline of business intelligence (BI)-ready data into Hive to create both Operational Data Stores (ODS) and Historical Data Stores (HDS). Attunity Compose leverages the latest innovations in Hadoop, such as the new ACID MERGE SQL capability available today in Apache Hive (part of the Hortonworks 2.6 distribution), to automatically and efficiently process data insertions, updates, and deletions.
Attunity Replicate integrates with Attunity Compose to accelerate data ingestion, data landing, SQL schema creation, data transformation and ODS & HDS creation/updates.
With Attunity Compose for Hive, you have:
- Real-time data ingestion and landing. Leverage tight integration with Attunity Replicate to ingest data in batch or via continuous data capture (CDC), then copy that data to an on-premises or cloud target.
- Comprehensive automation. Generate Hive schemas automatically for ODS and HDS targets, and apply all necessary data transformations seamlessly.
- Continuous, non-disruptive data store updates. Leverage the ANSI SQL compliant ACID MERGE operation to process data insertions, updates and deletions in a single pass.
- Transaction consistency. Partition updates by time to ensure each transaction update is processed holistically for maximum consistency.
- Improved operational visibility. Support for slowly changing dimensions provides a granular history of updates (such as customer address changes) within the Historical Data Store, so you can understand the impact of each change.
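As a minimal sketch of the single-pass ACID MERGE described above (the table and column names here are hypothetical, not generated by Compose), a statement applying captured inserts, updates, and deletes to an ODS table might look like:

```sql
-- Hypothetical example: apply a batch of change records to an ODS table
-- in a single pass using Hive's ACID MERGE (Hive 2.2+ / HDP 2.6).
MERGE INTO ods.customers AS t
USING staging.customers_changes AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE                      -- captured deletes
WHEN MATCHED THEN UPDATE SET name = s.name,                  -- captured updates
                             address = s.address
WHEN NOT MATCHED THEN INSERT                                 -- captured inserts
  VALUES (s.customer_id, s.name, s.address);
```

The target table must be a transactional (ACID) ORC table for MERGE to work.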
Data Automation to Hive in Five Steps
Step 1: Use Attunity Replicate to ingest data into Hadoop and partition the data
Attunity Replicate transfers data into Hadoop and the HDFS file system in parallel via the WebHDFS and HttpFS protocols or over NFS, and connects to HCatalog via ODBC and HQL scripts. As data is loaded into Hadoop, data partitioning is introduced as a way of creating metadata that addresses consistent, transactionally verified datasets. Data files are uploaded to HDFS according to the maximum size and time definitions, and are stored in a directory under the change table directory. Whenever the specified partition timeframe ends, a partition is created in Hive, pointing to the HDFS directory.
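To illustrate the last step (the database, table, and path names below are hypothetical, assuming a time-stamped change-table directory layout), registering a closed partition timeframe in Hive could be sketched as:

```sql
-- Hypothetical sketch: once a partition timeframe closes, register the
-- corresponding HDFS directory as a time-based partition of the change table.
ALTER TABLE northwind_landing.orders__ct
  ADD IF NOT EXISTS PARTITION (partition_ts = '20170601120000')
  LOCATION '/change_tables/orders/20170601120000';
```

Because the partition simply points at the directory, the files landed by Replicate become queryable in Hive as soon as the partition is added.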
Step 2: Connect to the Hadoop cluster and configure the CDC and ETL processes
The images below showcase the connections into Hive and into the source database, Northwind, a MySQL instance.
By optionally storing the history of changes through the Manage Metadata -> Save Changes screen, you can choose to design either an Operational or a Historical Data Store.
Step 3: Generate Hive LLAP code for loading data
Attunity Compose considers these key items while generating Hive ETL calls:
- Extracting data from the sources (initial load and CDC)
- Loading data into the landing zone in transactionally consistent data partitions to maintain integrity
- Transforming data in the landing zone from SequenceFile to ORC format
- Handling ETL for DELETE operations
- Scaling to support large numbers of sources, tables, and transactions, with consideration for parallel processing of tasks
- Intelligently managing the number of parallel ETL processes to prevent Hadoop cluster overload
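The SequenceFile-to-ORC transformation above can be sketched as follows (all database, table, and column names are hypothetical, assuming a landing table stored as SequenceFile):

```sql
-- Hypothetical sketch: materialize a landing table (SequenceFile) into a
-- transactional ORC table, which Hive's ACID operations require.
CREATE TABLE IF NOT EXISTS ods.orders (
  order_id    INT,
  customer_id STRING,
  order_date  TIMESTAMP)
CLUSTERED BY (order_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO ods.orders
SELECT order_id, customer_id, order_date
FROM landing.orders;  -- SequenceFile-backed landing table
```

ORC's columnar layout and the transactional table properties are what make the subsequent single-pass MERGE updates efficient.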
As changes are applied to the source system, the data is delivered to the [table]_delivery zone, which serves as the final presentation layer.
Audits are carried throughout the process: a separate set of audit tables tracks each record in the [table]_landing Hive tables, alongside the change tables and a record of each table's partitions. The CDC partitions record when changes land in each partition, in the attrep_cdc_partitions table.
Reviewing the delivered content shows the results of the latest merge. The latest updates and merged records can be traced through the 'I' (Insert) and 'U' (Update) statements, and an additional reconciliation pass is applied wherever a delete occurred.
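A common pattern for this kind of reconciliation (sketched here with hypothetical table and column names, assuming the change table carries an operation flag and a change sequence) is to select only the most recent change record per key before merging:

```sql
-- Hypothetical sketch: pick the latest change record per key from the
-- change table, so each key is merged exactly once per run.
SELECT order_id, op, order_date
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (PARTITION BY order_id
                            ORDER BY change_seq DESC) AS rn
  FROM landing.orders__ct c
) latest
WHERE rn = 1;
```

Feeding this result into the MERGE's USING clause ensures that an insert followed by several updates and a delete collapses to a single net operation.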
Step 4: Configure the Parallelism and Optimizations needed
Throttle the run to avoid overloading the Hadoop cluster by limiting the number of SQL statements executed at once. In the Manage ETL Set window, go to ETL Commands > Settings > Advanced to set the maximum number of concurrent database connections to use.
Step 5: Show the data through Hive
Finally, the reconciled delivery zone of data is presented through Hive.
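For example, a consumer could query the Historical Data Store directly with standard SQL (the table and column names below, including the history timestamp columns, are hypothetical):

```sql
-- Hypothetical query against the reconciled delivery zone: trace the
-- history of one customer's address changes in the HDS.
SELECT customer_id, address, hds_start_ts, hds_end_ts
FROM delivery.customers_hds
WHERE customer_id = '1234'
ORDER BY hds_start_ts;
```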
In summary, the business benefits of Attunity Compose for Hive are:
- Faster data lake operational readiness
- Reduced development time
- Reduced reliance on Hadoop skills
- Standard SQL access
Would you like to try Attunity Compose for Hive? To learn more or participate in the beta program, please click here.