Data Ingestion

Definition, types, and use cases. This guide provides definitions, use-case examples, and practical advice to help you understand data ingestion.

What is Data Ingestion?

Data ingestion refers to the tools and processes used to collect data from various sources and move it to a target site, either in batches or in real-time. The data ingestion layer is critical to your downstream data science, BI, and analytics systems, which depend on timely, complete, and accurate data.

Types and Use Cases

Your unique business requirements and data strategy will determine which ingestion methods you choose for your organization. The primary factors involved in this decision are how quickly you need access to your data, and which data sources you’re using.

There are three main ways to ingest data: batch, real-time, and lambda, which is a combination of the first two.

  1. Batch Processing
    In batch-based processing, historical data is collected and transferred to the target application or system in batches. These batches can be scheduled to run automatically, triggered by a user query, or triggered by an application.

    The main benefit of batch processing is that it enables complex analysis of large historical datasets. Traditionally, batch ingestion has also been easier and less expensive to implement than real-time ingestion, although modern tools are quickly changing this equation.

    ETL (Extract, Transform, and Load) pipelines support batch processing. Converting raw data to match the target system before it is loaded allows for systematic and accurate data analysis in the target repository.
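
    To make the pattern concrete, here is a minimal batch ETL sketch in Python. The CSV source file, SQLite target table, and transformation rule are illustrative assumptions, not part of any specific product.

        # Minimal batch ETL sketch: extract rows from a CSV file, transform them to
        # match the target schema, and load them into a SQLite table in one batch run.
        # File name, table name, and transformation are illustrative placeholders.
        import csv
        import sqlite3

        def extract(path):
            with open(path, newline="") as f:
                return list(csv.DictReader(f))

        def transform(rows):
            # Normalize names and cast amounts so the data matches the target schema.
            return [(r["id"], r["name"].strip().lower(), float(r["amount"])) for r in rows]

        def load(records, db_path="warehouse.db"):
            conn = sqlite3.connect(db_path)
            conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, name TEXT, amount REAL)")
            conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
            conn.commit()
            conn.close()

        if __name__ == "__main__":
            load(transform(extract("orders.csv")))  # one scheduled batch run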

    If you need timely, near-real-time data but your data integration architecture prevents you from employing stream processing, micro-batching is a good option to consider. Micro-batching splits your data into groups and ingests them in very small increments, simulating real-time streaming. Apache Spark Streaming, for example, is a micro-batch processing extension of the Spark API.
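
    As a rough illustration, the sketch below uses PySpark Structured Streaming (the successor to the original Spark Streaming API) with its built-in "rate" test source; the 30-second trigger interval and the trivial filter are illustrative assumptions.

        # Micro-batch sketch with PySpark Structured Streaming: the stream is processed
        # in small batches every 30 seconds rather than record by record.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

        # Built-in test source that continuously emits rows with a timestamp and a value.
        events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

        # A trivial transformation: keep only even-valued rows.
        evens = events.filter(events.value % 2 == 0)

        # Emit results to the console in 30-second micro-batches.
        query = (evens.writeStream
                 .format("console")
                 .trigger(processingTime="30 seconds")
                 .outputMode("append")
                 .start())

        query.awaitTermination()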

  2. Real-Time Processing

    In real-time processing, also known as stream processing, streaming pipelines move data continuously in real-time from source to target. Instead of loading data in batches, each piece of data is collected and transferred from source systems as soon as it is recognized by the ingestion layer.

    A key benefit of stream processing is that you can analyze or report on your complete dataset, including real-time data, without having to wait for IT to extract, transform, and load more data. You can also trigger alerts and events in other applications, such as a content publishing system that makes personalized recommendations or a stock trading app that buys or sells equities. Plus, modern cloud-based platforms offer a lower-cost, lower-maintenance approach than batch-oriented pipelines.

    For example, Apache Kafka is an open-source distributed event streaming platform optimized for ingesting and processing real-time streaming data. It’s fast because it decouples data streams, which results in low latency, and it’s scalable because it allows data to be distributed across multiple servers.
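
    As a small illustration of this pattern, the sketch below publishes each record to a Kafka topic the moment it appears, using the third-party kafka-python client. The broker address, topic name, and event payload are assumptions made for the example.

        # Push each record into the stream as soon as it is recognized, instead of
        # queuing it for a later batch. Broker, topic, and payload are placeholders.
        import json
        from kafka import KafkaProducer

        producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

        def on_new_event(event):
            # Called by the source system whenever a new record appears;
            # the record is published to the "clickstream" topic immediately.
            producer.send("clickstream", value=event)

        on_new_event({"user_id": 42, "action": "page_view", "page": "/pricing"})
        producer.flush()  # ensure buffered records actually reach the broker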


    A typical real-time data ingestion framework works as follows:

    • A CDC (change data capture) streaming tool allows you to aggregate your data sources and connect them to a stream processor. Your CDC tool should continually monitor transaction and redo logs and move changed data.

    • A tool such as Apache Kafka or Amazon Kinesis allows you to process your streaming data on a record-by-record basis, sequentially and incrementally or over sliding time windows (see the sketch after this list).

    • Use a tool such as Snowflake, Google BigQuery, Dataflow, or Amazon Kinesis Data Analytics to filter, aggregate, correlate, and sample your data. For real-time queries you can use a streaming SQL engine such as ksqlDB, which is built for Apache Kafka. You can also store this data in the cloud for future use.

    • Finally, you can use a real-time analytics tool to conduct analysis, data science, and machine learning or AutoML without having to wait for data to land in a database. As mentioned above, you can also trigger alerts and events in other applications, such as a content publishing system that makes personalized recommendations to users or a stock trading app that buys or sells equities.
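
    As a rough sketch of the record-by-record, sliding-window processing mentioned above, the example below consumes a Kafka topic with the third-party kafka-python client and keeps a 60-second sliding count in memory. The topic name, broker address, and window length are illustrative assumptions.

        # Record-by-record processing over a sliding time window. Each message is
        # handled as it arrives, and only the last 60 seconds of records are kept.
        from collections import deque
        from time import time
        from kafka import KafkaConsumer

        WINDOW_SECONDS = 60
        consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")

        window = deque()  # (arrival_time, raw_value) pairs inside the current window

        for message in consumer:
            now = time()
            window.append((now, message.value))
            # Drop records that have slid out of the window.
            while window and window[0][0] < now - WINDOW_SECONDS:
                window.popleft()
            print(f"records in the last {WINDOW_SECONDS}s: {len(window)}")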

  3. Lambda Architecture

    Lambda architecture-based ingestion is a combination of both batch and real-time methods. Lambda consists of three layers. The first two layers, batch and serving, index your data in batches. The third layer, the speed layer, indexes in real-time any data that has not yet been ingested by the slower batch and serving layers. In this way, there is a continual balance between the three layers. This ensures that your data is both complete and available for you to query with minimal latency.

    The benefit of this approach is that it brings you the best of both batch and real-time processing. It gives you a full view of your historical batch data while also reducing latency and the risk of data inconsistency.
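
    As a highly simplified sketch of how a lambda-style query merges the layers, the example below combines a precomputed batch view with a speed-layer view covering only the records ingested since the last batch run. The in-memory counters standing in for each layer are illustrative.

        # Lambda-style read: complete historical counts come from the batch view,
        # while the speed view fills in the not-yet-batched tail with low latency.
        from collections import Counter

        batch_view = Counter({"page_views:/home": 10_000, "page_views:/pricing": 2_500})
        speed_view = Counter({"page_views:/home": 37, "page_views:/pricing": 12})

        def query(metric):
            return batch_view[metric] + speed_view[metric]

        print(query("page_views:/home"))  # 10037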

Benefits

Data ingestion is the primary, foundational layer of your data integration and analytics architecture. Here are the key benefits:

Data Availability. Data ingestion helps make data from across your organization readily available for analysis and for your downstream applications.

Data Transformation. Modern data pipelines using ETL tools transform the wide variety of data types from a range of sources—such as databases, IoT devices, SaaS applications, and data lakes—into a predefined structure and format before delivering it to the target system.

Data Uniformity. Data ingestion tools are flexible enough to handle unstructured data and a range of data formats, merging them into a unified dataset on which you can perform BI and analytics.

Data Insights. Ingestion feeds your analytics and BI tools, which in turn allow you to gain valuable insights on how to improve your company’s performance.

Data Application. You can also use ingested data to improve your applications and provide your users with the best experience.

Data Automation. Many manual tasks can be automated with a data ingestion process. This will save you time and money and let your team focus on other priorities.

Challenges

Data pipelines continue to become easier to set up and maintain, but they can still pose challenges such as the following.

Data Security. When transferring data from sources to target systems, your data may be staged multiple times throughout your pipeline. This added exposure can make your sensitive data more vulnerable to security breaches. Plus, you’ll need to comply with data security regulations and standards, such as GDPR, HIPAA, and SOC 2, which add complexity and cost to your process.

Data Scale and Variety. Your data volume, velocity, and variety have most likely increased dramatically in recent years. Big data ingestion can create performance challenges, such as maintaining data quality and conformity to the required format and structure. Plus, your data types and sources may continue to grow, which makes it hard to “future-proof” your data ingestion framework.

Data Fragmentation. Your data can become fragmented and duplicated if different groups in your organization ingest data from the same internal and/or third-party sources.

Data Quality. During a complex data ingestion process, the reliability of your data can be compromised. As part of your data governance framework, you should establish a process to check data quality and completeness.

Data Ingestion Tools & Capabilities

Data ingestion tools are software products that automate the collection and transfer of structured and unstructured data from source to target systems, either in batches or in real-time.

Your source systems will often have different ways of processing and storing data than your target systems. Data ingestion tools and data pipeline software automate the process of extracting data from your many source systems, transforming, combining, and validating that data, and loading it into the target repository.

Four Main Approaches

  • Hand-coding a data ingestion pipeline can give you the most control over your process, but it requires significant development expertise to build and maintain.

  • Data ingestion tools save you from coding your own pipeline by providing pre-built connectors and transformations, typically in a drag-and-drop interface. But this type of single-purpose tool requires you to manage and monitor every pipeline you create.

  • Data integration platforms provide capabilities for every step of your data’s journey. This can deliver an end-to-end pipeline, but it requires development expertise to architect, build, and maintain.

  • A DataOps approach aligns both sides of the data-delivery equation (data engineers and data consumers) and automates as much of the process as possible.

Capabilities and features

Whatever approach you choose, your data ingestion tool should have the following capabilities and features:

  • Data extraction involves collecting data from all required sources, such as databases, IoT devices, SaaS applications, and data lakes.

  • Data processing, whether in scheduled batches or real-time, makes your data ready and available for immediate use within downstream applications (or for storage).

  • Data transformation converts different types of structured, semistructured, and unstructured data into a predefined structure and format before it is delivered to the target system.

  • Security and privacy features, such as encryption and support for secure transfer protocols like HTTPS (HTTP over SSL/TLS), are a must. Sensitive data will need to be obfuscated and protected.

  • Scalability means your tool must be able to handle ever-larger data volumes and workloads.

  • Data flow tracking and visualization make it easier for you to understand the flow of data through your system.

Data Ingestion vs ETL

As stated above, the term “data ingestion” refers to the set of tools and processes used to collect data from various sources and move it to a target site for immediate use or for processing and storage. ETL (Extract, Transform, and Load) pipelines are a particular type of data pipeline.

Below are three key differences between the two:

First, ETL pipelines usually move data to the target system in batches on a regular schedule. Data ingestion pipelines don’t necessarily have to run in batches. They can support real-time processing with streaming computation, which allows data sets to be continuously updated.

Second, ETL pipelines transform data before loading it into the target system. Data ingestion pipelines can either transform data after loading it into the target system (ELT) or not transform it at all.

Third, ETL pipelines end after loading data into the target repository. Data ingestion pipelines can stream data, and therefore their load process can trigger processes in other systems or enable real-time reporting.
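
To make the second difference concrete, here is a schematic Python sketch of the two orderings. The data, the list standing in for the warehouse, and the transformation are trivial stand-ins, not a real API.

    # Schematic contrast: ETL transforms before loading; ELT loads raw data first
    # and transforms inside the target system. All steps here are trivial stand-ins.
    def transform(rows):
        return [r.strip().lower() for r in rows]

    def etl(rows, warehouse):
        warehouse.extend(transform(rows))    # transform BEFORE loading into the target

    def elt(rows, warehouse):
        warehouse.extend(rows)               # land the raw data first...
        warehouse[:] = transform(warehouse)  # ...then transform inside the target

    source = ["  Alice ", "  BOB "]
    etl_target, elt_target = [], []
    etl(source, etl_target)
    elt(source, elt_target)
    print(etl_target == elt_target)  # True: same result, different ordering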

Learn More About Data Integration With Qlik