Core Concepts

In today's data-driven business environment, efficient data integration and management are crucial for organizations seeking to leverage their data assets effectively. UnifyApps offers a robust data pipeline solution that enables seamless data extraction, transformation, and loading across various sources and destinations. This article explores the core concepts behind UnifyApps' data pipeline architecture and functionality.

Data Extraction Methodologies

UnifyApps employs several sophisticated approaches to extract data from different types of sources:

Database Extraction

When working with traditional and modern databases, UnifyApps prioritizes real-time data capture through:

  • Native Change Data Capture (CDC) - Leveraging built-in CDC mechanisms to track and capture changes as they occur

  • Log-based extraction - Reading database log files such as:

    • MySQL's binlog

    • Oracle's redo logs

These methods minimize performance impact on source systems while ensuring comprehensive data capture.
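As an illustration of the log-based approach, the sketch below tails a MySQL binlog using the open-source python-mysql-replication package. The connection settings and server_id are placeholder assumptions, and this is not UnifyApps' internal connector; it only shows the general shape of log-based capture.

```python
# Sketch of log-based change capture from MySQL's binlog using the
# open-source python-mysql-replication package. Connection details and
# server_id are placeholders; this is not UnifyApps' internal connector.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,        # unique replica id for this reader
    blocking=True,        # wait for new events instead of exiting
    resume_stream=True,   # continue from the last seen binlog position
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:
    for row in event.rows:
        # Inserts expose row["values"]; updates expose row["before_values"]
        # and row["after_values"]; deletes expose the removed row["values"].
        print(type(event).__name__, event.table, row)
```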

Data Warehouse Extraction

For data warehouses, UnifyApps adapts its approach based on the capabilities of the source system:

  • Native CDC integration - Where supported by the warehouse

  • S3-based unloading - Extracting data to S3 storage before processing it through the pipeline

  • Periodic polling - Scheduled data retrieval at configured intervals (see the sketch after this list)
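The sketch below illustrates the unload-and-poll pattern with boto3: a scheduled loop lists an S3 prefix and picks up files that the warehouse has unloaded since the previous poll. The bucket name, prefix, and five-minute interval are hypothetical values chosen for the example.

```python
# Sketch of the "unload to S3, then poll" pattern using boto3. The bucket,
# prefix, and five-minute interval are hypothetical example values.
import time
import boto3

s3 = boto3.client("s3")
seen: set[str] = set()   # keys already handed to the pipeline

def poll_unloaded_files(bucket: str, prefix: str) -> list[str]:
    """Return S3 keys that have appeared since the previous poll."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    new_keys = [obj["Key"] for obj in resp.get("Contents", [])
                if obj["Key"] not in seen]
    seen.update(new_keys)
    return new_keys

while True:
    for key in poll_unloaded_files("warehouse-unload", "exports/orders/"):
        print("new unload file ready for processing:", key)
    time.sleep(300)   # configured polling interval
```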

Application and Data Storage Extraction

When working with SaaS applications and other data storage systems, UnifyApps employs:

  • Regular data polling - Scheduled API calls to retrieve new or modified data

  • Change webhooks - Integration with application webhook systems to receive real-time notifications of data changes (a minimal receiver is sketched below)
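For the webhook path, a receiving endpoint simply accepts change notifications and hands them to the pipeline. The minimal Flask sketch below is an assumption about what such a receiver might look like; the route, payload shape, and queueing helper are illustrative, not part of the UnifyApps API.

```python
# Minimal webhook receiver sketch using Flask. The route, the payload
# shape, and the queueing helper are illustrative assumptions, not the
# UnifyApps API.
from flask import Flask, request

app = Flask(__name__)

def enqueue_for_pipeline(event: dict) -> None:
    """Hand the change notification to the ingestion pipeline (stubbed)."""
    print("queued change:", event)

@app.route("/webhooks/crm", methods=["POST"])
def handle_change_notification():
    # A typical notification identifies the object and the change type,
    # e.g. {"object": "contact", "id": "123", "action": "updated"}.
    event = request.get_json(force=True)
    enqueue_for_pipeline(event)
    return "", 204          # acknowledge quickly; processing happens async

if __name__ == "__main__":
    app.run(port=8080)
```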

Synchronization Phases

The UnifyApps data pipeline operates in two distinct phases:

Snapshot Phase

The snapshot phase is the first step in data synchronization. During this phase, UnifyApps copies all existing data from the source system before starting ongoing updates. Think of it as taking a complete "photograph" of your data at a specific moment. 

Real-Time Phase

Following the snapshot, the real-time phase continuously captures and processes all new data changes occurring after pipeline deployment, maintaining data synchronization between source and destination.
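Conceptually, the two phases hand off at a recorded log position: the snapshot copies everything that already exists, and the real-time phase replays changes from that position forward. The sketch below uses hypothetical source and destination helpers to show that handoff.

```python
# Conceptual sketch of the two synchronization phases. The source and
# destination objects are hypothetical placeholders for real connectors.
def run_pipeline(source, destination):
    # Snapshot phase: remember the current log position, then copy every
    # existing row so the destination starts from a complete picture.
    start_position = source.current_log_position()
    for batch in source.read_all_rows(batch_size=10_000):
        destination.write(batch)

    # Real-time phase: replay changes from the remembered position onward,
    # keeping the destination continuously in sync from then on.
    for change in source.stream_changes(since=start_position):
        destination.apply(change)
```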

Checkpoint Management

A critical aspect of UnifyApps' reliability is its checkpoint management system. Whenever a pipeline is:

  • Paused

  • Redeployed

  • Resumed

the system maintains precise checkpoints that track exactly which data has been processed. This ensures that when operations resume, the pipeline continues from the exact point of stoppage without data loss or duplication.
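The idea can be reduced to a small sketch: persist the last processed position durably, and on resume start streaming from it. In practice checkpoints are stored transactionally inside the processing framework rather than in a local file, so treat the JSON file and the helper callables below as stand-ins.

```python
# Simplified illustration of checkpoint management. Real pipelines keep
# checkpoints transactionally inside the processing framework; the local
# JSON file and the callables passed in are stand-ins for this sketch.
import json
import os

CHECKPOINT_FILE = "pipeline_checkpoint.json"

def load_checkpoint() -> dict:
    """Return the last saved position, or a fresh checkpoint."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"log_position": None}

def save_checkpoint(position) -> None:
    """Durably record how far the pipeline has processed."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"log_position": position}, f)

def run_with_checkpoints(stream_changes, apply_to_destination):
    # On resume (after a pause, redeploy, or restart) the pipeline picks
    # up exactly where the last saved checkpoint left off.
    checkpoint = load_checkpoint()
    for change in stream_changes(since=checkpoint["log_position"]):
        apply_to_destination(change)          # no change is skipped...
        save_checkpoint(change["position"])   # ...and none is replayed twice
```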

Deployment Infrastructure

UnifyApps leverages Kubernetes and Apache Flink for robust, scalable pipeline deployment:

  • Kubernetes provides the container orchestration platform

  • Apache Flink delivers the distributed processing framework

This combination enables high availability, fault tolerance, and efficient resource utilization for data pipelines of any scale.
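For a feel of what a Flink job with checkpointing looks like, here is a tiny PyFlink sketch. It is not how UnifyApps defines its jobs internally, and packaging the job for Kubernetes (for example through the Flink Kubernetes Operator) is a separate step; the sketch only shows parallelism and checkpoint settings on a trivial stream.

```python
# Tiny PyFlink sketch showing the kind of job the pipeline runs on Flink.
# Actual UnifyApps job definitions are managed for you; this only
# illustrates parallelism and checkpointing settings.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(2)            # scale out across task slots
env.enable_checkpointing(60_000)  # checkpoint every 60 seconds

# A trivial in-memory collection stands in for a real CDC source.
env.from_collection([("orders", "insert"), ("orders", "update")]) \
   .map(lambda change: f"{change[1]} on {change[0]}") \
   .print()

env.execute("example-pipeline")
```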

Data Operations Support

The UnifyApps pipeline supports all standard data operations:

  • Inserts - Adding new records to the destination

  • Updates - Modifying existing records

  • Deletes - Removing records from the destination

A key feature is the update handling mechanism based on upsert keys. These keys are determined through schema mapping between source and destination systems, ensuring accurate record matching during update operations.
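The sketch below models the destination as a dictionary keyed by a hypothetical upsert key (order_id) to show how inserts, updates, and deletes resolve against that key; a real destination would apply the same logic through a merge or upsert statement.

```python
# Illustration of upsert-key based change handling. The destination is
# modeled as an in-memory dict keyed by a hypothetical upsert key
# (order_id); a real destination would express the same logic as a
# merge/upsert against a table.
destination: dict[str, dict] = {}
UPSERT_KEY = "order_id"   # determined by the source-to-destination schema mapping

def apply_change(op: str, record: dict) -> None:
    key = record[UPSERT_KEY]
    if op in ("insert", "update"):
        destination[key] = record        # upsert: create or overwrite the row
    elif op == "delete":
        destination.pop(key, None)       # remove the row if it exists

apply_change("insert", {"order_id": "A1", "status": "new"})
apply_change("update", {"order_id": "A1", "status": "shipped"})
apply_change("delete", {"order_id": "A1"})
print(destination)   # {} - the record was created, updated, then removed
```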

Conclusion

UnifyApps' data pipeline architecture represents a comprehensive approach to modern data integration challenges. By combining multiple extraction methodologies, robust synchronization phases, reliable checkpoint management, and enterprise-grade deployment infrastructure, UnifyApps enables organizations to build resilient, efficient data pipelines that support their data-driven initiatives.

Whether integrating databases, data warehouses, or applications, UnifyApps provides the foundation for seamless data flow across the enterprise technology ecosystem.