Change Data Capture Defined: How It Works & More
It’s no secret that the data-driven approach can help your company make smarter business decisions, beat your competitors and better serve your clients. According to a study by McKinsey & Company, organizations that are “intensive users of customer analytics” are 23 times more likely to excel at acquiring new customers and 19 times more likely to be highly profitable.
With a practically unlimited supply of data sources, and fresh data constantly being generated, there’s just one issue: How do you know when there’s new information you should integrate into your existing workflow?
Change data capture (CDC) can help you solve this problem. But what is change data capture, and what role does it play during the Extract, Transform, Load (ETL) process?
What is Change Data Capture?
“Change data capture” describes a set of processes and techniques to detect changes made to a source table or database. As the name suggests, the goal of change data capture is to identify when information in a database changes and then incorporate those changes into the data integration process.
Once the changes are identified, the corresponding rows or records will move via data integration to the appropriate destination (usually a centralized repository such as a data warehouse or data lake). You can then use this new information, together with the repository’s existing contents, for business intelligence (BI) and analytics workflows. CDC ensures that the data in your source and target locations stays in sync, allowing your data repository to serve as a single source of truth for the entire organization.
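The core idea can be sketched in a few lines: CDC produces a stream of change events from the source, and the target stays in sync by replaying them. The event format below is hypothetical, purely for illustration.

```python
# Minimal sketch of the CDC idea: replay a stream of captured change
# events onto a target store so it mirrors the source. The
# (operation, key, row) event shape here is an assumption, not a
# standard format.

def apply_changes(target, events):
    """Replay insert/update/delete events onto a target keyed by id."""
    for op, key, row in events:
        if op in ("insert", "update"):
            target[key] = row          # upsert the changed row
        elif op == "delete":
            target.pop(key, None)      # remove the deleted row
    return target

target = {1: {"name": "Ada"}}
events = [
    ("update", 1, {"name": "Ada Lovelace"}),
    ("insert", 2, {"name": "Grace"}),
    ("delete", 1, None),
]
apply_changes(target, events)
print(target)  # {2: {'name': 'Grace'}}
```

Real CDC tools differ in *how* they capture the events, which is exactly what the methods below address.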
While change data capture is most often associated with standard relational databases, in theory CDC can work with any data store that allows you to identify the changes between its current and previous states. This includes relational and non-relational databases, SaaS (software as a service) applications, CRM (customer relationship management) software, ERP (enterprise resource planning) systems, mainframes and more.
The Types of Change Data Capture and How They Work
There are a number of change data capture methods, each with its own pros and cons. The types of change data capture include:
Log-based: A log-based change data capture solution looks at a database’s transaction log, which is a metadata file that records all database transactions to enable faster recovery after a crash. Since the information in a transaction log is generally complete and of high quality, log-based CDC produces a high level of accuracy. However, because different database systems (e.g., Microsoft SQL Server, PostgreSQL, MySQL, etc.) may have proprietary transaction log file formats, log-based CDC can be fairly difficult to program correctly.
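The consumption pattern behind log-based CDC can be modeled simply. Real transaction logs (PostgreSQL's WAL, MySQL's binlog, SQL Server's log) are proprietary binary formats, so the list of (LSN, operation, row) tuples below is only a stand-in to show how a reader tracks its position in the log.

```python
# Hedged sketch of the log-reading pattern behind log-based CDC.
# The "log" is modeled as a list of (lsn, operation, row) tuples,
# where lsn is a monotonically increasing log sequence number.

def read_changes(log, last_lsn):
    """Return entries past the reader's saved position, plus the new position."""
    new_entries = [entry for entry in log if entry[0] > last_lsn]
    new_lsn = new_entries[-1][0] if new_entries else last_lsn
    return new_entries, new_lsn

log = [
    (101, "insert", {"id": 1, "name": "Ada"}),
    (102, "update", {"id": 1, "name": "Ada Lovelace"}),
    (103, "delete", {"id": 1}),
]

# The reader last stopped at LSN 101, so only 102 and 103 are new.
changes, pos = read_changes(log, last_lsn=101)
print(len(changes), pos)  # 2 103
```

Persisting that saved position between runs is what lets log-based CDC resume without missing or re-reading changes.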
Trigger-based: A trigger-based change data capture solution makes use of database triggers, i.e., functions that fire when a given event occurs (such as an insert, update or delete on a table). Trigger-based CDC makes it easier to identify changes to the database, but it also adds overhead to the database itself since these trigger functions execute with every write.
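Here is a small, runnable illustration of the trigger-based approach using SQLite (trigger syntax varies by database). Triggers copy each change into an audit table that an ETL job can later drain; the table and trigger names are illustrative.

```python
import sqlite3

# Sketch of trigger-based CDC: every insert/update/delete on the
# source table is recorded in a changes table by a trigger.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customers_changes (op TEXT, id INTEGER, name TEXT);

    CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
        INSERT INTO customers_changes VALUES ('insert', NEW.id, NEW.name);
    END;
    CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
        INSERT INTO customers_changes VALUES ('update', NEW.id, NEW.name);
    END;
    CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
        INSERT INTO customers_changes VALUES ('delete', OLD.id, OLD.name);
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada Lovelace' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

rows = conn.execute("SELECT op, id FROM customers_changes").fetchall()
print(rows)  # [('insert', 1), ('update', 1), ('delete', 1)]
```

Note that each write now performs two operations (the write plus the audit insert), which is the overhead mentioned above.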
Database columns: Many databases include columns such as LAST_UPDATED or DATE_MODIFIED, which record the time and date of the last change to a row. You can filter the records in a database by this DATE_MODIFIED column to identify the records that have changed since the last time you extracted information from the database. This change data capture method is straightforward, but it cannot detect deleted records (since they are no longer present in the database), and it may be inflexible to changes in the database schema.
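In practice, the column-based method boils down to one query: select only the rows whose timestamp is later than the previous extraction. The table and column names below are illustrative.

```python
import sqlite3

# Sketch of column-based CDC: filter rows by a last_updated timestamp
# column to find everything changed since the previous extraction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, last_updated TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2024-01-01 09:00:00"),
    (2, "2024-01-02 14:30:00"),
    (3, "2024-01-03 08:15:00"),
])

# Timestamp saved at the end of the previous ETL run.
last_extract = "2024-01-01 12:00:00"

changed = conn.execute(
    "SELECT id FROM orders WHERE last_updated > ? ORDER BY last_updated",
    (last_extract,),
).fetchall()
print(changed)  # [(2,), (3,)]
```

A row deleted from `orders` simply never appears in this result, which is why the method misses deletions.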
Diff: The diff command in Unix-based operating systems (and equivalents such as fc in Windows) can help you compare and identify the differences between two different database files — the current state and the previous state at the time of the last data extraction. Using diff as your change data capture solution can assist with detecting deleted records (since they will be present in the previous file), but performing scans of both files may be computationally intensive for large databases.
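The same comparison logic can be applied programmatically to two table snapshots rather than raw files. This sketch classifies inserts, updates and deletes by comparing a previous snapshot with the current one; note that, unlike timestamp columns, it does surface deleted rows.

```python
# Sketch of diff-style CDC: compare the previous and current snapshots
# of a table (keyed by primary key) to classify every change.

def snapshot_diff(previous, current):
    """Return (inserted, updated, deleted) rows between two snapshots."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserted, updated, deleted

prev = {1: "Ada", 2: "Grace", 3: "Edsger"}
curr = {1: "Ada Lovelace", 2: "Grace", 4: "Barbara"}

ins, upd, dele = snapshot_diff(prev, curr)
print(ins, upd, dele)  # {4: 'Barbara'} {1: 'Ada Lovelace'} {3: 'Edsger'}
```

The cost is that both snapshots must be stored and scanned in full on every comparison, which is the computational burden noted above.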
Change Data Capture and ETL
Of course, change data capture is tremendously important for the ETL process. ETL first extracts the information from one or more databases and data sources and then transforms it as necessary to fit the target location. Finally, ETL loads the transformed data into a data warehouse or data lake.
ETL is a challenging and technically complex process from start to finish, and the extract phase is no exception. For example, how does your ETL tool know which information it should replicate? Extracting an entire database is tremendously wasteful when most of it is already present within your centralized data repository, especially when you consider that this information will need to pass through the transformation phase as well.
That’s where change data capture comes in. The right CDC solution can isolate exactly those fields and records that have changed since the last run of ETL and then target them for extraction. By preventing untold hours of unnecessary work, change data capture is an essential part of modern ETL and data integration workflows.
ETL often breaks down into batch processing and real-time ETL:
Batch ETL is the traditional method of ETL: It extracts, transforms and loads new information in batches at regularly scheduled intervals. For example, batch ETL jobs may run overnight to generate new analytical insights and reports that key decision-makers can read with their morning coffee.
Real-time ETL (also known as streaming ETL) is an ETL method that ingests new information as soon as possible.
Both real-time and batch ETL have their advantages and use cases. The traditional batch ETL approach is technically simpler and compatible with legacy IT, while real-time ETL can help enable up-to-the-minute analyses and is more compatible with cloud environments.
The good news is that change data capture can work with either batch or real-time ETL. Recall, for example, that trigger-based CDC uses database triggers, which are executed in response to a change in the underlying data. This makes trigger-based CDC a good fit for real-time ETL, which will facilitate new insights as quickly as possible.
On the other hand, methods such as diff are more computationally intensive because they require scanning through files to identify the latest changes to a database. This makes these methods more suitable for a batch ETL approach, in which the ETL process runs less frequently.
There’s just one question left: How can you perform change data capture for your own ETL and data integration needs? Fortunately, modern ETL tools usually come with built-in CDC capabilities so that you can automatically detect the data that has changed since the previous execution. When choosing an ETL tool, look for one with CDC functionality that is compatible with the kind of ETL workflow you want to build.