Member-only story
Change Data Capture(CDC) With DB
In this blog we will see how to capture delete operations from database when we build data pipeline.
Introduction
When we build a data pipeline the first step in the process is to identify the data source and how to extract that data in order to process.
During data extraction INSERT, UPDATE, SELECT queries can be performed without any concern but DELETE operation might be something we should think twice. DELETE operation might be hard delete and once data is deleted from the source we cannot capture it in data pipeline.
In this article, I will explain 3 ways to capture DELETE operation during the data extraction process.
# 1: Common Solution
If the source table in the database has update_timestamp then we can query the database based on what has changed since your last job run.
SELECT * FROM DB.table WHERE update_timestamp>last_run_time
In this case, you can extract soft delete(generally with _isdeleted=true flag) operation but we cannot capture hard delete operation.
Also, if update_timestamp is not indexed, running the above query can cause performance issues as well.