SSIS CDC: Changes Made To Source During Full Load
Carrying on with our look at CDC I want to introduce a common pattern that most people will see in the course of a day and also look at how this is dealt with using Change Data Capture in SSIS.
CDC is usually executed in two parts. The first part is the full load which essentially takes all the data in the source and deposits it into the destination. The second part is the CDC package which identifies changes in your source data and by means of the CDC Source Adapter introduces them into the data flow. This part is usually executed on a schedule through something like SQL Server Agent.
Most, if not all, of the examples you see shown doing anything with CDC in SSIS revolve around this pattern
FULL LOAD START –> DATA FLOW –> FULL LOAD END
CHANGES TO SOURCE DATA
CDC PACKAGE START –> DATA FLOW –> CDC PACKAGE END
<REPEAT ABOVE 2 STEPS>
This is fine for simple demonstrations of what happens however the chances of your source table standing still during the initial load (remember this could potentially take hours) is slim. The pattern I want to look at is when changes are made to the source data whilst the Full Load package is executing.
FULL LOAD START (FLS) –> CHANGE SOURCE (C1) –> DATA FLOW (DFT) –> CHANGE SOURCE (C2) –> FULL LOAD END (FLE) –> CHANGE SOURCE (C3)
Let me explain the problem with the above pattern if we fail to understand what is going on.
The extract in DFT is normally a very simple
SELECT <column List> FROM <table>
Let us imagine that in our step C1 we introduce 10 rows to the source and in step C2 we delete 3 rows from the source and introduce 10 more rows. Finally in step C3 we introduce a further 10 rows to the source.
(the above is just for example as the changes could be anything)
The problem with this is that the DFT step will pick up the newly introduced rows in step C1 and transport them to the destination. These changes will also be picked up as changes and deposited in our capture instance table ready for when the CDC package runs later. After the full load finishes the destination will have the original rows in the source + 10 newly inserted rows. The change table will have 30 rows representing the 30 newly inserted rows and 3 rows relating to the deletes.
On running the CDC package for the first time the changes made in C1 and C2 will come through potentially reintroducing 10 rows into the destination that exist already. The changes made in C3 will not come through in the first run of the CDC package after a full load but they will come through in the second.
Have a quick read of this article
In that article we talk about needing to execute the CDC package twice after the full load to see changes made at the source after the full load has finished. This is because the first run of the CDC package after the Full load has finished is reserved for changes made during the Full load.
Back to our problem with reintroducing 10 rows to the destination. How do we deal with this?
Well, from the above we know that on the first run of the CDC package we are only going to see the changes made in C1 and C2. In the CDC package on the CDC Source Adapter we can choose to add an attribute "__$reprocessing" by checking the box "Include reprocessing indicator column". This attribute has two uses. Its value will be true when:
- The CDC processing range overlaps with the initial processing range (the range of LSNs corresponding to the period of the initial load) or
2. When a CDC processing range is reprocessed following an error in a previous run.
You can now use this attribute therefore to perform further logic on the incoming data. When the value is true perform a lookup against the destination to check the keys and values in the rows and decide what to do from there perhaps.
My advice for CDC using SSIS will be that as part of the initial phase (The full load) you couple that with a CDC load. After that then you simply schedule the CDC loads as normal
Hopefully this article will help you to understand what is going on in your CDC packages.