SSIS CDC: Changes Made To Source During Full Load

Posted on June 7, 2012. Filed under: CDC, SSIS | Tags: , |

Carrying on with our look at CDC I want to introduce a common pattern that most people will see in the course of a day and also look at how this is dealt with using Change Data Capture in SSIS.

CDC is usually executed in two parts.  The first part is the full load which essentially takes all the data in the source and deposits it into the destination. The second part is the CDC package which identifies changes in your source data and by means of the CDC Source Adapter introduces them into the data flow.  This part is usually executed on a schedule through something like SQL Server Agent.

Most, if not all, of the examples you see shown doing anything with CDC in SSIS revolve around this pattern

Part 1
FULL LOAD START –> DATA FLOW –> FULL LOAD END

Part 2
CHANGES TO SOURCE DATA
CDC PACKAGE START –> DATA FLOW –> CDC PACKAGE END
<REPEAT ABOVE 2 STEPS>
 

This is fine for simple demonstrations of what happens however the chances of your source table standing still during the initial load (remember this could potentially take hours) is slim.  The pattern I want to look at is when changes are made to the source data whilst the Full Load package is executing.

FULL LOAD START (FLS) –> CHANGE SOURCE (C1) –> DATA FLOW (DFT) –> CHANGE SOURCE (C2) –> FULL LOAD END (FLE) –> CHANGE SOURCE (C3)

Let me explain the problem with the above pattern if we fail to understand what is going on.

The extract in DFT is normally a very simple

SELECT <column List> FROM <table>

Let us imagine that in our step C1 we introduce 10 rows to the source and in step C2 we delete 3 rows from the source and introduce 10 more rows.  Finally in step C3 we introduce a further 10 rows to the source.

(the above is just for example as the changes could be anything)

The problem with this is that the DFT step will pick up the newly introduced rows in step C1 and transport them to the destination.  These changes will also be picked up as changes and deposited in our capture instance table ready for when the CDC package runs later. After the full load finishes the destination will have the original rows in the source + 10 newly inserted rows.  The change table will have 30 rows representing the 30 newly inserted rows and 3 rows relating to the deletes.
On running the CDC package for the first time the changes made in C1 and C2 will come through potentially reintroducing 10 rows into the destination that exist already.  The changes made in C3 will not come through in the first run of the CDC package after a full load but they will come through in the second.

Have a quick read of this article

In that article we talk about needing to execute the CDC package twice after the full load to see changes made at the source after the full load has finished.  This is because the first run of the CDC package after the Full load has finished is reserved for changes made during the Full load.

Back to our problem with reintroducing 10 rows to the destination.  How do we deal with this?

Well, from the above we know that on the first run of the CDC package we are only going to see the changes made in C1 and C2.  In the CDC package on the CDC Source Adapter we can choose to add an attribute "__$reprocessing" by checking the box "Include reprocessing indicator column".  This attribute has two uses.  Its value will be true when:

  1. The CDC processing range overlaps with the initial processing range (the range of LSNs corresponding to the period of the initial load) or
    2.  When a CDC processing range is reprocessed following an error in a previous run.

You can now use this attribute therefore to perform further logic on the incoming data.  When the value is true perform a lookup against the destination to check the keys and values in the rows and decide what to do from there perhaps.

My advice for CDC using SSIS will be that as part of the initial phase (The full load) you couple that with a CDC load.  After that then you simply schedule the CDC loads as normal

Hopefully this article will help you to understand what is going on in your CDC packages.

3 Responses to “SSIS CDC: Changes Made To Source During Full Load”

RSS Feed for Copper Blue Consulting Comments RSS Feed

Hi ,
Good post on CDC in SSIS for 2012. I am noticing one scenario that is an issue and maybe you can help.

My setup:
• I have 2 source tables with CDC setup
• I created an initial load job to: CDC ‘Mark Initial Load’, Load the 2 tables from source, CDC ‘Mark initial load end’
• I created one incremental load package with two dataflows. One dataflow to process the CDC for each table.
• The incremental load has the CDC ‘Get Procesisng Range’, Run two dataflows, CDC ‘Mark Processing Range’

This works great. I run the incremental load job on a daily basis and everything works fine. Day 1 and 2 captured the changes in my two target table

Scenario (day 3):
• A third table is added to CDC in source.
• Many changes are also happening in source on the two tables already setup with CDC
• I ran the initial load job (same as above except only loading the new table) on the new target table.
• I added a third dataflow to my incremental job to capture daily CDC on the new table.
• That night on day 3 the incremental load runs and the changes to the 2 old tables are not processed.
• On day four changes to all three tables are processed fine again.

Assumption: It seems the initial load on the new table shifted the CDC_State values and all changes for that day were in the past (not picked up)

Question: Do you know what could be happening when on days when adding new tables and doing initial load on them?

Thanks

Paul

By re-running the “Full” package even though you only move the “3rd” table you bump the lsn in your state table. seeing as the changes to table1 and table2 did not happen within the period of the re-run of the full load then CDC has no idea that they happened (it does because they will be in your change table but you have essentially told CDC to skip them)

Does this help?

Allan

Excellant Allan. Thank You. I guess my solution is to run the incemental CDC for my other two tables before the initial load of the new third table. This will get me up synchronized on all three tables (my source is static at time of my loads). I then add the third table to my daily incremental package and all should be fine the next day.


Comments are closed.

Liked it here?
Why not try sites on the blogroll...

Follow

Get every new post delivered to your Inbox.

Join 201 other followers