SQL Server incremental data synchronization

Today I will talk about SQL Server incremental data synchronization. As a company's business data volume grows, the data analysis department wants to build reports and extract useful operational data, so the people in charge usually need to synchronize data of various structures from the various business systems into a data warehouse, data lake, and so on.

Most other types of databases are deployed on Linux, and they have many mature solutions for synchronizing data.

So what incremental synchronization options does SQL Server offer?

Option 1: Change Tracking (CT)

Change tracking is table-level tracking that records only the fact that a row changed, not how many times it changed or any intermediate values. Because only the row identifier is stored, the table must have a primary key. When DML is performed on the table, a system table records the primary key and generates its own row version.

Where is the actual row data? Still in the original table, of course. By joining the tracked primary key back to the original table, you can find the newly inserted and most recently changed rows. For deleted records the row is gone from the original table, but the recorded primary key can be propagated to the target table to perform the delete there. By recording the version, each sync picks up only the most recent operations.

[Image: CT schematic]
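A minimal sketch of how change tracking might be used; the database name `MyDb`, the table `dbo.Orders`, and the retention settings are illustrative assumptions, not from the original article:

```sql
-- Enable change tracking at the database level (retention values are illustrative)
ALTER DATABASE MyDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

-- Enable it on a table that has a primary key
ALTER TABLE dbo.Orders ENABLE CHANGE_TRACKING;

-- Read everything changed since the version saved by the last sync,
-- joining back to the source table for the current row values
DECLARE @last_sync_version bigint = 0;  -- persisted by the sync job
SELECT ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION, ct.ID, o.Name
FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
LEFT JOIN dbo.Orders o ON o.ID = ct.ID;  -- NULL row values indicate a delete

-- Save this value as @last_sync_version for the next run
SELECT CHANGE_TRACKING_CURRENT_VERSION();
```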

However, change tracking is rarely used in practice. I think there are several reasons:

Intermediate values of a change are not recorded, so it is unsuitable for data with strict transactional requirements.

Frequent DML on the source table means equally frequent writes to the tracking table, which also costs some performance.

Few people know about it; after all, almost all third-party synchronization tools are built on change data capture (CDC), not change tracking.

Strictly speaking, change tracking is not true incremental synchronization; it amounts to synchronizing the current snapshot of each changed row.

Option 2: Change Data Capture (CDC)

CDC should be familiar to many big data developers. CDC is also table-level data tracking, and it too uses a table for recording and tracking. When DML runs against the source table, the operations are written to the transaction log; the change data capture agent reads the transaction log asynchronously and applies the results of those operations to a copy table (the capture instance table), thereby recording and tracking the source table's operations. This works even when the database recovery model is simple. The principle of CDC is shown in the figure below.

[Image: CDC principle]
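As a sketch, enabling CDC and reading a window of changes might look like the following; the schema, table, and capture-instance names (`dbo`, `Orders`, `dbo_Orders`) are assumptions for illustration:

```sql
-- Enable CDC for the database, then for the table
-- (the capture job requires SQL Server Agent to be running)
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Orders',
     @role_name     = NULL;  -- no gating role

-- Pull all changes between two LSNs from the generated
-- capture-instance function
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders'),
        @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```

An incremental job would persist `@to_lsn` after each run and use `sys.fn_cdc_increment_lsn` on it as the start of the next window.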

For incremental extraction, the application records the time or log sequence number (LSN) of each sync, and the next read fetches only changes with a greater LSN. Because CDC reads the database transaction log, its capture logic is embedded in the stored procedure sp_replcmds (transactional replication uses the same process).

CDC is more commonly used, but when new columns are added to a table, they are not tracked. The usual workaround is: after adding the columns, enable a second CDC capture instance on the table, then drop the original instance once its backlog of tracking records has been synchronized. There is another approach to added columns: I have a script (link below) that, when executed after a column is added, creates a temporary CDC capture instance, replaces the original instance's stored procedure using the stored-procedure definitions generated for the new instance, and then disables the new instance. That way you don't have to wait for the data synchronization to finish, and you can keep using the system objects of the first CDC instance enabled on the table.

https://github.com/hzc2012/SQLServer/blob/master/Scripts/Sync_CDC_Table_Columns_To_CT

Option 3: UpdatedTime field

This is also a very common synchronization scheme: add a time column to the table with a default of the current time, GETDATE(). Unlike MySQL, however, SQL Server does not update this column automatically; the only option is to add an UPDATE trigger that sets UpdatedTime to the current time whenever row data changes. For example:

```sql
CREATE TABLE [dbo].[TestTab](
  ID INT IDENTITY(1,1) NOT NULL,
  Name VARCHAR(50) NOT NULL,
  UpdatedTime DATETIME NOT NULL DEFAULT(GETDATE())
)
GO
CREATE TRIGGER tr_TestTab_update
  ON dbo.TestTab
  AFTER UPDATE
AS
BEGIN
  UPDATE t SET UpdatedTime = GETDATE()
  FROM dbo.TestTab t
  INNER JOIN inserted i ON t.ID = i.ID
END
GO
```

Synchronization then performs incremental queries by time to capture the changed rows. Triggers, though, tend to meet resistance: every update fires an extra join-and-update against the table, adding system overhead, and rewriting a time column on every data change is not ideal either. What to do? Let's move on to the next option!
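The incremental pull itself can be sketched as a simple watermark query against the table above; the `@last_sync` value and its starting date are illustrative, persisted by whatever sync job you run:

```sql
-- Watermark from the previous run (value is illustrative)
DECLARE @last_sync datetime = '2024-01-01';

SELECT ID, Name, UpdatedTime
FROM dbo.TestTab
WHERE UpdatedTime > @last_sync
ORDER BY UpdatedTime;
-- Persist MAX(UpdatedTime) as the next watermark.
-- Note: deletes leave no row behind, so this scheme cannot capture them.
```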

Option 4: rowversion / timestamp

This option is unfamiliar to most; probably only the more seasoned SQL Server DBAs know it. Many will have heard of timestamp, though. In SQL Server, timestamp is simply a synonym (you could say an alias) for rowversion, and is the name used in earlier versions of SQL Server. timestamp is deprecated and will be removed in a future version of SQL Server, so let's use rowversion.

rowversion is a data type that exposes unique, automatically generated binary numbers within a database, and is usually used as a mechanism for version-stamping table rows. Its storage size is 8 bytes. A rowversion value is just an incrementing number; it does not preserve a date or time. A non-nullable rowversion column is semantically equivalent to a binary(8) column, and a nullable one to varbinary(8).

Let's try an example and insert 5 rows of data.

```sql
CREATE TABLE TestTab (
  id INT NOT NULL,
  rv ROWVERSION
);
GO
INSERT INTO TestTab(id) VALUES(1)
GO 5
SELECT * FROM TestTab
GO
/*
id  rv
1   0x00000000000007D5
1   0x00000000000007D6
1   0x00000000000007D7
1   0x00000000000007D8
1   0x00000000000007D9
*/
SELECT @@DBTS AS CurrentRV, MIN_ACTIVE_ROWVERSION() AS NextAvailableRV
/*
0x00000000000007D9  0x00000000000007DA
*/
```

The system variable @@DBTS and the function MIN_ACTIVE_ROWVERSION() show the database's current row version. You can also create several tables to verify that the row version counter is database-wide. For incremental synchronization, however, the last row version is still stored per table, and the next read fetches that table's rows with a greater value. (Sounds similar to Option 3, doesn't it?)
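The incremental read then looks much like Option 3, only keyed on the rowversion column of the example table above; the `@last_rv` watermark variable is illustrative:

```sql
-- Per-table watermark saved from the previous run (illustrative)
DECLARE @last_rv binary(8) = 0x0;

SELECT id, rv
FROM TestTab
WHERE rv > @last_rv   -- binary values compare in increasing order
ORDER BY rv;
-- Persist MAX(rv) as the next watermark.
-- Like Option 3, this captures inserts and updates but not deletes.
```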

To sum up:

Compared with other databases, SQL Server's incremental data synchronization schemes are somewhat weaker; after all, it is closed source, and external tools can only connect to the database and run queries. Still, many third-party tools do this well, and they perform incremental extraction based on the schemes above: Oracle GoldenGate and Debezium build on CDC, for example, while alibaba/DataX for SQL Server extracts data by query. Depending on your business scenario and architecture, it is best to choose the solution that suits you.