Microsoft SQL Server (MSSQL)
Overview
This page contains the setup guide and reference information for the Microsoft SQL Server (MSSQL) source connector.
Features
Feature Name | Supported |
---|---|
Full Import | Yes |
Incremental Import | Yes |
Change data capture | Yes |
SSL Support | Yes |
SSH Tunnel Connection | Yes |
Data types mapping
MSSQL Type | Resulting Type | Notes |
---|---|---|
bigint | number | |
binary | string | |
bit | boolean | |
char | string | |
date | number | |
datetime | string | |
datetime2 | string | |
datetimeoffset | string | |
decimal | number | |
int | number | |
float | number | |
geography | string | |
geometry | string | |
money | number | |
numeric | number | |
ntext | string | |
nvarchar | string | |
nvarchar(max) | string | |
real | number | |
smalldatetime | string | |
smallint | number | |
smallmoney | number | |
sql_variant | string | |
uniqueidentifier | string | |
text | string | |
time | string | |
tinyint | number | |
varbinary | string | |
varchar | string | |
varchar(max) COLLATE Latin1_General_100_CI_AI_SC_UTF8 | string | |
xml | string |
If you do not see a type in this list, assume that it is coerced into a string.
Getting started
Requirements and prerequisites
- MSSQL Server
Azure SQL Database
,Azure Synapse Analytics
,Azure SQL Managed Instance
,SQL Server 2019
,SQL Server 2017
,SQL Server 2016
,SQL Server 2014
,SQL Server 2012
,PDW 2008R2 AU34
. - A dedicated read-only user with access to all tables needed for replication.
Set up guide
- Create a new source, select MSSQL (Airbyte) from the Source type dropdown and enter a name for it.
- On the source setup page, create a new MSSQL secret, enter a unique name for it.
- Enter the following mandatory secret properties:
Host, Port, Database, Username, Password, SSL Method
- Depending on your preferences, please choose one of the options for:
- SSH Tunnel Method:
No Tunnel
,SSH Key Authentication
orPassword Authentication
(you can read how to configure SSH secret below). - Replication Method:
Standard
orLogical Replication (CDC)
(you can read how to enable and configure CDC replication below).
- SSH Tunnel Method:
Connection to MSSQL via an SSH Tunnel
The reason you might want to do use SSH tunneling is that it is not possible (or against security policy) to connect to the database directly (e.g. it does not have a public IP address).
When using an SSH tunnel, you are configuring a source to connect to an intermediate server (a.k.a. a bastion sever) that does have direct access to the database. A source connects to the bastion and then asks the bastion to connect directly to the server.
Using this feature requires additional configuration, when creating the source. We will talk through what each piece of configuration means.
- Configure all fields for the source as you normally would, except
SSH Tunnel Method
. SSH Tunnel Method
defaults toNo Tunnel
(meaning a direct connection). If you want to use an SSH Tunnel chooseSSH Key Authentication
orPassword Authentication
.- Choose
Key Authentication
if you will be using an RSA private key as your secret for establishing the SSH Tunnel (see below for more information on generating this key). - Choose
Password Authentication
if you will be using a password as your secret for establishing the SSH Tunnel.
- Choose
SSH Tunnel Jump Server Host
refers to the intermediate (bastion) server that a source will connect to. This should be a hostname or an IP Address.SSH Connection Port
is the port on the bastion server with which to make the SSH connection. The default port for SSH connections is22
, so unless you have explicitly changed something, go with the default.SSH Login Username
is the username that a source should use when connection to the bastion server. This is NOT the MSSQL username.- If you are using
Password Authentication
, thenSSH Login Username
should be set to the password of the User from the previous step. If you are usingSSH Key Authentication
leave this blank. Again, this is not the MSSQL password, but the password for the OS-user that a source is using to perform commands on the bastion. - If you are using
SSH Key Authentication
, thenSSH Private Key
should be set to the RSA private Key that you are using to create the SSH connection. This should be the full contents of the key file starting with-----BEGIN RSA PRIVATE KEY-----
and ending with-----END RSA PRIVATE KEY-----
.
Generating an SSH Key Pair
The connector expects an RSA key in PEM format. To generate this key:
_10ssh-keygen -t rsa -m PEM -f myuser_rsa
This produces the private key in pem format, and the public key remains in the standard format used by the authorized_keys
file on your bastion host. The public key should be added to your bastion host to whichever user you want to use. The private key is provided via copy-and-paste to a source configuration screen, so it may log in to the bastion.
Change data capture (CDC)
We use SQL Server's change data capture feature (opens in a new tab) to capture row-level INSERT
, UPDATE
and DELETE
operations that occur on cdc-enabled tables.
CDC setup requires at least db_owner permissions on the database(s) you intend to sync from will be required (detailed below).
- If you need a record of deletions and can accept the limitations of a CDC sync.
- If your data set is small and/or you just want a snapshot of your table, consider using Full Refresh replication for your table instead of CDC.
- If the limitations prevent you from using CDC and your goal is to maintain a snapshot of your table , consider using non-CDC incremental and occasionally reset the data and re-sync.
- If your table has a primary key but doesn't have a reasonable cursor field for incremental syncing (i.e.
updated_at
), CDC allows you to sync your table incrementally.
Parameter | Type | Default | Description |
---|---|---|---|
Data to Sync | Enum: Existing and New , New Changes Only | Existing and New | What data should be synced under the CDC. Existing and New will read existing data as a snapshot, and sync new changes through CDC. New Changes Only will skip the initial snapshot, and only sync new changes through CDC. See documentation here (opens in a new tab) for details. Under the hood, this parameter sets the snapshot.mode in Debezium. |
Snapshot Isolation Level | Enum: Snapshot , Read Committed | Snapshot | Mode to control which transaction isolation level is used and how long the connector locks tables that are designated for capture. If you don't know which one to choose, just use the default one. See documentation here (opens in a new tab) for details. Under the hood, this parameter sets the snapshot.isolation.mode in Debezium. |
1. Enable CDC on database and tables
MS SQL Server provides some built-in stored procedures to enable CDC.
-
To enable CDC, a SQL Server administrator with the necessary privileges (db_owner or sysadmin) must first run a query to enable CDC at the database level.
_10USE {database name}_10GO_10EXEC sys.sp_cdc_enable_db_10GO -
The administrator must then enable CDC for each table that you want to capture. Here's an example:
_10USE {database name}_10GO_10_10EXEC sys.sp_cdc_enable_table_10@source_schema = N'{schema name}',_10@source_name = N'{table name}',_10@role_name = N'{role name}', [1]_10@filegroup_name = N'{fiilegroup name}', [2]_10@supports_net_changes = 0 [3]_10GO- [1] Specifies a role which will gain
SELECT
permission on the captured columns of the source table. We suggest putting a value here so you can use this role in the next step but you can also set the value of @rolename toNULL
to allow only _sysadmin and db_owner to have access. Be sure that the credentials used to connect to the source align with this role so that a source can access the cdc tables. - [2] Specifies the filegroup where SQL Server places the change table. We recommend creating a separate filegroup for CDC but you can leave this parameter out to use the default filegroup.
- [3] If 0, only the support functions to query for all changes are generated. If 1, the functions that are needed to query for net changes are also generated. If supports_net_changes is set to 1, index_name must be specified, or the source table must have a defined primary key.
- [1] Specifies a role which will gain
-
(For more details on parameters, see the Microsoft doc page (opens in a new tab) for this stored procedure).
-
If you have many tables to enable CDC on and would like to avoid having to run this query one-by-one for every table, this script (opens in a new tab) might help!
For further detail, see the Microsoft docs on enabling and disabling CDC (opens in a new tab).
2. Enable snapshot isolation
-
When a sync runs for the first time using CDC, it performs an initial consistent snapshot of your database. To avoid acquiring table locks, a source uses snapshot isolation, allowing simultaneous writes by other database clients. This must be enabled on the database like so:
_10ALTER DATABASE {database name}_10SET ALLOW_SNAPSHOT_ISOLATION ON;
3. Create a user and grant appropriate permissions
-
Rather than use sysadmin or db_owner credentials, we recommend creating a new user with the relevant CDC access. First let's create the login and user and add to the db_datareader (opens in a new tab) role:
_10USE {database name};_10CREATE LOGIN {user name}_10WITH PASSWORD = '{password}';_10CREATE USER {user name} FOR LOGIN {user name};_10EXEC sp_addrolemember 'db_datareader', '{user name}';-
Add the user to the role specified earlier when enabling cdc on the table(s):
_10EXEC sp_addrolemember '{role name}', '{user name}'; -
This should be enough access, but if you run into problems, try also directly granting the user
SELECT
access on the cdc schema:_10USE {database name};_10GRANT SELECT ON SCHEMA :: [cdc] TO {user name}; -
If feasible, granting this user 'VIEW SERVER STATE' permissions will allow to check whether or not the SQL Server Agent (opens in a new tab) is running. This is preferred as it ensures syncs will fail if the CDC tables are not being updated by the Agent in the source database.
_10USE master;_10GRANT VIEW SERVER STATE TO {user name};
-
4. Extend the retention period of CDC data
-
In SQL Server, by default, only three days of data are retained in the change tables. Unless you are running very frequent syncs, we suggest increasing this retention so that in case of a failure in sync or if the sync is paused, there is still some bandwidth to start from the last point in incremental sync.
-
These settings can be changed using the stored procedure sys.sp_cdc_change_job (opens in a new tab) as below:
_10-- we recommend 14400 minutes (10 days) as retention period_10EXEC sp_cdc_change_job @job_type='cleanup', @retention = {minutes} -
After making this change, a restart of the cleanup job is required:
_10 EXEC sys.sp_cdc_stop_job @job_type = 'cleanup';_10_10 EXEC sys.sp_cdc_start_job @job_type = 'cleanup';
5. Ensure the SQL Server Agent is running
- MSSQL uses the SQL Server Agent to run the jobs necessary (opens in a new tab) for CDC. It is therefore vital that the Agent is operational in order for to CDC to work effectively. You can check the status of the SQL Server Agent as follows:
_10 EXEC xp_servicecontrol 'QueryState', N'SQLServerAGENT';
- If you see something other than 'Running.' please follow the Microsoft docs (opens in a new tab) to start the service.
Limitations
- You may run into an issue where the connector provides wrong values for some data types. See discussion (opens in a new tab) on unexpected behaviour for certain datatypes.
- Hierarchyid and sql_variant types are not processed in CDC migration type.
CDC limitations
- There are some critical issues regarding certain datatypes. Please find detailed info in this Github issue (opens in a new tab).
- CDC is only available for SQL Server 2016 Service Pack 1 (SP1) and later.
- db_owner (or higher) permissions are required to perform the neccessary setup for CDC.
- If you set
Initial Snapshot Isolation Level
toSnapshot
, you must enable snapshot isolation mode (opens in a new tab) on the database(s) you want to sync. This is used for retrieving an initial snapshot without locking tables. - For SQL Server Always On read-only replica, only
Snapshot
initial snapshot isolation level is supported. - On Linux, CDC is not supported on versions earlier than SQL Server 2017 CU18 (SQL Server 2019 is supported).
- Change data capture cannot be enabled on tables with a clustered columnstore index. (It can be enabled on tables with a non-clustered columnstore index).
- The SQL Server CDC feature processes changes that occur in user-created tables only. You cannot enable CDC on the SQL Server master database.
- Using variables with partition switching on databases or tables with change data capture (CDC) is not supported for the
ALTER TABLE
...SWITCH TO
...PARTITION
... statement - Implementation has not been tested with managed instances, such as Azure SQL Database.
- If you do want to try this, CDC can only be enabled on Azure SQL databases tiers above Standard 3 (S3+). Basic, S0, S1 and S2 tiers are not supported for CDC.
- CDC implementation uses at least once delivery for all change records.
- Read more on CDC limitations in the Microsoft docs (opens in a new tab).