Postgres
Overview
This page contains the setup guide and reference information for the Postgres source connector.
Features
Feature | Supported | Notes |
---|---|---|
Full Refresh Sync | Yes | |
Incremental Sync | Yes | |
Change Data Capture | Yes | |
Data type mapping
Postgres Type | Resulting Type | Notes |
---|---|---|
bigint | float | |
bigserial, serial8 | float | |
bit | string | Fixed-length bit string (e.g. "0100"). |
bit varying, varbit | string | Variable-length bit string (e.g. "0100"). |
boolean, bool | boolean | |
box | string | |
bytea | string | Variable-length binary string with hex output format prefixed with "\x" (e.g. "\x6b707a"). |
character, char | string | |
character varying, varchar | string | |
cidr | string | |
circle | string | |
date | date | Parsed as an ISO-8601 date-time at midnight. |
double precision, float, float8 | float | Infinity, -Infinity, and NaN are not supported and are converted to null. |
hstore | string | |
inet | string | |
integer, int, int4 | float | |
interval | string | |
json | string | |
jsonb | string | |
line | string | |
lseg | string | |
macaddr | string | |
macaddr8 | string | |
money | float | |
numeric, decimal | float | Infinity, -Infinity, and NaN are not supported and are converted to null. |
path | string | |
pg_lsn | string | |
point | string | |
polygon | string | |
real, float4 | number | |
smallint, int2 | number | |
smallserial, serial2 | number | |
serial, serial4 | number | |
text | string | |
time | string | Parsed as a time string without a time zone in the ISO-8601 calendar system. |
timetz | string | Parsed as a time string with a time zone in the ISO-8601 calendar system. |
timestamp | timestamp | Parsed as a date-time string without a time zone in the ISO-8601 calendar system. |
timestamptz | timestamp | Parsed as a date-time string with a time zone in the ISO-8601 calendar system. |
tsquery | string | |
tsvector | string | |
uuid | string | |
xml | string | |
enum | string | |
tsrange | string | |
array | string | E.g. "["10001","10002","10003","10004"]". |
composite type | string | |
Getting started
Requirements and prerequisites
- Use Postgres v9.3.x or above
- Create a dedicated read-only user
- Whitelist Y42's production IP addresses.
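As a quick sanity check before configuring the connector, you can confirm the server version from any SQL client (a minimal sketch):

```sql
-- Should report 9.3 or above
SHOW server_version;
```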
Setup guide
Step 1: (Optional) Create a dedicated read-only user
We recommend creating a dedicated read-only user for better permission control and auditing. Alternatively, you can use an existing Postgres user in your database.
- To create a dedicated user, run the following command:

  ```sql
  CREATE USER <user_name> PASSWORD 'your_password_here';
  ```

- Grant access to the relevant schema:

  ```sql
  GRANT USAGE ON SCHEMA <schema_name> TO <user_name>;
  ```

- Grant the user read-only access to the relevant tables:

  ```sql
  GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO <user_name>;
  ```

- Allow the user to see tables created in the future:

  ```sql
  ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <user_name>;
  ```
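After running the grants above, you can optionally verify them with Postgres's built-in privilege functions (a sketch; `<user_name>`, `<schema_name>`, and `<table_name>` are placeholders):

```sql
-- Both queries should return true for the read-only user
SELECT has_schema_privilege('<user_name>', '<schema_name>', 'USAGE');
SELECT has_table_privilege('<user_name>', '<schema_name>.<table_name>', 'SELECT');
```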
Step 2: Set up the Postgres source
- Create a new source, select Postgres (Airbyte) from the Source type dropdown, and enter a name for it.
- On the source setup page, create a new Postgres secret and enter a unique name for it.
- For the Host, Port, and DB Name, enter the hostname, port number, and name of your Postgres database.
- For User and Password, enter the username and password you created in Step 1.
- (Optional) If you want to use CDC replication, select Logical Replication (CDC) as the replication method and enter the names of the replication slot and publication you created.
- (Optional) If you want to connect via an SSH tunnel, select an "SSH Tunnel Method": SSH Key Authentication or Password Authentication.
Step 3: (Optional) Set up CDC
Using CDC-based replication is advantageous if your database is very large, or if your table has a primary key but no cursor field for syncing incrementally.
- Turn on logical replication for your Postgres database by setting wal_level to logical in your postgresql.conf file:
  - If your database is hosted on Google Cloud SQL, you need to set the cloudsql.logical_decoding flag to on.
  - If your database is hosted on AWS, you need to set the rds.logical_replication parameter to 1.
- Make sure to execute the following Postgres commands with a user that has the REPLICATION role:

  ```sql
  ALTER USER <user_name> REPLICATION;
  ```

- Create a replication slot:

  ```sql
  SELECT pg_create_logical_replication_slot('slot1', 'pgoutput');
  ```

- Create a publication including all tables you want to sync:

  ```sql
  CREATE PUBLICATION pub1 FOR TABLE <table1>, <table2>;
  ```
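To confirm the CDC prerequisites are in place, you can inspect the server from a SQL client (a sketch, assuming the slot and publication names used above):

```sql
-- Should return 'logical'
SHOW wal_level;
-- The slot created above should be listed
SELECT slot_name, plugin, active FROM pg_replication_slots;
-- Lists the tables covered by the publication
SELECT * FROM pg_publication_tables WHERE pubname = 'pub1';
```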
Step 4: (Optional) Connect via SSH tunnel
In the case of an SSH connection, select an "SSH Tunnel Method" other than "No Tunnel" and configure the SSH tunneling options:
- For SSH Tunnel Jump Server Host, enter the hostname or IP address for the intermediate (bastion) server that Y42 will connect to.
- For SSH Connection Port, enter the port on the bastion server. The default port for SSH connections is 22.
- For SSH Login Username, enter the username to use when connecting to the bastion server. Note: This is the operating system username and not the Postgres username.
- For authentication:
  - If you selected SSH Key Authentication, set the SSH Private Key to the private key that you are using to create the SSH connection. To generate a private key for SSH tunneling, execute the following command:

    ```shell
    ssh-keygen -t rsa -m PEM -f myuser_rsa
    ```

    The command produces the private key in PEM format, while the public key remains in the standard format used by the authorized_keys file on your bastion server. Add the public key to the bastion host for the user you want to use for the data sync. The private key is provided via copy-and-paste to the configuration screen.
  - If you selected Password Authentication, enter the password for the operating system user to connect to the bastion server. Note: This is the operating system password and not the Postgres password.
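To verify the bastion setup independently of Y42, you can open an equivalent tunnel manually; this sketch assumes placeholder hostnames and the key generated above:

```shell
# Forward local port 5432 through the bastion to the Postgres host,
# then connect to localhost:5432 with your usual SQL client
ssh -i myuser_rsa -p 22 -N -L 5432:<postgres-host>:5432 <ssh_user>@<bastion-host>
```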
Limitations
- The Postgres source connector currently does not handle schemas larger than 4MB.
- The Postgres source connector does not alter the schema present in your database. Depending on the destination connected to this source, however, the schema may be altered. See the destination's documentation for more details.
- The following two schema evolution actions are currently supported:
- Adding/removing tables without resetting the entire connection at the destination
- Resetting a single table within the connection without resetting the rest of the destination tables in that connection
- Changing a column data type or removing a column might break connections.
- Syncing a subset of columns: currently, there is no way to sync a subset of columns using the Postgres source connector. If the DB user can only access a subset of columns, the connection check will pass. However, the data sync will fail with a permission denied exception. The workaround for partial table syncing is to create a view on the specific columns, and grant the user read access to that view.
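The view workaround described above might look like the following sketch, with hypothetical table and column names:

```sql
-- Expose only the permitted columns through a view...
CREATE VIEW <schema_name>.orders_subset AS
  SELECT id, status, created_at
  FROM <schema_name>.orders;

-- ...and grant the connector's user read access to the view only
GRANT SELECT ON <schema_name>.orders_subset TO <user_name>;
```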
- For CDC, you must connect to primary/master databases. Pointing the connector configuration to replica database hosts for CDC will lead to failures.
Troubleshooting
Sync data from Postgres hot standby server
When the connector is reading from a Postgres replica that is configured as a hot standby, any update from the primary server will terminate queries on the replica after a certain amount of time, which defaults to 30 seconds. This default waiting time is not enough to sync any meaningful amount of data. See the Handling Query Conflicts section in the Postgres documentation for a detailed explanation.
Here is the typical exception:
```
Caused by: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery
  Detail: User query might have needed to see row versions that must be removed.
  Hint: In a moment you should be able to reconnect to the database and repeat your command.
```
Possible solutions include:
- [Recommended] Set hot_standby_feedback to true on the replica server. This parameter prevents the primary server from deleting the write-ahead logs while the replica is busy serving user queries. The downside is that the write-ahead log will grow in size.
- [Recommended] Sync data when there is no update running on the primary server, or sync data from the primary server instead.
- [Not Recommended] Increase max_standby_archive_delay and max_standby_streaming_delay to be larger than the amount of time needed to complete the data sync. However, it is usually hard to tell how much time it will take to sync all the data, so this approach is not very practical.
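The recommended setting can be applied on the replica as follows (a sketch; requires superuser privileges):

```sql
-- On the hot standby replica
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();  -- picks up the change without a restart
```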
Temporary File Size Limit
Some larger tables may encounter an error related to the temporary file size limit, such as temporary file size exceeds temp_file_limit. To correct this error, increase the temp_file_limit setting.
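For example, the limit can be raised as follows (a sketch; choose a value appropriate for your disk capacity):

```sql
-- Applies after a configuration reload; -1 removes the limit entirely
ALTER SYSTEM SET temp_file_limit = '4GB';
SELECT pg_reload_conf();
```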