Postgres
Overview
This page contains the setup guide and reference information for the Postgres source connector.
Features
Feature | Supported | Notes |
---|---|---|
Full Refresh Sync | Yes | |
Incremental Sync | Yes | |
Change Data Capture | Yes | |
Data type mapping
Postgres Type | Resulting Type | Notes |
---|---|---|
bigint | float | |
bigserial, serial8 | float | |
bit | string | Fixed-length bit string (e.g. "0100"). |
bit varying, varbit | string | Variable-length bit string (e.g. "0100"). |
boolean, bool | boolean | |
box | string | |
bytea | string | Variable-length binary string with hex output format prefixed with "\x" (e.g. "\x6b707a"). |
character, char | string | |
character varying, varchar | string | |
cidr | string | |
circle | string | |
date | date | Parsed as an ISO-8601 date-time at midnight. |
double precision, float, float8 | float | Infinity, -Infinity, and NaN are not supported and are converted to null. |
hstore | string | |
inet | string | |
integer, int, int4 | float | |
interval | string | |
json | string | |
jsonb | string | |
line | string | |
lseg | string | |
macaddr | string | |
macaddr8 | string | |
money | float | |
numeric, decimal | float | Infinity, -Infinity, and NaN are not supported and are converted to null. |
path | string | |
pg_lsn | string | |
point | string | |
polygon | string | |
real, float4 | number | |
smallint, int2 | number | |
smallserial, serial2 | number | |
serial, serial4 | number | |
text | string | |
time | string | Parsed as a time string without a time zone in the ISO-8601 calendar system. |
timetz | string | Parsed as a time string with a time zone in the ISO-8601 calendar system. |
timestamp | timestamp | Parsed as a date-time string without a time zone in the ISO-8601 calendar system. |
timestamptz | timestamp | Parsed as a date-time string with a time zone in the ISO-8601 calendar system. |
tsquery | string | |
tsvector | string | |
uuid | string | |
xml | string | |
enum | string | |
tsrange | string | |
array | string | E.g. "["10001","10002","10003","10004"]". |
composite type | string | |
Getting started
Requirements and prerequisites
- Use Postgres v9.3.x or above
- Create a dedicated read-only user
- Whitelist Y42's production IP addresses.
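As a quick sanity check before configuring the connector, you can confirm the server version from any SQL client (a minimal sketch):

```sql
-- Should report 9.3 or above
SHOW server_version;
```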
Setup guide
Step 1: (Optional) Create a dedicated read-only user
We recommend creating a dedicated read-only user for better permission control and auditing. Alternatively, you can use an existing Postgres user in your database.
- To create a dedicated user, run the following command:

  ```sql
  CREATE USER <user_name> PASSWORD 'your_password_here';
  ```

- Grant access to the relevant schema:

  ```sql
  GRANT USAGE ON SCHEMA <schema_name> TO <user_name>;
  ```

- Grant the user read-only access to the relevant tables:

  ```sql
  GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO <user_name>;
  ```

- Allow the user to see tables created in the future:

  ```sql
  ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <user_name>;
  ```
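After running the grants above, you can optionally verify them with Postgres's built-in privilege functions (a sketch; `<user_name>`, `<schema_name>`, and `<table_name>` are placeholders):

```sql
-- Both queries should return true for the read-only user
SELECT has_schema_privilege('<user_name>', '<schema_name>', 'USAGE');
SELECT has_table_privilege('<user_name>', '<schema_name>.<table_name>', 'SELECT');
```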
Step 2: Set up the Postgres source
- Create a new source, select Postgres (Airbyte) from the Source type dropdown, and enter a name for it.
- On the source setup page, create a new Postgres secret and enter a unique name for it.
- For the Host, Port, and DB Name, enter the hostname, port number, and name of your Postgres database.
- For User and Password, enter the username and password you created in Step 1.
- (Optional) If you want to use CDC replication, select Logical Replication (CDC) as the replication method and enter the names of the replication slot and publication you created.
- (Optional) If you want to connect via an SSH tunnel, select an "SSH Tunnel Method": SSH Key Authentication or Password Authentication.
Step 3: (Optional) Set up CDC
Using CDC-based replication is advantageous if your database is very large, or if your table has a primary key but no cursor field for syncing incrementally.
- Turn on logical replication for your Postgres database by setting wal_level to logical in your postgresql.conf file:
  - If your database is hosted on Google Cloud SQL, you need to set the cloudsql.logical_decoding flag to on.
  - If your database is hosted on AWS, you need to set the rds.logical_replication parameter to 1.
- Make sure to execute the following Postgres commands with a user that has the REPLICATION role:

  ```sql
  ALTER USER <user_name> REPLICATION;
  ```

- Create a replication slot:

  ```sql
  SELECT pg_create_logical_replication_slot('slot1', 'pgoutput');
  ```

- Create a publication including all tables you want to sync:

  ```sql
  CREATE PUBLICATION pub1 FOR TABLE <table1>, <table2>;
  ```
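To confirm the CDC prerequisites are in place, you can inspect the server from a SQL client (a sketch, assuming the slot and publication names used above):

```sql
-- Should return 'logical'
SHOW wal_level;
-- The slot created above should be listed
SELECT slot_name, plugin, active FROM pg_replication_slots;
-- Lists the tables covered by the publication
SELECT * FROM pg_publication_tables WHERE pubname = 'pub1';
```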
Step 4: (Optional) Connect via SSH tunnel
In the case of an SSH connection, select an "SSH Tunnel Method" other than "No Tunnel" and configure the SSH tunneling options:
- For SSH Tunnel Jump Server Host, enter the hostname or IP address for the intermediate (bastion) server that Y42 will connect to.
- For SSH Connection Port, enter the port on the bastion server. The default port for SSH connections is 22.
- For SSH Login Username, enter the username to use when connecting to the bastion server. Note: This is the operating system username and not the Postgres username.
- For authentication:
  - If you selected SSH Key Authentication, set the SSH Private Key to the private key that you are using to create the SSH connection. To generate a private key for SSH tunneling, execute the following command:

    ```shell
    ssh-keygen -t rsa -m PEM -f myuser_rsa
    ```

    The command produces the private key in PEM format, while the public key remains in the standard format used by the authorized_keys file on your bastion server. Add the public key to the bastion host for the user you want to use for the data sync. The private key is provided via copy-and-paste to the configuration screen.
  - If you selected Password Authentication, enter the password for the operating system user to connect to the bastion server. Note: This is the operating system password and not the Postgres password.
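To verify the bastion setup independently of Y42, you can open an equivalent tunnel manually; this sketch assumes placeholder hostnames and the key generated above:

```shell
# Forward local port 5432 through the bastion to the Postgres host,
# then connect to localhost:5432 with your usual SQL client
ssh -i myuser_rsa -p 22 -N -L 5432:<postgres-host>:5432 <ssh_user>@<bastion-host>
```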
Limitations
- The Postgres source connector currently does not handle schemas larger than 4MB.
- The Postgres source connector does not alter the schema present in your database. Depending on the destination connected to this source, however, the schema may be altered. See the destination's documentation for more details.
- The following two schema evolution actions are currently supported:
- Adding/removing tables without resetting the entire connection at the destination
- Resetting a single table within the connection without resetting the rest of the destination tables in that connection
- Changing a column data type or removing a column might break connections.
- Syncing a subset of columns: currently, there is no way to sync a subset of columns using the Postgres source connector. If the DB user can only access a subset of columns, the connection check will pass. However, the data sync will fail with a permission denied exception. The workaround for partial table syncing is to create a view on the specific columns, and grant the user read access to that view.
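The view workaround described above might look like the following sketch, with hypothetical table and column names:

```sql
-- Expose only the permitted columns through a view...
CREATE VIEW <schema_name>.orders_subset AS
  SELECT id, status, created_at
  FROM <schema_name>.orders;

-- ...and grant the connector's user read access to the view only
GRANT SELECT ON <schema_name>.orders_subset TO <user_name>;
```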
- For CDC, you must connect to primary/master databases. Pointing the connector configuration to replica database hosts for CDC will lead to failures.
Troubleshooting
Sync data from Postgres hot standby server
When the connector is reading from a Postgres replica that is configured as a hot standby, any update from the primary server will terminate queries on the replica after a certain amount of time, which defaults to 30 seconds. This default waiting time is not enough to sync any meaningful amount of data. See the Handling Query Conflicts section in the Postgres documentation for a detailed explanation.
Here is the typical exception:
```
Caused by: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery
  Detail: User query might have needed to see row versions that must be removed.
  Hint: In a moment you should be able to reconnect to the database and repeat your command.
```
Possible solutions include:
- [Recommended] Set hot_standby_feedback to true on the replica server. This parameter prevents the primary server from deleting the write-ahead logs while the replica is busy serving user queries. The downside is that the write-ahead log will grow in size.
- [Recommended] Sync data when there is no update running on the primary server, or sync data from the primary server instead.
- [Not Recommended] Increase max_standby_archive_delay and max_standby_streaming_delay to be larger than the amount of time needed to complete the data sync. However, it is usually hard to tell how much time it will take to sync all the data, so this approach is not very practical.
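The recommended setting can be applied on the replica as follows (a sketch; requires superuser privileges):

```sql
-- On the hot standby replica
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();  -- picks up the change without a restart
```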
Temporary File Size Limit
Some larger tables may encounter an error related to the temporary file size limit, such as temporary file size exceeds temp_file_limit. To correct this error, increase the temp_file_limit setting.
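For example, the limit can be raised as follows (a sketch; choose a value appropriate for your disk capacity):

```sql
-- Applies after a configuration reload; -1 removes the limit entirely
ALTER SYSTEM SET temp_file_limit = '4GB';
SELECT pg_reload_conf();
```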