Postgres

Overview

This page contains the setup guide and reference information for the Postgres source connector.

Features

| Feature | Supported | Notes |
| ------- | --------- | ----- |
| Full Refresh Sync | Yes | |
| Incremental Sync | Yes | |
| Change Data Capture | Yes | |

Data type mapping

| Postgres Type | Resulting Type | Notes |
| ------------- | -------------- | ----- |
| bigint | float | |
| bigserial, serial8 | float | |
| bit | string | Fixed-length bit string (e.g. "0100"). |
| bit varying, varbit | string | Variable-length bit string (e.g. "0100"). |
| boolean, bool | boolean | |
| box | string | |
| bytea | string | Variable-length binary string with hex output format, prefixed with "\x" (e.g. "\x6b707a"). |
| character, char | string | |
| character varying, varchar | string | |
| cidr | string | |
| circle | string | |
| date | date | Parsed as an ISO 8601 date-time at midnight. |
| double precision, float, float8 | float | Infinity, -Infinity, and NaN are not supported and are converted to null. |
| hstore | string | |
| inet | string | |
| integer, int, int4 | float | |
| interval | string | |
| json | string | |
| jsonb | string | |
| line | string | |
| lseg | string | |
| macaddr | string | |
| macaddr8 | string | |
| money | float | |
| numeric, decimal | float | Infinity, -Infinity, and NaN are not supported and are converted to null. |
| path | string | |
| pg_lsn | string | |
| point | string | |
| polygon | string | |
| real, float4 | number | |
| smallint, int2 | number | |
| smallserial, serial2 | number | |
| serial, serial4 | number | |
| text | string | |
| time | string | Parsed as a time string without a time zone in the ISO-8601 calendar system. |
| timetz | string | Parsed as a time string with a time zone in the ISO-8601 calendar system. |
| timestamp | timestamp | Parsed as a date-time string without a time zone in the ISO-8601 calendar system. |
| timestamptz | timestamp | Parsed as a date-time string with a time zone in the ISO-8601 calendar system. |
| tsquery | string | |
| tsvector | string | |
| uuid | string | |
| xml | string | |
| enum | string | |
| tsrange | string | |
| array | string | E.g. "["10001","10002","10003","10004"]". |
| composite type | string | |

Getting started

Requirements and prerequisites

  • Use Postgres v9.3.x or above
  • Create a dedicated read-only user
  • Whitelist the Y42 production IP address, which is 35.198.72.34.
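To check the version requirement, you can run a quick query from psql or any SQL client connected to the database:

```sql
-- Verify the server meets the minimum version requirement (9.3 or above)
SELECT version();
```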

Setup guide

Step 1: (Optional) Create a dedicated read-only user

We recommend creating a dedicated read-only user for better permission control and auditing. Alternatively, you can use an existing Postgres user in your database.

  1. To create a dedicated user, run the following command:


    ```sql
    CREATE USER <user_name> PASSWORD 'your_password_here';
    ```

  2. Grant access to the relevant schema:


    ```sql
    GRANT USAGE ON SCHEMA <schema_name> TO <user_name>;
    ```

  3. Grant the user read-only access to the relevant tables:


    ```sql
    GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO <user_name>;
    ```

  4. Allow the user to see tables created in the future:

    ```sql
    ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <user_name>;
    ```
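To confirm the grants took effect, you can query the standard information_schema views. The placeholders below are the same ones used in the commands above:

```sql
-- List the table privileges granted to the new user in the schema
SELECT table_schema, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE grantee = '<user_name>'
  AND table_schema = '<schema_name>';
```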

Step 2: Set up the Postgres source

  1. Create a new source, select Postgres (Airbyte) from the Source type dropdown, and enter a name for it.
  2. On the source setup page, create a new Postgres secret and enter a unique name for it.
  3. For Host, Port, and DB Name, enter the hostname, port number, and database name of your Postgres database.
  4. For User and Password, enter the username and password you created in Step 1.
  5. (Optional) If you want to use CDC replication, select Logical Replication (CDC) as the replication method and enter the names of the replication slot and publication you created.
  6. (Optional) If you want to connect via an SSH tunnel, select an "SSH Tunnel Method": SSH Key Authentication or Password Authentication.

Step 3: (Optional) Set up CDC

Using CDC-based replication is advantageous if your database is very large and your tables have a primary key but no cursor field for syncing incrementally.

  1. Turn on logical replication for your Postgres database by setting wal_level to logical in your postgresql.conf file:

    • If your database is hosted on Google Cloud SQL you need to set the cloudsql.logical_decoding flag to on.
    • If your database is hosted on AWS you need to set the rds.logical_replication parameter to 1.
  2. Grant the user the REPLICATION role by executing the following command as a superuser:

    ```sql
    ALTER USER <user_name> REPLICATION;
    ```

  3. Create a replication slot:

    ```sql
    SELECT pg_create_logical_replication_slot('slot1', 'pgoutput');
    ```

  4. Create a publication including all tables you want to sync:

    ```sql
    CREATE PUBLICATION pub1 FOR TABLE <table1, table2>;
    ```
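To sanity-check the CDC setup, you can inspect the relevant setting and catalogs. The slot and publication names below are the example names from the steps above:

```sql
SHOW wal_level;                         -- should return 'logical'
SELECT slot_name, plugin, active
FROM pg_replication_slots;              -- should list 'slot1' with plugin 'pgoutput'
SELECT * FROM pg_publication_tables
WHERE pubname = 'pub1';                 -- should list the tables you added
```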

Step 4: (Optional) Connect via SSH tunnel

To connect through an SSH tunnel, select an "SSH Tunnel Method" other than "No Tunnel" and configure the tunnel:

  1. For SSH Tunnel Jump Server Host, enter the hostname or IP address for the intermediate (bastion) server that Y42 will connect to.
  2. For SSH Connection Port, enter the port on the bastion server. The default port for SSH connections is 22.
  3. For SSH Login Username, enter the username to use when connecting to the bastion server. Note: This is the operating system username and not the Postgres username.
  4. For authentication:
    • If you selected SSH Key Authentication, set the SSH Private Key to the private key that you are using to create the SSH connection. To generate a private key for SSH Tunneling, execute the following command:

      ```shell
      ssh-keygen -t rsa -m PEM -f myuser_rsa
      ```

      The command produces the private key in PEM format, while the public key remains in the standard format used by the authorized_keys file on your bastion server. Add the public key to the authorized_keys file of the bastion-host user you want to use for the data sync. The private key is provided via copy-and-paste in the configuration screen.
    • If you selected Password Authentication, enter the password for the operating system user to connect to the bastion server. Note: This is the operating system password and not the Postgres password.
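To verify the tunnel independently of the connector, you can forward a local port through the bastion and connect with psql. The host names and user names below are placeholders for your own values:

```shell
# Forward local port 15432 through the bastion to the Postgres server
ssh -i myuser_rsa -L 15432:<postgres_host>:5432 -p 22 <ssh_user>@<bastion_host> -N

# In a second terminal, connect through the tunnel
psql "host=localhost port=15432 dbname=<db_name> user=<user_name>"
```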

Limitations

  • The Postgres source connector currently does not handle schemas larger than 4MB.
  • The Postgres source connector does not alter the schema present in your database. Depending on the destination connected to this source, however, the schema may be altered. See the destination's documentation for more details.
  • The following two schema evolution actions are currently supported:
    • Adding/removing tables without resetting the entire connection at the destination
    • Resetting a single table within the connection without resetting the rest of the destination tables in that connection
  • Changing a column data type or removing a column might break connections.
  • Syncing a subset of columns: currently, there is no way to sync a subset of columns using the Postgres source connector. If the DB user can only access a subset of columns, the connection check will pass. However, the data sync will fail with a permission denied exception. The workaround for partial table syncing is to create a view on the specific columns, and grant the user read access to that view.
  • For CDC, you must connect to primary/master databases. Pointing the connector configuration to replica database hosts for CDC will lead to failures.
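The view workaround for partial table syncing mentioned above might look like the following sketch; the table and column names are illustrative:

```sql
-- Expose only the permitted columns through a view
CREATE VIEW <schema_name>.orders_limited AS
SELECT order_id, order_date, status
FROM <schema_name>.orders;

-- Grant the connector's user read access to the view only
GRANT SELECT ON <schema_name>.orders_limited TO <user_name>;
```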

Troubleshooting

Sync data from Postgres hot standby server

When the connector reads from a Postgres replica configured as a hot standby, any update from the primary server will terminate queries on the replica after a certain amount of time, which defaults to 30 seconds. This default waiting time is not enough to sync any meaningful amount of data. See the Handling Query Conflicts section in the Postgres documentation for a detailed explanation.

Here is the typical exception:


```
Caused by: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery
  Detail: User query might have needed to see row versions that must be removed.
  Hint: In a moment you should be able to reconnect to the database and repeat your command.
```

Possible solutions include:

  • [Recommended] Set hot_standby_feedback to true on the replica server. This parameter prevents the primary server from deleting the write-ahead logs while the replica is busy serving user queries. The downside is that the write-ahead log will grow in size.
  • [Recommended] Sync data when there is no update running in the primary server, or sync data from the primary server.
  • [Not Recommended] Increase max_standby_archive_delay and max_standby_streaming_delay to be larger than the time needed to complete the data sync. However, it is usually hard to predict how long a full sync will take, so this approach is not very practical.
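The recommended hot_standby_feedback change can be applied on the replica without a restart, since the parameter only requires a configuration reload:

```sql
-- Run on the replica as a superuser
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
SHOW hot_standby_feedback;  -- should now return 'on'
```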

Temporary File Size Limit

Some larger tables may encounter an error related to the temporary file size limit, such as "temporary file size exceeds temp_file_limit". To correct this error, increase the temp_file_limit parameter.
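For example, to raise the limit to 5 GB (choose a value appropriate for your disk capacity; the change requires a superuser):

```sql
SHOW temp_file_limit;                      -- '-1' means no limit
ALTER SYSTEM SET temp_file_limit = '5GB';
SELECT pg_reload_conf();
```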