File import

Overview

This page contains the setup guide and reference information for the File import source connector.

Features

| Feature | Supported | Notes |
| :--- | :--- | :--- |
| Full Refresh Sync | Yes | |
| Incremental Sync | No | |
| Replicate Incremental Deletes | No | |
| Replicate Folders (multiple Files) | No | |
| Replicate Glob Patterns (multiple Files) | No | |

File / Stream Compression

| Compression | Supported? |
| :--- | :--- |
| Gzip | Yes |
| Zip | No |
| Bzip2 | No |
| Lzma | No |
| Xz | No |
| Snappy | No |

Storage Providers

| Storage Providers | Supported? |
| :--- | :--- |
| HTTPS | Yes |
| Google Cloud Storage | Yes |
| Amazon Web Services S3 | Yes |
| SFTP | Yes |
| SSH / SCP | Yes |
| Local filesystem | No |

File Formats

| Format | Supported? |
| :--- | :--- |
| CSV | Yes |
| JSON | Yes |
| HTML | No |
| XML | No |
| Excel | Yes |
| Excel Binary Workbook | Yes |
| Feather | Yes |
| Parquet | Yes |
| Pickle | No |
| YAML | Yes |

Getting started

Requirements

To use the File import source, ensure you have the following:

  • File access URL
  • File format
  • Reader configurations
  • Storage Providers

See the setup guide for more information about how to create the required resources.

Setting up the File import source

  1. Select the File Import source.
  2. Specify the Storage provider, URL, and the File format.
  3. Depending on the selected Storage provider, configure any additional required settings.

Fields description

  • For Dataset Name, enter the desired name of the final table where this file will be replicated (only letters, numbers, dashes, and underscores are allowed).
  • For File Format, select the format of the file you want to replicate.
  • For Reader Options, enter a JSON-formatted string that provides additional configuration depending on the chosen file format. For instance, use {} for no extra options, or {"sep": " "} to set a single space ' ' as the separator.
  • For URL, provide the URL path (not a directory path) to the file that needs to be replicated (use s3:// for S3 and gs:// for GCS, respectively).
  • For Storage Provider, specify the storage provider or location of the file(s) to be replicated (see the example sketch after this list). Options include:
    • Public Web [Default]
      • Activate User-Agent if you want to include a User-Agent header in your requests
    • GCS: Google Cloud Storage
      • [Private buckets] Provide service account JSON credentials with the right permissions as described here. Generate the credentials.json file and copy its contents into this field (JSON format expected).
      • [Public buckets] No configurations are required.
    • S3: Amazon Web Services
      • [Private buckets] Provide the AWS Access Key ID and the AWS Secret Access Key.
      • [Public buckets] No further configurations are required.
    • AzBlob: Azure Blob Storage
    • SSH: Secure Shell
      • Provide username, password, host, and port for your host.
    • SCP: Secure copy protocol
      • Provide username, password, host, and port for your host.
    • SFTP: Secure File Transfer Protocol
      • Provide username, password, host, and port for your host.
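
To tie these fields together, here is a minimal sketch of one possible set of field values, reusing the public epidemiology CSV from the examples below. The keys simply mirror the UI labels described above; they are illustrative and not necessarily the connector's exact configuration property names.

```python
# Illustrative only: keys mirror the UI field labels above, not necessarily the
# connector's exact configuration schema.
example_file_source_settings = {
    "dataset_name": "epidemiology",        # name of the final table (letters, numbers, dashes, underscores)
    "file_format": "csv",                  # one of the supported file formats
    "reader_options": '{"sep": ","}',      # JSON string passed to the underlying reader
    "url": "https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv",
    "storage_provider": "HTTPS: Public Web",  # provider-specific settings (credentials, host, ...) would follow
}
```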

Reader options

The reading mechanism of this source connector is based on Pandas IO Tools, which provides a flexible and efficient way to load files into a Pandas DataFrame. This behavior can be customized using reader_options, a JSON-formatted parameter whose available options vary according to the selected file format. Refer to the respective Pandas documentation for the specific options available for each format.

For instance, if you select the CSV format, you can use options from the read_csv function. This includes options such as:

  • Customizing the delimiter (or sep) for tab-separated files.
  • Ignoring or customizing the header line with header=0 and names respectively.
  • Parsing dates in specified columns.

Here's an example of how you might use reader_options for a CSV file:


```json
{ "sep": "\t", "header": 0, "names": ["column1", "column2"], "parse_dates": ["column2"] }
```
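
Because the reader is built on Pandas IO Tools, these options map directly onto keyword arguments of pandas.read_csv. As a rough local illustration (the URL below is a placeholder, not a real dataset), the same options would behave like this in plain pandas:

```python
import pandas as pd

# Placeholder URL: any tab-separated file reachable over HTTPS would do.
df = pd.read_csv(
    "https://example.com/data.tsv",
    sep="\t",                      # values are separated by tabs
    header=0,                      # treat the first row as the header line
    names=["column1", "column2"],  # replace the header names with our own
    parse_dates=["column2"],       # parse this column as datetimes
)
print(df.dtypes)
```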

If you opt for the JSON format, options from the read_json reader are available. For example, you can change the orientation of loaded data using {"orient" : "records"} (where the data format is [{column -> value}, … , {column -> value}]).
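
For reference, this is roughly what the records orientation means in plain pandas; the inline records below are invented for illustration:

```python
import io

import pandas as pd

# Records orientation: a list of {column -> value} objects (made-up sample data).
records_json = '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]'
df = pd.read_json(io.StringIO(records_json), orient="records")
print(df)
#    id   name
# 0   1  alpha
# 1   2   beta
```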

If you want to read an Excel Binary Workbook, specify the excel_binary format in the File Format selection.

Modifying source column data types

By default, Airbyte attempts to determine the data type of each source column automatically. However, using the reader_options, you can manually override these inferred data types.

For example, to interpret all columns as strings, you can use {"dtype":"string"}. If you want to specify a string data type for a particular column, use the following format: {"dtype" : {"column name": "string"}}. This will force the named column to be parsed as a string, regardless of the data it contains.
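
As an illustration of the difference this makes (the sample data is invented for the example), compare pandas' default type inference with an explicit dtype override:

```python
from io import StringIO

import pandas as pd

# Invented sample: leading zeros are lost if the column is inferred as an integer.
csv_data = "zip_code,amount\n01234,10.5\n98765,3.2\n"

inferred = pd.read_csv(StringIO(csv_data))
overridden = pd.read_csv(StringIO(csv_data), dtype={"zip_code": "string"})

print(inferred["zip_code"].tolist())    # [1234, 98765] -- parsed as integers
print(overridden["zip_code"].tolist())  # ['01234', '98765'] -- kept as strings
```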

Examples

Below are some examples of possible file inputs:

| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
| :--- | :--- | :--- | :--- | :--- | :--- |
| epidemiology | HTTPS | https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv | | | COVID-19 Public dataset on BigQuery |
| hr_and_financials | GCS | gs://airbyte-vault/financial.csv | smart_open or gcfs | {"type": "service_account", "private_key_id": "XXXXXXXX", ...} | Data from a private bucket, a service account is necessary |
| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) |

Examples with reader options:

| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
| :--- | :--- | :--- | :--- | :--- | :--- |
| landsat_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to specify a compression option to read_csv |
| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | TSV data separated by tabs without a header row, from AWS Open Data |

Example for SFTP:

| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": ["text"], "engine": "python"} | We use the python engine for read_csv in order to handle a delimiter of more than one character while providing our own column names |

Performance considerations

It's important to note that the JSON implementation may require adjustments to handle more complex catalogs, as it is currently in an experimental state. While simple JSON schemas should function properly, handling of multi-layered nesting may not be optimal.