File import

Overview

This page contains the setup guide and reference information for the File import source connector.

Features

| Feature | Supported | Notes |
| :--- | :--- | :--- |
| Full Refresh Sync | Yes | |
| Incremental Sync | No | |
| Replicate Incremental Deletes | No | |
| Replicate Folders (multiple Files) | No | |
| Replicate Glob Patterns (multiple Files) | No | |

File / Stream Compression

| Compression | Supported? |
| :--- | :--- |
| Gzip | Yes |
| Zip | No |
| Bzip2 | No |
| Lzma | No |
| Xz | No |
| Snappy | No |

Storage Providers

| Storage Providers | Supported? |
| :--- | :--- |
| HTTPS | Yes |
| Google Cloud Storage | Yes |
| Amazon Web Services S3 | Yes |
| SFTP | Yes |
| SSH / SCP | Yes |
| Local filesystem | No |

File Formats

| Format | Supported? |
| :--- | :--- |
| CSV | Yes |
| JSON | Yes |
| HTML | No |
| XML | No |
| Excel | Yes |
| Excel Binary Workbook | Yes |
| Feather | Yes |
| Parquet | Yes |
| Pickle | No |
| YAML | Yes |

Getting started

Requirements

To use the File import source, ensure you have the following:

  • File access URL
  • File format
  • Reader configurations
  • Storage Providers

See the setup guide for more information about how to create the required resources.

Setting up the File import source

  1. Select the File Import source.
  2. Specify the Storage provider, URL, and the File format.
  3. Depending on the selected Storage provider, configure any additional required settings.

Fields description

  • For Dataset Name, enter the desired name of the final table where this file will be replicated (only letters, numbers, dashes, and underscores are allowed).
  • For File Format, select the format of the file you want to replicate.
  • For Reader Options, enter a JSON-formatted string that provides additional configuration depending on the chosen file format. For instance, use {} for no extra options, or {"sep": " "} to set a single space ' ' as the separator.
  • For URL, provide the URL path (not a directory path) to the file that needs to be replicated (use s3:// for S3 and gs:// for GCS, respectively).
  • For Storage Provider, specify the storage provider or location of the file(s) to be replicated (see the example sketch after this list). Options include:
    • Public Web [Default]
      • Activate User-Agent if you want to include a User-Agent header in your requests
    • GCS: Google Cloud Storage
      • [Private buckets] Provide service account JSON credentials with the right permissions as described here. Generate the credentials.json file and copy its contents into this field (JSON format expected).
      • [Public buckets] No configurations are required.
    • S3: Amazon Web Services
      • [Private buckets] Provide the AWS Access Key ID and the AWS Secret Access Key.
      • [Public buckets] No further configurations are required.
    • AzBlob: Azure Blob Storage
    • SSH: Secure Shell
      • Provide username, password, host, and port for your host.
    • SCP: Secure copy protocol
      • Provide username, password, host, and port for your host.
    • SFTP: Secure File Transfer Protocol
      • Provide username, password, host, and port for your host.
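
To tie these fields together, here is a minimal sketch of one possible set of field values, reusing the public epidemiology CSV from the examples below. The keys simply mirror the UI labels described above; they are illustrative and not necessarily the connector's exact configuration property names.

```python
# Illustrative only: keys mirror the UI field labels above, not necessarily the
# connector's exact configuration schema.
example_file_source_settings = {
    "dataset_name": "epidemiology",        # name of the final table (letters, numbers, dashes, underscores)
    "file_format": "csv",                  # one of the supported file formats
    "reader_options": '{"sep": ","}',      # JSON string passed to the underlying reader
    "url": "https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv",
    "storage_provider": "HTTPS: Public Web",  # provider-specific settings (credentials, host, ...) would follow
}
```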

Reader options

The reading mechanism of this source connector is based on Pandas IO Tools, which provides a flexible and efficient way to load files into a Pandas DataFrame. This behavior can be customized using reader_options, a JSON-formatted parameter whose available options vary according to the selected file format. Refer to the respective Pandas documentation for the specific options available for each format.

For instance, if you select the CSV format, you can use options from the read_csv function. This includes options such as:

  • Customizing the delimiter (or sep) for tab-separated files.
  • Ignoring or customizing the header line with header=0 and names respectively.
  • Parsing dates in specified columns.

Here's an example of how you might use reader_options for a CSV file:


```json
{ "sep": "\t", "header": 0, "names": ["column1", "column2"], "parse_dates": ["column2"] }
```
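
Because the reader is built on Pandas IO Tools, these options map directly onto keyword arguments of pandas.read_csv. As a rough local illustration (the URL below is a placeholder, not a real dataset), the same options would behave like this in plain pandas:

```python
import pandas as pd

# Placeholder URL: any tab-separated file reachable over HTTPS would do.
df = pd.read_csv(
    "https://example.com/data.tsv",
    sep="\t",                      # values are separated by tabs
    header=0,                      # treat the first row as the header line
    names=["column1", "column2"],  # replace the header names with our own
    parse_dates=["column2"],       # parse this column as datetimes
)
print(df.dtypes)
```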

If you opt for the JSON format, options from the read_json reader are available. For example, you can change the orientation of loaded data using {"orient" : "records"} (where the data format is [{column -> value}, … , {column -> value}]).
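
For reference, this is roughly what the records orientation means in plain pandas; the inline records below are invented for illustration:

```python
import io

import pandas as pd

# Records orientation: a list of {column -> value} objects (made-up sample data).
records_json = '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]'
df = pd.read_json(io.StringIO(records_json), orient="records")
print(df)
#    id   name
# 0   1  alpha
# 1   2   beta
```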

If you want to read an Excel Binary Workbook, specify the excel_binary format in the File Format selection.

Modifying source column data types

By default, Airbyte attempts to determine the data type of each source column automatically. However, using the reader_options, you can manually override these inferred data types.

For example, to interpret all columns as strings, you can use {"dtype":"string"}. If you want to specify a string data type for a particular column, use the following format: {"dtype" : {"column name": "string"}}. This will force the named column to be parsed as a string, regardless of the data it contains.
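
As an illustration of the difference this makes (the sample data is invented for the example), compare pandas' default type inference with an explicit dtype override:

```python
from io import StringIO

import pandas as pd

# Invented sample: leading zeros are lost if the column is inferred as an integer.
csv_data = "zip_code,amount\n01234,10.5\n98765,3.2\n"

inferred = pd.read_csv(StringIO(csv_data))
overridden = pd.read_csv(StringIO(csv_data), dtype={"zip_code": "string"})

print(inferred["zip_code"].tolist())    # [1234, 98765] -- parsed as integers
print(overridden["zip_code"].tolist())  # ['01234', '98765'] -- kept as strings
```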

Examples

Below are some examples of possible file inputs:

| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
| :--- | :--- | :--- | :--- | :--- | :--- |
| epidemiology | HTTPS | https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv | | | COVID-19 Public dataset on BigQuery |
| hr_and_financials | GCS | gs://airbyte-vault/financial.csv | smart_open or gcfs | {"type": "service_account", "private_key_id": "XXXXXXXX", ...} | Data from a private bucket, a service account is necessary |
| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) |

Examples with reader options:

| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
| :--- | :--- | :--- | :--- | :--- | :--- |
| landsat_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to specify a compression option to read_csv |
| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | TSV data separated by tabs without a header row, from AWS Open Data |

Example for SFTP:

| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": ["text"], "engine": "python"} | We use the python engine for read_csv in order to handle a delimiter of more than one character while providing our own column names |

Performance considerations

It's important to note that the JSON implementation may require adjustments to handle more complex catalogs, as it is currently in an experimental state. While simple JSON schemas should function properly, handling of multi-layered nesting may not be optimal.