# File import

## Overview

This page contains the setup guide and reference information for the File import source connector.
## Features

| Feature | Supported | Notes |
| --- | --- | --- |
| Full Refresh Sync | Yes | |
| Incremental Sync | No | |
| Replicate Incremental Deletes | No | |
| Replicate Folders (multiple files) | No | |
| Replicate Glob Patterns (multiple files) | No | |
### File / Stream Compression

| Compression | Supported? |
| --- | --- |
| Gzip | Yes |
| Zip | No |
| Bzip2 | No |
| Lzma | No |
| Xz | No |
| Snappy | No |
### Storage Providers

| Storage Provider | Supported? |
| --- | --- |
| HTTPS | Yes |
| Google Cloud Storage | Yes |
| Amazon Web Services S3 | Yes |
| AzBlob: Azure Blob Storage | Yes |
| SFTP | Yes |
| SSH / SCP | Yes |
| Local filesystem | No |
### File Formats

| Format | Supported? |
| --- | --- |
| CSV | Yes |
| JSON | Yes |
| HTML | No |
| XML | No |
| Excel | Yes |
| Excel Binary Workbook | Yes |
| Feather | Yes |
| Parquet | Yes |
| Pickle | No |
| YAML | Yes |
## Getting started

### Requirements

To use the File import source, make sure you have the following:

- File access URL
- File format
- Reader configuration
- Storage provider

See the setup guide for more information about how to create the required resources.
### Setting up the File import source

1. Select the File import source.
2. Specify the `Storage Provider`, `URL`, and `File Format`.
3. Depending on the storage provider selected, configure the additional settings it requires.
### Fields description

- For `Dataset Name`, enter the desired name of the final table to which this file will be replicated (only letters, numbers, dashes, and underscores are allowed).
- For `File Format`, select the format of the file you want to replicate.
- For `Reader Options`, input a JSON-formatted string. This provides additional configuration depending on the chosen file format. For instance, use `{}` for no extra options, or `{"sep": " "}` to set the separator to a single space `' '`.
- For `URL`, provide the URL path (not a directory path) of the file to be replicated (use `s3://` for S3 and `gs://` for GCS, respectively).
- For `Storage Provider`, specify the storage provider or location where the file(s) to be replicated are located. Options include:
  - Public Web [Default]
    - Activate `User-Agent` if you want to include a User-Agent header in your requests.
  - GCS: Google Cloud Storage
    - [Private buckets] Provide `service account JSON` credentials with the right permissions, as described in the Google Cloud documentation. Generate the `credentials.json` file and copy its content into this field (JSON format expected).
    - [Public buckets] No configuration is required.
  - S3: Amazon Web Services
    - [Private buckets] Provide the `AWS Access Key ID` and the `AWS Secret Access Key`.
    - [Public buckets] No further configuration is required.
  - AzBlob: Azure Blob Storage
    - Provide the `Storage Account`: the globally unique name of the storage account containing your target blob. For more details, see the Azure documentation.
    - [Private buckets] Provide a SAS (Shared Access Signature) token or a storage account shared key (also known as an account key or access key).
    - [Public buckets] No further configuration is required.
  - SSH: Secure Shell
    - Provide the `username`, `password`, `host`, and `port` for your host.
  - SCP: Secure Copy Protocol
    - Provide the `username`, `password`, `host`, and `port` for your host.
  - SFTP: Secure File Transfer Protocol
    - Provide the `username`, `password`, `host`, and `port` for your host.
## Reader options

The reading mechanism of this source connector is based on Pandas IO Tools, which provides a flexible and efficient way to load files into a Pandas DataFrame. This can be customized using `reader_options`, a JSON-formatted parameter whose available options vary according to the selected file format. Refer to the respective Pandas documentation for the specific options available for each format.

For instance, if you select the CSV format, you can use options from the `read_csv` function, such as:

- Customizing the delimiter with `sep` for tab-separated files.
- Skipping or overriding the header line with `header=0` and `names`, respectively.
- Parsing dates in specified columns with `parse_dates`.
Here's an example of how you might use `reader_options` for a CSV file:

```json
{ "sep": "\t", "header": 0, "names": ["column1", "column2"], "parse_dates": ["column2"] }
```
If you opt for the JSON format, options from the `read_json` reader are available. For example, you can change the orientation of the loaded data using `{"orient": "records"}` (where the data has the format `[{column -> value}, …, {column -> value}]`).
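As a minimal sketch of that orientation (the inline sample data is an assumption for the example), `orient="records"` turns each object in a JSON array into one row:

```python
import io

import pandas as pd

# Hypothetical sample: a JSON array of records, i.e. [{column -> value}, ...].
sample = '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'

# Equivalent to reader_options {"orient": "records"}: one row per object.
df = pd.read_json(io.StringIO(sample), orient="records")
print(df)
```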
If you want to read an Excel Binary Workbook, specify the `excel_binary` format in the `File Format` selection.
### Modifying source column data types

By default, Airbyte attempts to determine the data type of each source column automatically. Using `reader_options`, however, you can manually override these inferred data types.

For example, to interpret all columns as strings, use `{"dtype": "string"}`. To specify a string data type for one particular column, use the format `{"dtype": {"column name": "string"}}`. This forces the named column to be parsed as a string, regardless of the data it contains.
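A minimal sketch of the effect (the sample data and the `zip_code` column name are assumptions for the example):

```python
import io

import pandas as pd

# Leading zeros are lost if the column is inferred as an integer.
sample = "city,zip_code\nBoston,02134\nAustin,73301\n"

inferred = pd.read_csv(io.StringIO(sample))
forced = pd.read_csv(io.StringIO(sample), dtype={"zip_code": "string"})

print(inferred["zip_code"].iloc[0])  # 2134  (leading zero dropped)
print(forced["zip_code"].iloc[0])    # 02134 (preserved as a string)
```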
## Examples

Below are some examples of possible file inputs:

| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
| --- | --- | --- | --- | --- | --- |
| epidemiology | HTTPS | https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv | | | COVID-19 Public dataset on BigQuery |
| hr_and_financials | GCS | gs://airbyte-vault/financial.csv | smart_open or gcfs | {"type": "service_account", "private_key_id": "XXXXXXXX", ...} | Data from a private bucket; a service account is necessary |
| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// prefix is optional too, same for other providers) |
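To illustrate the mechanism the `smart_open` rows above rely on, here is a rough sketch, not the connector's actual code: `smart_open` resolves the URL scheme, hands pandas a file-like object, and decompresses `.gz` files transparently based on the extension.

```python
import pandas as pd
from smart_open import open as s_open  # e.g. pip install "smart_open[gcs]"

# smart_open resolves the scheme (gs://, s3://, https://, ...) and
# transparently decompresses .gz based on the file extension.
with s_open("gs://gcp-public-data-landsat/index.csv.gz", "r") as f:
    df = pd.read_csv(f, nrows=5)  # read only a few rows for the demo

print(df.head())
```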
Examples with reader options:

| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
| --- | --- | --- | --- | --- | --- |
| landsat_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to pass a compression option to read_csv |
| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | TSV data separated by tabs without a header row, from AWS Open Data |
Example for SFTP:

| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": ["text"], "engine": "python"} | We use the python engine for read_csv in order to handle a delimiter of more than one character while providing our own column names. |
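A minimal sketch of why the python engine is needed here (the inline sample text is an assumption for the example):

```python
import io

import pandas as pd

# Hypothetical plain-text content with Windows-style line endings.
sample = io.StringIO("first line\r\nsecond line\r\n", newline="")

# read_csv's default C engine only supports single-character separators;
# engine="python" accepts a multi-character sep such as "\r\n", which,
# per the example above, yields one row per line in the "text" column.
df = pd.read_csv(
    sample,
    sep="\r\n",
    header=None,
    names=["text"],
    engine="python",
)
print(df)
```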
## Performance considerations

Note that the JSON implementation is currently in an experimental state and may require adjustments to handle more complex catalogs. While simple JSON schemas should work properly, multi-layered nesting may not be handled optimally.