Ingest data
Before we create a data pipeline, we'll need data to work with. This section explores data assets, the asset lifecycle, and how to set up our first source assets.
Ingestion
Y42 follows the ELT paradigm for processing data: we extract data from a source, then load it, and finally transform it. Extracting data from its source and loading it into a data warehouse is what we call ingestion.
Within Y42, there are over 500 connectors we can use for data ingestion, allowing us to pick and choose the most useful ones. All data ingestion starts with creating a source asset.
About assets
Assets are the building blocks of data pipelines. They are definitions in code that when executed materialize tables in our data warehouse. We call this process a data build. For example, building a source asset results in a source table, and building a staging model results in a staging table (more on those in section 4).
Asset lifecycle
It helps to understand the four lifecycle stages that every asset goes through.
- Definition: This is the initial phase, where we define the asset either directly as code or through the Y42 interface.
- Version Control and CI/CD: After defining the asset, we save and track changes by committing and pushing them to Git. This ensures that the asset is properly stored and that we can fix any breaking changes by reverting to a previous version.
- Building: Once the changes have been successfully saved to the remote Git repository, we can build the asset with the `y42 build` command. At this point, Y42 creates the table in our data warehouse and runs any data tests we have defined.
- Publication: Once tested and verified, we can publish a specific version of an asset on different branches, such as `main` or `develop`. This is when the version of an asset becomes the production version that most people on our team will see. We can set read, write, and discover permissions at the organization, space, and asset-type level to restrict access to published assets as necessary.
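To make the stages concrete, here is a minimal command-line sketch of the lifecycle for a single asset. The file path is hypothetical, and within the Y42 UI these steps are performed for you; treat this as an illustration, not the exact workflow.

```sh
# Hypothetical asset file path; your repository layout may differ.
git add sources/raw_space_fuel_shop_data.yml        # 1. Definition (as code)
git commit -m "ingest: define Space Fuel Shop source"
git push                                            # 2. Version control; CI/CD checks run
y42 build -s source:raw_space_fuel_shop_data.*      # 3. Building: materialize and test
# 4. Publication of the built version is then done per branch (e.g., main) in Y42.
```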
Create source assets
Now that we have some background knowledge on assets, let's start with defining source assets. The first source we'll ingest is the Space Fuel Shop dataset. This dataset is hosted in a Postgres database and details orders from intergalactic gas stations. We'll use Airbyte's Postgres connector to ingest the data.
Create a working branch
As you may have gathered from the asset lifecycle, Y42 follows the GitOps paradigm. We'll dive further into this in section 6, but let's implement a best practice immediately. We should create a new branch whenever we want to make changes to our pipelines — such as adding a data source. This gives us a safe environment to work in without disrupting live pipelines.
To create a new branch, click on `main [default]` at the top center of the screen and type in a name for the new branch (e.g., `getting-started-guide`). Then hit Enter; that's all there is to it.
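For those who like to think in Git terms, creating a branch in the UI corresponds to the standard Git operation below (a sketch using the example branch name; Y42 manages the branch for you):

```sh
# Create and switch to a new working branch off main.
git switch -c getting-started-guide
```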
Create a new source asset
To ingest the Space Fuel Shop dataset, we'll use Airbyte's Postgres connector:
- In the left panel, select the `List` view under `Asset Editor`.
- Press `+` in the file navigator.
- Select `Source`.
- Select `Ingest`.
- Select `Postgres (Airbyte)` as the Ingest type.
- Name the source: `raw_space_fuel_shop_data`.
Now, we'll configure the source asset to point to the right Postgres database and pull in all of the data.
Add a connection secret
Source assets (generally) need a secret to authorize their connection to the data source.
- In the `Authorization` panel, click on the search box `Search a secret`. When no secrets exist yet, you'll be prompted to add a new secret. If secrets do exist, you can select one of them or click on `Add new`.
- Name the secret `postgres_space_fuel_shop_data_{YOUR_NAME}`, replacing `YOUR_NAME` with your name.
- Insert the following type-specific configuration variables (an optional way to verify them follows this list):

  | Variable      | Value            |
  | ------------- | ---------------- |
  | Host          | `35.198.165.208` |
  | Port          | `5432` (default) |
  | Database Name | `y42-demo-data`  |
  | Username      | `public_reader`  |
  | Password      | `dontpanic`      |

- Set `Replication Method` to `Standard`.
- Set `SSH Tunnel Method` to `No Tunnel`.
- Leave `JDBC URL Parameters (Advanced)` empty.
- Set `SSL Modes` to `allow`.
When you have entered all authorization settings, click `Next` in the `Settings` panel and wait for the schema to load in the `Schema selection` panel.
Select the tables and columns you would like to import
- Open the `Schema` panel.
- Wait for the schema to load. For larger datasets, this may take a few minutes.
- Select all three tables: `public/payments`, `public/orders`, and `public/customers`.
- You can open the dropdown of a single table to see the individual columns and types. For now, just import all columns. (An optional cross-check against the source follows this list.)
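If you'd like to cross-check what the `Schema` panel shows, you can describe a table directly in Postgres. This is an optional check outside the Y42 flow, using the same demo credentials as before:

```sh
# Show column names and Postgres types for the orders table.
PGPASSWORD=dontpanic psql \
  "host=35.198.165.208 port=5432 dbname=y42-demo-data user=public_reader sslmode=allow" \
  -c '\d public.orders'
```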
Commit the new source assets
Our first source is now fully configured. Before we can build the tables and ingest the data, we'll need to commit and push our changes.
- Click on the `Commit & Push` button at the top.
- Enter a name for your commit: `ingest: add Space Fuel Shop data source`.
- Press `Commit`.
- Wait for Y42's pre-configured checks to run.
Y42 will automatically push the changes to the Git remote.
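Conceptually, this corresponds to a plain Git commit and push. A rough command-line equivalent, assuming your asset changes are already staged:

```sh
git commit -m "ingest: add Space Fuel Shop data source"
git push   # Y42's pre-configured checks run against the pushed commit
```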
Build source assets
With our source successfully set up, we can build the three tables and load them into the data warehouse. This is where the actual ingestion occurs: we take the raw data from the Postgres source and move it to a place where Y42 can work with it.
Build: materialize the table
- Open the drawer at the bottom of your screen.
- Select the `Build` tab.
- Insert the build command to build and import all three tables: `y42 build -s source:raw_space_fuel_shop_data.*` (more on the selector syntax after this list).
- Click on `Visualize DAG` to see the assets this command will build.
- Click on `Build now`. You'll see a new row added to the build window. As the blue quarter bar indicates, the build is in queueing mode.
- Click on the build row to see more details, where you'll see the pending jobs. You can also navigate to `Logs` to view build logs. If you click on an individual build job, you'll see that job's details.
- After a couple of minutes, the build should be `Ready`.
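The `-s` selector with the `.*` wildcard matches every table in the source. Assuming the selector follows this dbt-style pattern (an extrapolation on our part, not something this guide confirms), rebuilding a single table might look like:

```sh
# Assumed single-table selector; would build only the orders table.
y42 build -s source:raw_space_fuel_shop_data.orders
```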
Publication: add metadata
We'll add metadata to the assets to get the most out of our data. Y42 has extensive metadata features, which are especially helpful when working with many data assets and experts. For now, we'll focus on just a few of them: column types, column tests, and column descriptions.
- Click on `raw_orders` in the left file selector.
- Click `Sync columns`.
- Seven source columns plus the `_Y42_EXTRACTED_AT` system column should be added. If any of them do not have the correct `Type`, update it by double-clicking on the `Type` and entering the appropriate one:

  | Column             | Type            |
  | ------------------ | --------------- |
  | ID                 | `string`        |
  | CUSTOMER_ID        | `string`        |
  | UPDATED_AT         | `timestamp_ntz` |
  | FULFILLMENT_METHOD | `string`        |
  | SERVICE_STATION    | `string`        |
  | LINE_ITEMS         | `string`        |
  | STATUS             | `int`           |
  | _Y42_EXTRACTED_AT  | `timestamp_ntz` |

- Add the following column tests:

  | Column             | Test            | Values                     |
  | ------------------ | --------------- | -------------------------- |
  | ID                 | Unique          |                            |
  | CUSTOMER_ID        | Not null        |                            |
  | FULFILLMENT_METHOD | Accepted Values | `SERVICED`, `SELF_SERVICE` |

- Add the following descriptions:

  | Column     | Description                                                    |
  | ---------- | -------------------------------------------------------------- |
  | UPDATED_AT | Time at which the order was published to the database, in CET. |
  | STATUS     | 0 for canceled, 1 for declined, 2 for accepted                 |

- Commit and push the changes with an appropriate commit title (e.g., `meta: add raw_orders metadata`).
You can add more extensive tests and metadata, of course! As a general practice, it is a good idea to have at least a description for every column and the appropriate data type selected.
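It can also pay off to sanity-check a test against the source before relying on it. For instance, the `Accepted Values` test on `FULFILLMENT_METHOD` should already hold in the raw Postgres data. The query below verifies that, assuming the source column is lowercase `fulfillment_method` (the Postgres convention; the uppercase names above are how the warehouse renders them):

```sh
# Should return only SERVICED and SELF_SERVICE if the accepted-values test is sound.
PGPASSWORD=dontpanic psql \
  "host=35.198.165.208 port=5432 dbname=y42-demo-data user=public_reader sslmode=allow" \
  -c 'SELECT DISTINCT fulfillment_method FROM public.orders;'
```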
View the data
We can now take a sneak peek at the data to see how it's shaping up:
- Open the bottom drawer.
- Navigate to `Data`, where you will see a preview of the row count.
- Wait for the data to load and verify that it shows up. (An optional cross-check against the source follows below.)
Up next
Now that we've created source tables in the data warehouse, we can start building a pipeline that combines the data and transforms it into new models!