How to Scale Your Data Pipelines and Data Products with Contract Testing and dbt

First, we need to add two new dbt packages, dbt-expectations and dbt-utils, which allow us to make assertions on the schema of our sources and on their accepted values.

# packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

  - package: calogica/dbt_expectations
    version: 0.8.5

Testing the data sources

Let’s start by defining a contract test for our first source. We pull data from raw_height, a table that contains height information from the users of the gym app.

We agree with our data producers that we’ll receive the height measurement, the units for the measurements, and the user ID. We agree on the data types and that only ‘cm’ and ‘inches’ are supported as units. With all this, we can define our first contract in the dbt source YAML file.
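As a minimal sketch, a source contract along these lines could look like the following. The file path, the source name gym_app, the column names (user_id, height, units) and the column types are assumptions made for illustration; only raw_height, the accepted units and the "contract-test-source" tag come from the text. The type names also depend on your warehouse.

```yaml
# models/staging/gym_app/_gym_app__sources.yml  (hypothetical path)
version: 2

sources:
  - name: gym_app                # assumed source name
    tables:
      - name: raw_height
        columns:
          - name: user_id        # assumed column name
            tests:
              - not_null:
                  config:
                    tags: ["contract-test-source"]
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer      # assumed type
                  config:
                    tags: ["contract-test-source"]

          - name: height         # assumed column name
            tests:
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: float        # assumed type
                  config:
                    tags: ["contract-test-source"]
              # The measurement must never be below 0.
              - dbt_utils.accepted_range:
                  min_value: 0
                  config:
                    tags: ["contract-test-source"]

          - name: units          # assumed column name
            tests:
              # Only the two units agreed with the data producers.
              - accepted_values:
                  values: ["cm", "inches"]
                  config:
                    tags: ["contract-test-source"]
```

Tagging each test individually is verbose, but it keeps the selection explicit: only assertions that are part of the agreed contract carry the tag.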

The building blocks

Looking at the previous test, we can see several macros from dbt-expectations and dbt-utils in use, alongside dbt’s built-in tests:

  • dbt_expectations.expect_column_values_to_be_of_type: This assertion allows us to define the expected column data type.
  • accepted_values: This assertion allows us to define a list of the accepted values for a specific column.
  • dbt_utils.accepted_range: This assertion allows us to define a numerical range for a given column. In the example, we expected the column’s value to be no less than 0.
  • not_null: Finally, built-in assertions like not_null allow us to define column constraints.

Using these building blocks, we added several tests to define the contract expectations described above. Notice also how we have tagged the tests as “contract-test-source”. This tag allows us to run all contract tests in isolation, both locally and, as we will see later, in the CI/CD pipeline:

dbt test --select tag:contract-test-source

We have seen how quickly we can create contract tests for the sources of our dbt app, but what about the public interfaces of our data pipeline or data product?

As data producers, we want to be sure that we’re producing data according to the expectations of our data consumers, so that we can fulfill the contract we have with them and make our data pipeline or data product trustworthy and reliable.

A simple way to ensure that we’re meeting our obligations to our data consumers is to add contract testing for our public interfaces.

dbt recently released a new feature for SQL models, model contracts, that allows us to define the contract for a dbt model. While building your model, dbt verifies that the model’s transformation produces a dataset matching its contract; if it does not, the build fails.

Let’s see it in action. Our mart, body_mass_indexes, produces a BMI metric from the weight and height measurement data we get from our sources. The contract with our data consumers establishes the following:

  • Data types for each column.
  • User IDs cannot be null.
  • User IDs are always greater than 0.

Let’s define the contract of the body_mass_indexes model using dbt model contracts:
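A minimal sketch of such a model specification is shown below. The file name and every column name and type other than user_id are assumptions for illustration. An enforced contract requires every column the model outputs to be declared with a data_type, so the list has to be adapted to the real model.

```yaml
# models/marts/_marts__models.yml  (hypothetical file name)
version: 2

models:
  - name: body_mass_indexes
    config:
      contract:
        enforced: true           # dbt checks the contract on every build
    columns:
      - name: user_id
        data_type: int
        constraints:
          - type: not_null
          - type: check          # custom expression constraint
            expression: "user_id > 0"

      # The remaining columns are assumed names and types.
      - name: created_date
        data_type: date
      - name: weight
        data_type: float
      - name: height
        data_type: float
      - name: bmi
        data_type: float
```

Whether a given constraint type is actually enforced depends on the data platform; some warehouses accept a constraint definition without enforcing it.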

The building blocks

Looking at the previous model specification file, we can see several metadata fields that allow us to define the contract.

  • contract.enforced: This configuration tells dbt that we want to enforce the contract every time the model is run.
  • data_type: This assertion allows us to define the column type we expect to produce once the model runs.
  • constraints: Finally, the constraints block lets us define useful constraints, such as requiring that a column is not null, setting primary keys, and adding custom expressions. In the example above we defined a constraint to tell dbt that the user_id must always be greater than 0. All the available constraints are listed in the dbt documentation.

Source contract tests vs dbt model contracts

One difference between the contract tests we defined for our sources and the ones defined for our marts or output ports is when the contracts are verified and enforced.

dbt enforces model contracts while the model is being built by ‘dbt run’, whereas contracts based on dbt tests are enforced when the dbt tests run.

If one of the model contracts is not satisfied, you will see an error when you execute ‘dbt run’, with specific details on the failure. You can see an example in the following dbt run console output.

1 of 4 START sql table model dbt_testing_example.stg_gym_app__height ........... [RUN]
2 of 4 START sql table model dbt_testing_example.stg_gym_app__weight ........... [RUN]
2 of 4 OK created sql table model dbt_testing_example.stg_gym_app__weight ...... [SELECT 4 in 0.88s]
1 of 4 OK created sql table model dbt_testing_example.stg_gym_app__height ...... [SELECT 4 in 0.92s]
3 of 4 START sql table model dbt_testing_example.int_weight_measurements_with_latest_height [RUN]
3 of 4 OK created sql table model dbt_testing_example.int_weight_measurements_with_latest_height [SELECT 4 in 0.96s]
4 of 4 START sql table model dbt_testing_example.body_mass_indexes ............. [RUN]
4 of 4 ERROR creating sql table model dbt_testing_example.body_mass_indexes .... [ERROR in 0.77s]

Finished running 4 table models in 0 hours 0 minutes and 6.28 seconds (6.28s).

Completed with 1 error and 0 warnings:

Database Error in model body_mass_indexes (models/marts/body_mass_indexes.sql)
new row for relation "body_mass_indexes__dbt_tmp" violates check constraint
"body_mass_indexes__dbt_tmp_user_id_check1"
DETAIL: Failing row contains (1, 2009-07-01, 82.5, null, null).
compiled Code at target/run/dbt_testing_example/models/marts/body_mass_indexes.sql

We now have a suite of powerful contract tests, but how and when do we run them?

We can run contract tests in two types of pipelines.

  • CI/CD pipelines
  • Data pipelines

For example, you can execute the source contract tests on a schedule in a CI/CD pipeline targeting the data sources available in lower environments like test or staging. You can set the pipeline to fail every time the contract is not met.

These failures provide valuable information about contract-breaking changes introduced by other teams before those changes reach production.
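As an illustration, a scheduled CI job for the source contract tests might be sketched as follows. The article does not name a CI system, so GitHub Actions is an assumption here, as are the workflow file name, the dbt-postgres adapter, the staging target name, the location of profiles.yml and the secret name.

```yaml
# .github/workflows/source-contract-tests.yml  (hypothetical file)
name: source-contract-tests

on:
  schedule:
    - cron: "0 6 * * *"          # daily at 06:00 UTC; choose your own cadence
  workflow_dispatch: {}          # also allow manual runs

jobs:
  contract-tests:
    runs-on: ubuntu-latest
    env:
      DBT_PROFILES_DIR: .        # assumes profiles.yml lives in the repo root
      DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}   # hypothetical secret name
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-postgres             # assumed adapter
      - name: Install dbt packages
        run: dbt deps
      - name: Run source contract tests against staging
        run: dbt test --select tag:contract-test-source --target staging
```

Because dbt test exits with a non-zero status when a test fails, the job fails whenever a source stops meeting its contract, with no extra scripting required.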
