dbt (data build tool) is an open source Python package that enables data analysts and engineers to transform the data in their warehouses using the same practices that software engineers use to build applications. With dbt, you express each step of your transformation pipeline as a SQL query. In this scenario, you will install the dbt Python package and create a sample dbt project. You will configure a database connection for dbt, write your first dbt models, and run a dbt pipeline. Finally, you will gain insight into writing and running tests with dbt.
%%capture
!pip install psycopg2-binary
!pip install boto3
!pip install dbt
!pip install awscli
!apt-get --quiet install tree
!pip install ipython-sql
%reload_ext sql
# !mkdir -p ~/.aws
# %%writefile ~/.aws/credentials
# [default]
# aws_access_key_id=
# aws_secret_access_key=
# %%writefile ~/.aws/config
# [default]
# region=us-east-1
# output=json
!mkdir -p ~/.aws && cp /content/drive/MyDrive/AWS/d01_admin/* ~/.aws
!chmod 600 ~/.aws/credentials
import os
import boto3
import json
from sqlalchemy import create_engine
import pandas as pd
def get_secret(secret_name):
    region_name = "us-east-1"
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name)
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    get_secret_value_response = json.loads(get_secret_value_response['SecretString'])
    return get_secret_value_response
db_credentials = get_secret(secret_name='dev/detraining/rds')
USERNAME = db_credentials["sparsh_rds_postgres_username"]
PASSWORD = db_credentials["sparsh_rds_postgres_password"]
HOST = db_credentials["sparsh_rds_postgres_host"]
PORT = db_credentials["sparsh_rds_postgres_port"]
DBNAME = "postgres"
CONN = f"postgresql://{USERNAME}:{PASSWORD}@{HOST}:{PORT}/{DBNAME}"
In this step, we will download the data, load it into a pandas DataFrame, and then load it into the PostgreSQL database.
!aws s3 ls s3://nyc-tlc/trip\ data/
# download the data
# data source - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
!wget -q --show-progress https://s3.amazonaws.com/nyc-tlc/trip\ data/yellow_tripdata_2019-01.parquet
!wget -q --show-progress https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
yellow_tripdata_201 100%[===================>] 105.32M  72.5MB/s    in 1.5s
taxi+_zone_lookup.c 100%[===================>]  12.03K  --.-KB/s    in 0s
# read the data into a pandas dataframe
import pandas as pd
yellow_tripdata_2019_df = pd.read_parquet('yellow_tripdata_2019-01.parquet')
yellow_tripdata_2019_df.head()
 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1.0 | 1.5 | 1.0 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | NaN | None |
1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1.0 | 2.6 | 1.0 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.00 | 0.0 | 0.3 | 16.30 | NaN | None |
2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3.0 | 0.0 | 1.0 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 5.80 | NaN | None |
3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5.0 | 0.0 | 1.0 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 7.55 | NaN | None |
4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5.0 | 0.0 | 2.0 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 55.55 | NaN | None |
# let's also see the schema
yellow_tripdata_2019_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7696617 entries, 0 to 7696616
Data columns (total 19 columns):
 #   Column                 Dtype
---  ------                 -----
 0   VendorID               int64
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64
 4   trip_distance          float64
 5   RatecodeID             float64
 6   store_and_fwd_flag     object
 7   PULocationID           int64
 8   DOLocationID           int64
 9   payment_type           int64
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
 18  airport_fee            object
dtypes: datetime64[ns](2), float64(11), int64(4), object(2)
memory usage: 1.1+ GB
# select only the few columns that we are interested in
yellow_tripdata_2019_df = yellow_tripdata_2019_df[['VendorID',
'tpep_pickup_datetime',
'tpep_dropoff_datetime',
'passenger_count',
'PULocationID',
'DOLocationID',
'fare_amount']]
yellow_tripdata_2019_df.columns = ['vendor_id',
'pickup_datetime',
'dropoff_datetime',
'passenger_count',
'pickup_location_id',
'dropoff_location_id',
'fare_amount']
# check the schema again
yellow_tripdata_2019_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7696617 entries, 0 to 7696616
Data columns (total 7 columns):
 #   Column               Dtype
---  ------               -----
 0   vendor_id            int64
 1   pickup_datetime      datetime64[ns]
 2   dropoff_datetime     datetime64[ns]
 3   passenger_count      float64
 4   pickup_location_id   int64
 5   dropoff_location_id  int64
 6   fare_amount          float64
dtypes: datetime64[ns](2), float64(2), int64(3)
memory usage: 411.0 MB
# let's also load the taxi zone lookup data into a pandas dataframe
taxi_zone_lookup = pd.read_csv('taxi+_zone_lookup.csv')
taxi_zone_lookup.head()
 | LocationID | Borough | Zone | service_zone
---|---|---|---|---
0 | 1 | EWR | Newark Airport | EWR |
1 | 2 | Queens | Jamaica Bay | Boro Zone |
2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
3 | 4 | Manhattan | Alphabet City | Yellow Zone |
4 | 5 | Staten Island | Arden Heights | Boro Zone |
# rename the columns
taxi_zone_lookup.columns = ['locationid','borough','zone','service_zone']
# check the schema
taxi_zone_lookup.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   locationid    265 non-null    int64
 1   borough       265 non-null    object
 2   zone          264 non-null    object
 3   service_zone  263 non-null    object
dtypes: int64(1), object(3)
memory usage: 8.4+ KB
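The info() output shows that zone and service_zone contain missing values (264 and 263 non-null out of 265 rows). It is worth a quick look at those rows now, since we will add not_null tests on these columns later:
# inspect the lookup rows with missing zone or service_zone values
taxi_zone_lookup[taxi_zone_lookup['zone'].isnull() | taxi_zone_lookup['service_zone'].isnull()]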
# load the data into our postgres database
from sqlalchemy import create_engine
import psycopg2
alchemyEngine = create_engine(CONN, pool_recycle=3600)
postgreSQLConnection = alchemyEngine.connect()
yellow_tripdata_2019_df.sample(100000).to_sql('yellow_tripdata_sample_2019_01', postgreSQLConnection, if_exists='replace')
taxi_zone_lookup.to_sql('taxi_zone_lookup', postgreSQLConnection, if_exists='replace')
postgreSQLConnection.close()
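Note that we only load a 100,000-row sample of the trip data to keep the demo fast. If you wanted to load the full ~7.7M rows, to_sql supports chunked, multi-row inserts (a sketch with a hypothetical target table name; expect this to take a while on a small RDS instance):
# hypothetical: load the full dataframe in chunks instead of a sample
# yellow_tripdata_2019_df.to_sql('yellow_tripdata_2019_01', alchemyEngine,
#                                if_exists='replace', chunksize=50000, method='multi')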
# confirm that the data is loaded
%sql {CONN}
%sql select * from yellow_tripdata_sample_2019_01 limit 10;
 * postgresql://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com:5432/postgres
10 rows affected.
index | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_location_id | dropoff_location_id | fare_amount |
---|---|---|---|---|---|---|---|
6317675 | 1 | 2019-01-26 19:14:29 | 2019-01-26 19:36:30 | 1.0 | 138 | 24 | 24.0 |
3641213 | 2 | 2019-01-16 09:39:38 | 2019-01-16 10:00:58 | 1.0 | 263 | 87 | 23.5 |
5593383 | 1 | 2019-01-24 08:02:05 | 2019-01-24 08:09:41 | 1.0 | 162 | 137 | 6.5 |
3754727 | 1 | 2019-01-16 17:07:03 | 2019-01-16 17:16:20 | 1.0 | 125 | 231 | 7.0 |
6127276 | 2 | 2019-01-26 00:10:28 | 2019-01-26 00:21:10 | 1.0 | 79 | 234 | 9.0 |
277605 | 2 | 2019-01-02 14:51:35 | 2019-01-02 15:05:33 | 2.0 | 170 | 113 | 11.0 |
5182683 | 2 | 2019-01-22 16:16:13 | 2019-01-22 16:29:47 | 1.0 | 246 | 186 | 9.5 |
1247101 | 1 | 2019-01-06 19:19:28 | 2019-01-06 19:29:39 | 1.0 | 79 | 113 | 9.0 |
871245 | 2 | 2019-01-05 03:49:43 | 2019-01-05 04:09:31 | 3.0 | 137 | 87 | 14.0 |
1869997 | 2 | 2019-01-09 14:57:02 | 2019-01-09 15:01:59 | 1.0 | 144 | 79 | 5.0 |
%sql select * from taxi_zone_lookup limit 10;
 * postgresql://postgres:***@database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com:5432/postgres
10 rows affected.
index | locationid | borough | zone | service_zone |
---|---|---|---|---|
0 | 1 | EWR | Newark Airport | EWR |
1 | 2 | Queens | Jamaica Bay | Boro Zone |
2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
3 | 4 | Manhattan | Alphabet City | Yellow Zone |
4 | 5 | Staten Island | Arden Heights | Boro Zone |
5 | 6 | Staten Island | Arrochar/Fort Wadsworth | Boro Zone |
6 | 7 | Queens | Astoria | Boro Zone |
7 | 8 | Queens | Astoria Park | Boro Zone |
8 | 9 | Queens | Auburndale | Boro Zone |
9 | 10 | Queens | Baisley Park | Boro Zone |
# initialize a dbt project
!dbt init dbt_demo
Running with dbt=0.21.1
Creating dbt configuration folder at /root/.dbt
With sample profiles.yml for redshift
Your new dbt project "dbt_demo" was created! If this is your first time
using dbt, you'll need to set up your profiles.yml file -- this file will
tell dbt how to connect to your database. You can find this file by running:

  xdg-open /root/.dbt

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

  https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

  https://community.getdbt.com/

Happy modeling!
# go into the newly created directory
%cd dbt_demo
/content/dbt_demo
# let's see what's all there
!tree --du -h -C .
.
├── [4.0K]  analysis
├── [4.0K]  data
├── [1.3K]  dbt_project.yml
├── [4.0K]  macros
├── [9.0K]  models
│   └── [5.0K]  example
│       ├── [ 475]  my_first_dbt_model.sql
│       ├── [ 115]  my_second_dbt_model.sql
│       └── [ 437]  schema.yml
├── [ 571]  README.md
├── [4.0K]  snapshots
└── [4.0K]  tests

  35K used in 7 directories, 5 files
# let's also see what's inside the `.dbt` folder
!tree --du -h -C ~/.dbt
/root/.dbt
└── [ 456] profiles.yml
4.4K used in 0 directories, 1 file
profiles = f"""
default:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: {HOST}
      port: {PORT}
      user: {USERNAME}
      pass: {PASSWORD}
      dbname: {DBNAME}
      schema: public
    prod:
      type: postgres
      threads: 1
      host: {HOST}
      port: {PORT}
      user: {USERNAME}
      pass: {PASSWORD}
      dbname: {DBNAME}
      schema: public
  target: dev
"""
%store profiles > ~/.dbt/profiles.yml
Writing 'profiles' (str) to file '/root/.dbt/profiles.yml'.
!dbt debug
Running with dbt=0.21.1
dbt version: 0.21.1
python version: 3.7.13
python path: /usr/bin/python3
os info: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Using profiles.yml file at /root/.dbt/profiles.yml
Using dbt_project.yml file at /content/dbt_demo/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: database-1.ciykztisaaxg.us-east-1.rds.amazonaws.com
  port: 5432
  user: postgres
  database: postgres
  schema: public
  search_path: None
  keepalives_idle: 0
  sslmode: None
  Connection test: [OK connection ok]

All checks passed!
!tree ./models
./models
└── example
    ├── my_first_dbt_model.sql
    ├── my_second_dbt_model.sql
    └── schema.yml

1 directory, 3 files
Note: Model files in dbt look very much like regular SQL queries, but they may contain additional instructions written in Jinja template syntax, i.e., inside double curly braces {{ }}. For example, by using {{ ref('my_first_dbt_model') }} you can reference a model called my_first_dbt_model and use it in a SQL query as you would any regular database table.
Navigate to the dbt_demo/models/example folder. You will see two .sql files and one .yml file. These are example models generated by the dbt init step:
my_first_dbt_model.sql creates some basic data consisting of two rows. It also contains the statement {{ config(materialized='table') }}, which instructs dbt to materialize the model as a table in the database.
my_second_dbt_model.sql references the first model using the {{ ref('my_first_dbt_model') }} syntax and selects all rows with id = 1 from the first model.
schema.yml contains metadata about both models, such as a human-readable description, the column schema, and tests for each model.
!cat ./models/example/my_first_dbt_model.sql
/*
    Welcome to your first dbt model!
    Did you know that you can also configure models directly within SQL files?
    This will override configurations stated in dbt_project.yml

    Try changing "table" to "view" below
*/

{{ config(materialized='table') }}

with source_data as (

    select 1 as id
    union all
    select null as id

)

select *
from source_data

/*
    Uncomment the line below to remove records with null `id` values
*/

-- where id is not null
!cat ./models/example/my_second_dbt_model.sql
-- Use the `ref` function to select from other models

select *
from {{ ref('my_first_dbt_model') }}
where id = 1
!cat ./models/example/schema.yml
version: 2

models:
  - name: my_first_dbt_model
    description: "A starter dbt model"
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null

  - name: my_second_dbt_model
    description: "A starter dbt model"
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null
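Before running anything against the database, you can ask dbt to render the Jinja templates into plain SQL. dbt compile writes the rendered queries under target/compiled (the path below assumes the default project name my_new_project from the generated dbt_project.yml):
!dbt compile
!cat target/compiled/my_new_project/models/example/my_second_dbt_model.sql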
!dbt run
Running with dbt=0.21.1
Found 2 models, 4 tests, 0 snapshots, 0 analyses, 162 macros, 0 operations, 0 seed files, 0 sources, 0 exposures

08:48:57 | Concurrency: 1 threads (target='dev')
08:48:57 |
08:48:57 | 1 of 2 START table model public.my_first_dbt_model................... [RUN]
08:48:58 | 1 of 2 OK created table model public.my_first_dbt_model.............. [SELECT 2 in 0.12s]
08:48:58 | 2 of 2 START view model public.my_second_dbt_model................... [RUN]
08:48:58 | 2 of 2 OK created view model public.my_second_dbt_model.............. [CREATE VIEW in 0.09s]
08:48:58 |
08:48:58 | Finished running 1 table model, 1 view model in 0.53s.

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
This means that dbt successfully parsed the two SQL queries in the model files, interpreted the instructions in the Jinja statements, and materialized the results in your PostgreSQL database as a table (my_first_dbt_model) and a view (my_second_dbt_model). Any consumer of the data can now query them to access the transformed data.
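You can verify this directly from the notebook; the ipython-sql connection opened earlier still points at the same database:
%sql select * from my_second_dbt_model;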
In this step, you will create your own dbt models to transform the sample data in the PostgreSQL database and run the pipeline.
Staging
These files are so-called staging models, a pattern commonly used in dbt to avoid having downstream models read raw data directly. Staging models typically just select from the source data and, where needed, apply light transformations such as column renames.
!mkdir -p ./models/taxi/staging
%%writefile ./models/taxi/staging/schema.yml
version: 2

sources:
  - name: source
    schema: public
    tables:
      - name: taxi_zone_lookup
      - name: yellow_tripdata_sample_2019_01

models:
  - name: taxi_zone_lookup_model
    description: "A list of all taxi zones with codes in NYC"
    columns:
      - name: locationid
        tests:
          - not_null
      - name: borough
        tests:
          - not_null
      - name: zone
        tests:
          - not_null
      - name: service_zone
        tests:
          - not_null
  - name: taxi_trips_model
    description: "A reduced version of yellow taxi trip data in NYC"
    columns:
      - name: vendor_id
        tests:
          - not_null
          - accepted_values:
              values: ['1', '2', '4']
      - name: pickup_datetime
        tests:
          - not_null
      - name: dropoff_datetime
        tests:
          - not_null
      - name: passenger_count
        tests:
          - not_null
      - name: pickup_location_id
        tests:
          - not_null
      - name: dropoff_location_id
        tests:
          - not_null
      - name: fare_amount
        tests:
          - not_null
Overwriting ./models/taxi/staging/schema.yml
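The accepted_values test will fail if vendor_id ever takes a value outside 1, 2, and 4. Since the raw sample table is already in Postgres, you can preview the actual distribution before the tests run:
%sql select vendor_id, count(*) from yellow_tripdata_sample_2019_01 group by vendor_id order by vendor_id;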
%%writefile ./models/taxi/staging/taxi_trips_model.sql
select
vendor_id,
pickup_datetime,
dropoff_datetime,
passenger_count,
pickup_location_id,
dropoff_location_id,
fare_amount
from {{ source('source', 'yellow_tripdata_sample_2019_01') }}
Overwriting ./models/taxi/staging/taxi_trips_model.sql
%%writefile ./models/taxi/staging/taxi_zone_lookup_model.sql
select
locationid,
borough,
zone,
service_zone
from {{ source('source', 'taxi_zone_lookup') }}
Overwriting ./models/taxi/staging/taxi_zone_lookup_model.sql
We will now create another dbt model that combines data from the two staging models. Suppose we want to join the staging tables on the location ID fields so that the taxi ride data carries the actual borough names for its pickup and dropoff locations.
%%writefile ./models/taxi/trips_with_borough_name_model.sql
select
t.vendor_id,
t.pickup_datetime,
t.dropoff_datetime,
z1.borough as pickup_borough,
z2.borough as dropoff_borough,
t.passenger_count,
t.fare_amount
from {{ ref('taxi_trips_model') }} t
left join {{ ref('taxi_zone_lookup_model') }} z1
on t.pickup_location_id = z1.locationid
left join {{ ref('taxi_zone_lookup_model') }} z2
on t.dropoff_location_id = z2.locationid
Overwriting ./models/taxi/trips_with_borough_name_model.sql
%%writefile ./models/taxi/schema.yml
version: 2

models:
  - name: trips_with_borough_name_model
    description: "Combines taxi rides with the borough names for pickup and dropoff locations."
    columns:
      - name: vendor_id
      - name: pickup_datetime
      - name: dropoff_datetime
      - name: pickup_borough
      - name: dropoff_borough
      - name: passenger_count
      - name: fare_amount
Overwriting ./models/taxi/schema.yml
The dbt_project.yml file defines, among other things, how the models in each directory are materialized. Modify its models: section to replace the example entry with our new taxi directory:
%%writefile dbt_project.yml
name: 'my_new_project'
version: '1.0.0'
config-version: 2
profile: 'default'
source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"

models:
  my_new_project:
    taxi:
      materialized: view
Overwriting dbt_project.yml
!dbt run
Running with dbt=0.21.1
Found 5 models, 16 tests, 0 snapshots, 0 analyses, 162 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

08:57:16 | Concurrency: 1 threads (target='dev')
08:57:16 |
08:57:16 | 1 of 5 START table model public.my_first_dbt_model................... [RUN]
08:57:16 | 1 of 5 OK created table model public.my_first_dbt_model.............. [SELECT 2 in 0.13s]
08:57:16 | 2 of 5 START view model public.taxi_trips_model...................... [RUN]
08:57:16 | 2 of 5 OK created view model public.taxi_trips_model................. [CREATE VIEW in 0.09s]
08:57:16 | 3 of 5 START view model public.taxi_zone_lookup_model................ [RUN]
08:57:16 | 3 of 5 OK created view model public.taxi_zone_lookup_model........... [CREATE VIEW in 0.06s]
08:57:16 | 4 of 5 START view model public.my_second_dbt_model................... [RUN]
08:57:16 | 4 of 5 OK created view model public.my_second_dbt_model.............. [CREATE VIEW in 0.07s]
08:57:16 | 5 of 5 START view model public.trips_with_borough_name_model......... [RUN]
08:57:16 | 5 of 5 OK created view model public.trips_with_borough_name_model.... [CREATE VIEW in 0.06s]
08:57:17 |
08:57:17 | Finished running 1 table model, 4 view models in 0.72s.

Completed successfully

Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
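As a quick sanity check, you can query the new joined view directly (using the ipython-sql connection from earlier):
%sql select pickup_borough, dropoff_borough, count(*) as trips from trips_with_borough_name_model group by pickup_borough, dropoff_borough order by trips desc limit 5;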
Run the schema tests defined for the taxi_trips_model staging model:
!dbt test -m taxi_trips_model
Running with dbt=0.21.1
Found 5 models, 16 tests, 0 snapshots, 0 analyses, 162 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

08:58:12 | Concurrency: 1 threads (target='dev')
08:58:12 |
08:58:12 | 1 of 8 START test accepted_values_taxi_trips_model_vendor_id__1__2__4 [RUN]
08:58:12 | 1 of 8 FAIL 1 accepted_values_taxi_trips_model_vendor_id__1__2__4.... [FAIL 1 in 0.08s]
08:58:12 | 2 of 8 START test not_null_taxi_trips_model_dropoff_datetime......... [RUN]
08:58:12 | 2 of 8 PASS not_null_taxi_trips_model_dropoff_datetime............... [PASS in 0.06s]
08:58:12 | 3 of 8 START test not_null_taxi_trips_model_dropoff_location_id...... [RUN]
08:58:12 | 3 of 8 PASS not_null_taxi_trips_model_dropoff_location_id............ [PASS in 0.05s]
08:58:12 | 4 of 8 START test not_null_taxi_trips_model_fare_amount.............. [RUN]
08:58:12 | 4 of 8 PASS not_null_taxi_trips_model_fare_amount.................... [PASS in 0.05s]
08:58:12 | 5 of 8 START test not_null_taxi_trips_model_passenger_count.......... [RUN]
08:58:12 | 5 of 8 FAIL 381 not_null_taxi_trips_model_passenger_count............ [FAIL 381 in 0.05s]
08:58:12 | 6 of 8 START test not_null_taxi_trips_model_pickup_datetime.......... [RUN]
08:58:12 | 6 of 8 PASS not_null_taxi_trips_model_pickup_datetime................ [PASS in 0.06s]
08:58:12 | 7 of 8 START test not_null_taxi_trips_model_pickup_location_id....... [RUN]
08:58:12 | 7 of 8 PASS not_null_taxi_trips_model_pickup_location_id............. [PASS in 0.06s]
08:58:12 | 8 of 8 START test not_null_taxi_trips_model_vendor_id................ [RUN]
08:58:12 | 8 of 8 PASS not_null_taxi_trips_model_vendor_id...................... [PASS in 0.05s]
08:58:12 |
08:58:12 | Finished running 8 tests in 0.63s.

Completed with 2 errors and 0 warnings:

Failure in test accepted_values_taxi_trips_model_vendor_id__1__2__4 (models/taxi/staging/schema.yml)
  Got 1 result, configured to fail if != 0

  compiled SQL at target/compiled/my_new_project/models/taxi/staging/schema.yml/schema_test/accepted_values_taxi_trips_model_vendor_id__1__2__4.sql

Failure in test not_null_taxi_trips_model_passenger_count (models/taxi/staging/schema.yml)
  Got 381 results, configured to fail if != 0

  compiled SQL at target/compiled/my_new_project/models/taxi/staging/schema.yml/schema_test/not_null_taxi_trips_model_passenger_count.sql

Done. PASS=6 WARN=0 ERROR=2 SKIP=0 TOTAL=8
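Both failures point at quirks in the raw data: at least one trip has a vendor_id outside the accepted list, and 381 trips have a null passenger_count. You can inspect the offending rows yourself (dbt also saves the compiled test SQL at the target/compiled paths shown above):
%sql select distinct vendor_id from taxi_trips_model where vendor_id not in (1, 2, 4);
%sql select count(*) from taxi_trips_model where passenger_count is null;
To make the tests pass, we filter these rows out in the staging models: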
%%writefile ./models/taxi/staging/taxi_trips_model.sql
select
vendor_id,
pickup_datetime,
dropoff_datetime,
passenger_count,
pickup_location_id,
dropoff_location_id,
fare_amount
from {{ source('source', 'yellow_tripdata_sample_2019_01') }} tt
where tt.vendor_id in ('1','2','4')
and tt.passenger_count is not null
Overwriting ./models/taxi/staging/taxi_trips_model.sql
%%writefile ./models/taxi/staging/taxi_zone_lookup_model.sql
select
locationid,
borough,
zone,
service_zone
from {{ source('source', 'taxi_zone_lookup') }} lk
where lk.zone is not null
and lk.service_zone is not null
Overwriting ./models/taxi/staging/taxi_zone_lookup_model.sql
!dbt run
Running with dbt=0.21.1
Found 5 models, 16 tests, 0 snapshots, 0 analyses, 162 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

08:59:07 | Concurrency: 1 threads (target='dev')
08:59:07 |
08:59:07 | 1 of 5 START table model public.my_first_dbt_model................... [RUN]
08:59:08 | 1 of 5 OK created table model public.my_first_dbt_model.............. [SELECT 2 in 0.12s]
08:59:08 | 2 of 5 START view model public.taxi_trips_model...................... [RUN]
08:59:08 | 2 of 5 OK created view model public.taxi_trips_model................. [CREATE VIEW in 0.09s]
08:59:08 | 3 of 5 START view model public.taxi_zone_lookup_model................ [RUN]
08:59:08 | 3 of 5 OK created view model public.taxi_zone_lookup_model........... [CREATE VIEW in 0.07s]
08:59:08 | 4 of 5 START view model public.my_second_dbt_model................... [RUN]
08:59:08 | 4 of 5 OK created view model public.my_second_dbt_model.............. [CREATE VIEW in 0.06s]
08:59:08 | 5 of 5 START view model public.trips_with_borough_name_model......... [RUN]
08:59:08 | 5 of 5 OK created view model public.trips_with_borough_name_model.... [CREATE VIEW in 0.07s]
08:59:08 |
08:59:08 | Finished running 1 table model, 4 view models in 0.73s.

Completed successfully

Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
!dbt test -m taxi_trips_model
Running with dbt=0.21.1
Found 5 models, 16 tests, 0 snapshots, 0 analyses, 162 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

08:59:19 | Concurrency: 1 threads (target='dev')
08:59:19 |
08:59:19 | 1 of 8 START test accepted_values_taxi_trips_model_vendor_id__1__2__4 [RUN]
08:59:20 | 1 of 8 PASS accepted_values_taxi_trips_model_vendor_id__1__2__4...... [PASS in 0.25s]
08:59:20 | 2 of 8 START test not_null_taxi_trips_model_dropoff_datetime......... [RUN]
08:59:20 | 2 of 8 PASS not_null_taxi_trips_model_dropoff_datetime............... [PASS in 0.18s]
08:59:20 | 3 of 8 START test not_null_taxi_trips_model_dropoff_location_id...... [RUN]
08:59:20 | 3 of 8 PASS not_null_taxi_trips_model_dropoff_location_id............ [PASS in 0.25s]
08:59:20 | 4 of 8 START test not_null_taxi_trips_model_fare_amount.............. [RUN]
08:59:20 | 4 of 8 PASS not_null_taxi_trips_model_fare_amount.................... [PASS in 0.08s]
08:59:20 | 5 of 8 START test not_null_taxi_trips_model_passenger_count.......... [RUN]
08:59:20 | 5 of 8 PASS not_null_taxi_trips_model_passenger_count................ [PASS in 0.07s]
08:59:20 | 6 of 8 START test not_null_taxi_trips_model_pickup_datetime.......... [RUN]
08:59:20 | 6 of 8 PASS not_null_taxi_trips_model_pickup_datetime................ [PASS in 0.07s]
08:59:20 | 7 of 8 START test not_null_taxi_trips_model_pickup_location_id....... [RUN]
08:59:20 | 7 of 8 PASS not_null_taxi_trips_model_pickup_location_id............. [PASS in 0.06s]
08:59:20 | 8 of 8 START test not_null_taxi_trips_model_vendor_id................ [RUN]
08:59:21 | 8 of 8 PASS not_null_taxi_trips_model_vendor_id...................... [PASS in 0.06s]
08:59:21 |
08:59:21 | Finished running 8 tests in 1.25s.

Completed successfully

Done. PASS=8 WARN=0 ERROR=0 SKIP=0 TOTAL=8