🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
In this example, we will show how you can perform data validation for profiles that were created from a PySpark dataframe. This example is an advanced scenario that combines three topics. If you want to know more about each of these topics, please refer to the following tutorials:
In this example, we will:
We will read data made available from Airbnb. It's a listing dataset from the city of Rio de Janeiro, Brazil. We'll access data that was adapted from the following location: "http://data.insideairbnb.com/brazil/rj/rio-de-janeiro/2021-01-26/data/listings.csv.gz"
In this example, we want to do some basic data validation. Let's define our rules:

- id (long): should not contain any missing values
- listing_url (string): should not contain any missing values
- last_review (string): should not contain any missing values
- last_review (string): date should be in the format YYYY-MM-DD
- listing_url (string): should be a URL from Airbnb (starting with https://www.airbnb.com/rooms/)
- latitude and longitude (double): should be within the range of -24 to -22 and -44 to -43, respectively
- room_type (string): frequent strings should be in the set of expected values
- reviews_per_month (double): standard deviation should be in the expected range

As we want to enable users to install exactly what they need from whylogs, the pyspark integration comes as an extra dependency. To make it available, simply uncomment and run the following cell:
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[spark]'
Here we will initialize a SparkSession. We'll also set the pyarrow execution config, because it makes our methods more performant.

IMPORTANT: Make sure you have Spark 3.0+ available in your environment, as our implementation relies on it for a smoother integration.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()
arrow_config_key = "spark.sql.execution.arrow.pyspark.enabled"
spark.conf.set(arrow_config_key, "true")
This is a relatively small dataset, so we can run this example locally.
from pyspark import SparkFiles
data_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Listings/airbnb_listings.parquet"
spark.sparkContext.addFile(data_url)
spark_dataframe = spark.read.parquet(SparkFiles.get("airbnb_listings.parquet"))
spark_dataframe.show(n=1, vertical=True)
-RECORD 0--------------------------------------
 name                   | Very Nice 2Br in ...
 description            | Discounts for lon...
 listing_url            | https://www.airbn...
 last_review            | 2020-12-26
 number_of_reviews_ltm  | 13
 number_of_reviews_l30d | 0
 id                     | 17878
 latitude               | -22.96592
 longitude              | -43.17896
 availability_365       | 286
 bedrooms               | 2.0
 bathrooms              | null
 reviews_per_month      | 2.01
 room_type              | Entire home/apt
only showing top 1 row
spark_dataframe.printSchema()
root
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- number_of_reviews_ltm: long (nullable = true)
 |-- number_of_reviews_l30d: long (nullable = true)
 |-- id: long (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- availability_365: long (nullable = true)
 |-- bedrooms: double (nullable = true)
 |-- bathrooms: double (nullable = true)
 |-- reviews_per_month: double (nullable = true)
 |-- room_type: string (nullable = true)
To create a profile with the standard metrics, we can simply call collect_dataset_profile_view from whylogs' PySpark extra module. However, if we look at our defined set of constraints, two of them need to be checked against individual values:

- last_review (string): date should be in the format YYYY-MM-DD
- listing_url (string): should be a URL from Airbnb (starting with https://www.airbnb.com/rooms/)

As opposed to the other constraints, which can be checked against aggregate metrics, these two need to be checked against individual values. For that, we will create two condition count metrics. Later on, we will create metric constraints based on these metrics.
import datetime
from whylogs.core.relations import Predicate
from typing import Any
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.schema import DeclarativeSchema
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
def date_format(x: Any) -> bool:
    # Returns True if the value parses as a date in YYYY-MM-DD format
    fmt = "%Y-%m-%d"
    try:
        datetime.datetime.strptime(x, fmt)
        return True
    except ValueError:
        return False

last_review_conditions = {"is_date_format": Condition(Predicate().is_(date_format))}
listing_url_conditions = {"url_matches_airbnb_domain": Condition(Predicate().matches(r"^https://www\.airbnb\.com/rooms"))}
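Before wiring these conditions into a schema, it can be useful to spot-check the predicate logic against a few hand-picked values. The snippet below is illustrative only; the sample strings are made up and not taken from the dataset:

import re

# Illustrative spot-check of the condition logic on made-up sample values
assert date_format("2020-12-26") is True       # valid YYYY-MM-DD date
assert date_format("26/12/2020") is False      # wrong format should be rejected
assert re.match(r"^https://www\.airbnb\.com/rooms", "https://www.airbnb.com/rooms/17878") is not None
assert re.match(r"^https://www\.airbnb\.com/rooms", "https://example.com/rooms/17878") is None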
Now that we have our set of conditions for both columns, we can create the condition count metrics. We do so by creating a schema with the standard resolver and then extending it with add_resolver_spec, passing a ConditionCountMetricSpec that wraps our conditions:
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="last_review", metrics=[ConditionCountMetricSpec(last_review_conditions)])
schema.add_resolver_spec(column_name="listing_url", metrics=[ConditionCountMetricSpec(listing_url_conditions)])
To know more about condition count metrics and how to use them, check out the Metric Constraints with Condition Count Metrics example.
Now we can pass the schema to our logger through collect_dataset_profile_view:
from whylogs.api.pyspark.experimental import collect_dataset_profile_view
dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe, schema=schema)
This will create a profile with the standard metrics, as well as the two condition count metrics that we created. As a sanity check, let's see the metrics for the last_review
column:
dataset_profile_view.get_column("last_review").get_metric_names()
['types', 'cardinality', 'counts', 'distribution', 'frequent_items', 'condition_count']
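If you want to peek at the raw pass/fail counts gathered for a condition before building any constraints, you can inspect the metric's summary directly. This is a minimal sketch; the exact summary keys may vary across whylogs versions:

# Inspect the raw condition counts gathered during profiling (summary keys may vary by whylogs version)
condition_metric = dataset_profile_view.get_column("last_review").get_metric("condition_count")
print(condition_metric.to_summary_dict())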
We have all that we need to build our set of constraints. We will use out-of-the-box factory constraints to do that:
from whylogs.core.constraints.factories import condition_meets
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import no_missing_values
from whylogs.core.constraints.factories import is_in_range
from whylogs.core.constraints.factories import stddev_between_range
from whylogs.core.constraints.factories import frequent_strings_in_reference_set
builder = ConstraintsBuilder(dataset_profile_view=dataset_profile_view)
reference_set = {"Entire home/apt", "Private room", "Shared room", "Hotel room"}
builder.add_constraint(condition_meets(column_name="last_review", condition_name="is_date_format"))
builder.add_constraint(condition_meets(column_name="listing_url", condition_name="url_matches_airbnb_domain"))
builder.add_constraint(no_missing_values(column_name="last_review"))
builder.add_constraint(no_missing_values(column_name="listing_url"))
builder.add_constraint(is_in_range(column_name="latitude",lower=-24,upper=-22))
builder.add_constraint(is_in_range(column_name="longitude",lower=-44,upper=-43))
builder.add_constraint(no_missing_values(column_name="id"))
builder.add_constraint(stddev_between_range(column_name="reviews_per_month", lower=0.8, upper=1.1))
builder.add_constraint(frequent_strings_in_reference_set(column_name="room_type", reference_set=reference_set))
constraints = builder.build()
constraints.generate_constraints_report()
[ReportResult(name='last_review meets condition is_date_format', passed=1, failed=0, summary=None),
 ReportResult(name='last_review has no missing values', passed=0, failed=1, summary=None),
 ReportResult(name='listing_url meets condition url_matches_airbnb_domain', passed=1, failed=0, summary=None),
 ReportResult(name='listing_url has no missing values', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='id has no missing values', passed=1, failed=0, summary=None),
 ReportResult(name='reviews_per_month standard deviation between 0.8 and 1.1 (inclusive)', passed=1, failed=0, summary=None),
 ReportResult(name="room_type values in set {'Shared room', 'Hotel room', 'Private room', 'Entire home/apt'}", passed=1, failed=0, summary=None)]
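If you need a single pass/fail signal, for example to gate a step in a data pipeline, the Constraints object also exposes a validate() helper that returns False when any constraint fails. A minimal sketch of how that could be used (assuming the whylogs v1 Constraints API):

# Fail fast when any constraint is violated; here the last_review missing-values check fails
if not constraints.validate():
    print("Data validation failed - inspect the constraints report before proceeding.")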
If you're interested in a more complete list of helper constraints, please check out the Constraints Suite example.
Now, we can visualize the constraints report using the Notebook Profile Visualizer:
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
Looks like we have some missing values for last_review. Other than that, the data looks good!
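As a closing note, the profile itself can be persisted so the same validation can be repeated or monitored over time. A minimal sketch, assuming the DatasetProfileView.write API available in whylogs v1 (the file name is just an example):

# Persist the profile for later inspection or monitoring (file name is illustrative)
dataset_profile_view.write("airbnb_listings_profile.bin")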