🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
In this example, we will show how you can perform data validation for profiles that were created from a PySpark dataframe. This example is an advanced scenario that combines three topics. If you want to know more about each of these topics, please refer to the following tutorials:
In this example, we will:
We will read data made available from Airbnb. It's a listing dataset from the city of Rio de Janeiro, Brazil. We'll access data that was adapted from the following location: "http://data.insideairbnb.com/brazil/rj/rio-de-janeiro/2021-01-26/data/listings.csv.gz"
In this example, we want to do some basic data validation. Let's define our rules:

- id (long): should not contain any missing values
- listing_url (string): should not contain any missing values
- last_review (string): should not contain any missing values
- last_review (string): date should be in the format YYYY-MM-DD
- listing_url (string): should be a URL from Airbnb (starting with https://www.airbnb.com/rooms/)
- latitude and longitude (double): should be within the range of -24 to -22 and -44 to -43, respectively
- room_type (string): frequent strings should be in the set of expected values
- reviews_per_month (double): standard deviation should be in the expected range

As we want to enable users to install exactly what they need from whylogs, the pyspark integration comes as an extra dependency. To make it available, simply uncomment and run the following cell:
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[spark]'
Here we will initialize a SparkSession. We'll also set the pyarrow execution config, because it makes our methods more performant.

IMPORTANT: Make sure you have Spark 3.0+ available in your environment, as our implementation relies on it for a smoother integration.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()
arrow_config_key = "spark.sql.execution.arrow.pyspark.enabled"
spark.conf.set(arrow_config_key, "true")
This is a relatively small dataset, so we can run this example locally.
from pyspark import SparkFiles
data_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Listings/airbnb_listings.parquet"
spark.sparkContext.addFile(data_url)
spark_dataframe = spark.read.parquet(SparkFiles.get("airbnb_listings.parquet"))
spark_dataframe.show(n=1, vertical=True)
-RECORD 0--------------------------------------
 name                   | Very Nice 2Br in ...
 description            | Discounts for lon...
 listing_url            | https://www.airbn...
 last_review            | 2020-12-26
 number_of_reviews_ltm  | 13
 number_of_reviews_l30d | 0
 id                     | 17878
 latitude               | -22.96592
 longitude              | -43.17896
 availability_365       | 286
 bedrooms               | 2.0
 bathrooms              | null
 reviews_per_month      | 2.01
 room_type              | Entire home/apt
only showing top 1 row
spark_dataframe.printSchema()
root
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- number_of_reviews_ltm: long (nullable = true)
 |-- number_of_reviews_l30d: long (nullable = true)
 |-- id: long (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- availability_365: long (nullable = true)
 |-- bedrooms: double (nullable = true)
 |-- bathrooms: double (nullable = true)
 |-- reviews_per_month: double (nullable = true)
 |-- room_type: string (nullable = true)
To create a profile with the standard metrics, we can simply call collect_dataset_profile_view from whylogs' PySpark extra module. However, if we look at our defined set of constraints, two of them need to be checked against individual values:

- last_review (string): date should be in the format YYYY-MM-DD
- listing_url (string): should be a URL from Airbnb (starting with https://www.airbnb.com/rooms/)

As opposed to the other constraints, which can be checked against aggregate metrics, these two need to be checked against individual values. For that, we will create two condition count metrics. Later on, we will create metric constraints based on these metrics.
import datetime
from whylogs.core.relations import Predicate
from typing import Any
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.schema import DeclarativeSchema
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
def date_format(x: Any) -> bool:
    # Returns True if the value parses as a date in YYYY-MM-DD format
    fmt = "%Y-%m-%d"
    try:
        datetime.datetime.strptime(x, fmt)
        return True
    except ValueError:
        return False

last_review_conditions = {"is_date_format": Condition(Predicate().is_(date_format))}
listing_url_conditions = {"url_matches_airbnb_domain": Condition(Predicate().matches(r"^https://www\.airbnb\.com/rooms"))}
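Before wiring these conditions into a schema, it can be useful to spot-check the predicate logic against a few hand-picked values. The snippet below is illustrative only; the sample strings are made up and not taken from the dataset:

import re

# Illustrative spot-check of the condition logic on made-up sample values
assert date_format("2020-12-26") is True       # valid YYYY-MM-DD date
assert date_format("26/12/2020") is False      # wrong format should be rejected
assert re.match(r"^https://www\.airbnb\.com/rooms", "https://www.airbnb.com/rooms/17878") is not None
assert re.match(r"^https://www\.airbnb\.com/rooms", "https://example.com/rooms/17878") is None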
Now that we have our set of conditions for both columns, we can create the condition count metrics. We do so by creating a schema with the standard resolver and then extending it with add_resolver_spec, passing a ConditionCountMetricSpec that wraps our conditions:
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="last_review", metrics=[ConditionCountMetricSpec(last_review_conditions)])
schema.add_resolver_spec(column_name="listing_url", metrics=[ConditionCountMetricSpec(listing_url_conditions)])
To know more about condition count metrics and how to use them, check out the Metric Constraints with Condition Count Metrics example.
Now we can pass the schema to our logger through collect_dataset_profile_view:
from whylogs.api.pyspark.experimental import collect_dataset_profile_view
dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe, schema=schema)
This will create a profile with the standard metrics, as well as the two condition count metrics that we created. As a sanity check, let's see the metrics for the last_review
column:
dataset_profile_view.get_column("last_review").get_metric_names()
['types', 'cardinality', 'counts', 'distribution', 'frequent_items', 'condition_count']
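If you want to peek at the raw pass/fail counts gathered for a condition before building any constraints, you can inspect the metric's summary directly. This is a minimal sketch; the exact summary keys may vary across whylogs versions:

# Inspect the raw condition counts gathered during profiling (summary keys may vary by whylogs version)
condition_metric = dataset_profile_view.get_column("last_review").get_metric("condition_count")
print(condition_metric.to_summary_dict())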
We have all that we need to build our set of constraints. We will use out-of-the-box factory constraints to do that:
from whylogs.core.constraints.factories import condition_meets
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import no_missing_values
from whylogs.core.constraints.factories import is_in_range
from whylogs.core.constraints.factories import stddev_between_range
from whylogs.core.constraints.factories import frequent_strings_in_reference_set
builder = ConstraintsBuilder(dataset_profile_view=dataset_profile_view)
reference_set = {"Entire home/apt", "Private room", "Shared room", "Hotel room"}
builder.add_constraint(condition_meets(column_name="last_review", condition_name="is_date_format"))
builder.add_constraint(condition_meets(column_name="listing_url", condition_name="url_matches_airbnb_domain"))
builder.add_constraint(no_missing_values(column_name="last_review"))
builder.add_constraint(no_missing_values(column_name="listing_url"))
builder.add_constraint(is_in_range(column_name="latitude",lower=-24,upper=-22))
builder.add_constraint(is_in_range(column_name="longitude",lower=-44,upper=-43))
builder.add_constraint(no_missing_values(column_name="id"))
builder.add_constraint(stddev_between_range(column_name="reviews_per_month", lower=0.8, upper=1.1))
builder.add_constraint(frequent_strings_in_reference_set(column_name="room_type", reference_set=reference_set))
constraints = builder.build()
constraints.generate_constraints_report()
[ReportResult(name='last_review meets condition is_date_format', passed=1, failed=0, summary=None),
 ReportResult(name='last_review has no missing values', passed=0, failed=1, summary=None),
 ReportResult(name='listing_url meets condition url_matches_airbnb_domain', passed=1, failed=0, summary=None),
 ReportResult(name='listing_url has no missing values', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='id has no missing values', passed=1, failed=0, summary=None),
 ReportResult(name='reviews_per_month standard deviation between 0.8 and 1.1 (inclusive)', passed=1, failed=0, summary=None),
 ReportResult(name="room_type values in set {'Shared room', 'Hotel room', 'Private room', 'Entire home/apt'}", passed=1, failed=0, summary=None)]
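If you need a single pass/fail signal, for example to gate a step in a data pipeline, the Constraints object also exposes a validate() helper that returns False when any constraint fails. A minimal sketch of how that could be used (assuming the whylogs v1 Constraints API):

# Fail fast when any constraint is violated; here the last_review missing-values check fails
if not constraints.validate():
    print("Data validation failed - inspect the constraints report before proceeding.")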
If you're interested in a more complete list of helper constraints, please check out the Constraints Suite example.
Now, we can visualize the constraints report using the Notebook Profile Visualizer:
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
Looks like we have some missing values for last_review. Other than that, the data looks good!
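As a closing note, the profile itself can be persisted so the same validation can be repeated or monitored over time. A minimal sketch, assuming the DatasetProfileView.write API available in whylogs v1 (the file name is just an example):

# Persist the profile for later inspection or monitoring (file name is illustrative)
dataset_profile_view.write("airbnb_listings_profile.bin")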