🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
This is an example demonstrating the usage of the Ecommerce Dataset.
For more information about the dataset itself, check the documentation at: https://whylogs.readthedocs.io/en/latest/datasets/ecommerce.html
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[datasets]'
You can load the dataset of your choice by calling it from the datasets module:
from whylogs.datasets import Ecommerce
dataset = Ecommerce(version="base")
If no version parameter is passed, the default version is base.
This will create a folder in the current directory named whylogs_data with the CSV files for the Ecommerce Dataset. If the files already exist, the module will not download them again.
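If you want to confirm what was downloaded, you can simply list the contents of that folder (a minimal sketch that only relies on the whylogs_data folder name mentioned above):
from pathlib import Path

# List the CSV files the datasets module has downloaded so far
for csv_file in sorted(Path("whylogs_data").glob("*.csv")):
    print(csv_file.name)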
To see which versions are available for a given dataset, you can call:
Ecommerce.describe_versions()
('base',)
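Since describe_versions() returns a tuple, you can also pick a version programmatically instead of hard-coding the string; a small sketch using only the calls shown above:
# Use the first (and currently only) available version of the dataset
latest_version = Ecommerce.describe_versions()[0]
dataset = Ecommerce(version=latest_version)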
To access the overall description of the dataset:
print(Ecommerce.describe()[:1000])
Ecommerce Dataset ================= The Ecommerce dataset contains transaction information of several products for a popular grocery supermarket in India. It contains features such as the product's description, category, market price and user rating. The original data was sourced from Kaggle's [BigBasket Entire Product List](https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints). From the source data additional transformations were made, such as: oversampling and feature creation/engineering. License: CC BY-NC-SA 4.0 Usage ----- You can follow this guide to see how to use the ecommerce dataset: .. toctree:: :maxdepth: 1 ../examples/datasets/ecommerce Versions and Data Partitions ---------------------------- Currently the dataset contains one version: **base**. The task for the base version is to classify wether an incoming product should be provided a discount, given product features such as history of items sold, user rating, catego
Note: the output was truncated to the first 1000 characters, as describe() will print a rather lengthy description.
You can access data from two different partitions: the baseline dataset and the inference dataset.
The baseline can be accessed as a whole, whereas the inference dataset can be accessed in periodic batches, defined by the user.
To get a baseline object, just call dataset.get_baseline():
from whylogs.datasets import Ecommerce
dataset = Ecommerce()
baseline = dataset.get_baseline()
baseline will contain different attributes - one timestamp and five dataframes.
baseline.timestamp
datetime.datetime(2022, 9, 12, 0, 0, tzinfo=datetime.timezone.utc)
baseline.features.head()
date | product | sales_last_week | market_price | rating | category
---|---|---|---|---|---
2022-09-12 00:00:00+00:00 | Wood - Centre Filled Bar Infused With Dark Mou... | 1 | 350.0 | 4.500000 | Snacks and Branded Foods |
2022-09-12 00:00:00+00:00 | Toasted Almonds | 1 | 399.0 | 3.944479 | Gourmet and World Food |
2022-09-12 00:00:00+00:00 | Instant Thai Noodles - Hot & Spicy Tomyum | 1 | 95.0 | 3.300000 | Gourmet and World Food |
2022-09-12 00:00:00+00:00 | Thokku - Vathakozhambu | 1 | 336.0 | 4.300000 | Snacks and Branded Foods |
2022-09-12 00:00:00+00:00 | Beetroot Powder | 1 | 150.0 | 3.944479 | Gourmet and World Food |
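A common next step is to profile the baseline so it can serve as a reference for the inference batches logged later on. This sketch assumes the baseline batch exposes a data dataframe, just like the inference batches shown further below:
import whylogs as why

# Profile the whole baseline dataframe and persist it as a reference profile
baseline_results = why.log(baseline.data)
baseline_profile = baseline_results.profile()
baseline_profile.set_dataset_timestamp(baseline.timestamp)
baseline_profile.view().write("baseline_profile.bin")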
With set_parameters, you can specify the timestamps for both the baseline and inference datasets, as well as the inference interval.
By default, the baseline timestamp is set to the current date and the inference start timestamp to the following day. These timestamps can be set by the user to any given day, including the dataset's original date.
The inference_interval defines the interval for each batch: '1d' means that we will have daily batches, while '7d' would mean weekly batches.
To set the timestamps to the original dataset's date, set original to True, like below:
# Currently, the inference interval takes a str in the format "Xd", where X is an integer between 1-30
dataset.set_parameters(inference_interval="1d", original=True)
baseline = dataset.get_baseline()
baseline.timestamp
datetime.datetime(2022, 8, 9, 0, 0, tzinfo=datetime.timezone.utc)
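If weekly batches suit your use case better, the same call accepts a 7-day interval. This is a quick sketch that stays within the documented 1-30 day range and then switches back to the daily setting used in the rest of this example:
# Weekly inference batches, keeping the dataset's original timestamps
dataset.set_parameters(inference_interval="7d", original=True)

# Switch back to daily batches for the remainder of this example
dataset.set_parameters(inference_interval="1d", original=True)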
You can set the timestamps by using the baseline_timestamp and inference_start_timestamp parameters, and the inference interval, like below:
from datetime import datetime, timezone
now = datetime.now(timezone.utc)
dataset.set_parameters(baseline_timestamp=now, inference_start_timestamp=now, inference_interval="1d")
Note that we are passing the datetime converted to the UTC time zone. If a naive datetime is passed (with no time zone information), the local time zone will be assumed. The local timestamp, however, will be converted to the proper datetime in the UTC time zone. Passing a naive datetime will trigger a warning letting you know of this behavior.
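For instance, the call below passes a naive datetime; it is expected to work, but should emit the warning just described (a minimal sketch of that behavior):
from datetime import datetime

# No tzinfo here: whylogs assumes the local time zone, converts it to UTC,
# and warns about the conversion
naive_now = datetime.now()
dataset.set_parameters(baseline_timestamp=naive_now, inference_start_timestamp=naive_now, inference_interval="1d")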
Note that if both original and a timestamp (baseline or inference) are passed simultaneously, the defined timestamp will be overwritten by the original dataset timestamp.
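The sketch below illustrates that precedence: a custom timestamp is passed together with original=True, and the resulting baseline timestamp falls back to the dataset's original date:
# The explicit baseline_timestamp is ignored because original=True takes precedence
dataset.set_parameters(baseline_timestamp=now, inference_start_timestamp=now, inference_interval="1d", original=True)
print(dataset.get_baseline().timestamp)  # expected: the original 2022-08-09 date shown earlier

# Restore the current-date timestamps used in the rest of this example
dataset.set_parameters(baseline_timestamp=now, inference_start_timestamp=now, inference_interval="1d")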
You can get inference data in two different ways. The first is to specify the exact date you want, which will return a single batch:
batch = dataset.get_inference_data(target_date=now)
You can access the attributes just as shown before:
batch.timestamp
datetime.datetime(2022, 9, 12, 0, 0, tzinfo=datetime.timezone.utc)
batch.data
date | product | sales_last_week | market_price | rating | category | category.Baby Care | category.Bakery, Cakes and Dairy | category.Beauty and Hygiene | category.Beverages | category.Cleaning and Household | category.Eggs, Meat and Fish | category.Foodgrains, Oil and Masala | category.Fruits and Vegetables | category.Gourmet and World Food | category.Kitchen, Garden and Pets | category.Snacks and Branded Foods | output_discount | output_prediction | output_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2022-09-12 00:00:00+00:00 | 1-2-3 Noodles - Veg Masala Flavour | 2 | 12.0 | 4.200000 | Snacks and Branded Foods | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1.000000 |
2022-09-12 00:00:00+00:00 | Jaggery Powder - Organic, Sulphur Free | 1 | 280.0 | 3.996552 | Gourmet and World Food | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.571833 |
2022-09-12 00:00:00+00:00 | Pudding - Assorted | 3 | 50.0 | 4.400000 | Gourmet and World Food | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.600000 |
2022-09-12 00:00:00+00:00 | Perfectly Moist Dark Chocolate Fudge Cake Mix ... | 1 | 495.0 | 4.000000 | Gourmet and World Food | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.517833 |
2022-09-12 00:00:00+00:00 | Pasta/Spaghetti Spoon - Nylon, Silicon Handle,... | 1 | 299.0 | 3.732046 | Kitchen, Garden and Pets | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0.950000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-09-12 00:00:00+00:00 | Premium Fish Fillet | 1 | 250.0 | 3.931378 | Eggs, Meat and Fish | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.910000 |
2022-09-12 00:00:00+00:00 | Organic Fennel & Nut Delight Laddoo - Low Carb... | 1 | 499.0 | 1.700000 | Snacks and Branded Foods | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.622333 |
2022-09-12 00:00:00+00:00 | Steel Storage Deep Dabba/ Container Set With P... | 1 | 695.0 | 3.600000 | Kitchen, Garden and Pets | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0.990000 |
2022-09-12 00:00:00+00:00 | Venezia Large Bowl - Tempered Glass | 2 | 495.0 | 3.813672 | Kitchen, Garden and Pets | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0.860000 |
2022-09-12 00:00:00+00:00 | Cologne - Tattoo For Men | 1 | 799.0 | 4.000000 | Beauty and Hygiene | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0.585714 |
4133 rows × 19 columns
batch.prediction.head()
date | output_prediction | output_score
---|---|---
2022-09-12 00:00:00+00:00 | 0 | 1.000000 |
2022-09-12 00:00:00+00:00 | 0 | 0.571833 |
2022-09-12 00:00:00+00:00 | 1 | 0.600000 |
2022-09-12 00:00:00+00:00 | 1 | 0.517833 |
2022-09-12 00:00:00+00:00 | 1 | 0.950000 |
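Since batch.data carries both the model outputs and the output_discount column, you can also run quick sanity checks on it with plain pandas. The sketch below assumes output_discount is the ground-truth label described in the dataset summary (whether the product should be given a discount):
# Fraction of rows where the predicted class matches the discount label
agreement = (batch.data["output_prediction"] == batch.data["output_discount"]).mean()
print("prediction/label agreement for this batch: {:.3f}".format(agreement))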
The second way is to specify the number of batches you want, with the first batch starting at the inference start timestamp set earlier.
You can then iterate over the returned object to get the batches and use each one any way you want. Here's an example that retrieves daily batches for a period of 5 days and logs each one with whylogs, saving the binary profiles to disk:
import whylogs as why
batches = dataset.get_inference_data(number_batches=5)
for batch in batches:
    print("logging batch of size {} for {}".format(len(batch.data), batch.timestamp))
    profile = why.log(batch.data).profile()
    profile.set_dataset_timestamp(batch.timestamp)
    profile.view().write("batch_{}".format(batch.timestamp))
logging batch of size 4133 for 2022-09-12 00:00:00+00:00
logging batch of size 4193 for 2022-09-13 00:00:00+00:00
logging batch of size 4136 for 2022-09-14 00:00:00+00:00
logging batch of size 4130 for 2022-09-15 00:00:00+00:00
logging batch of size 4131 for 2022-09-16 00:00:00+00:00
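After the loop finishes, profile still refers to the result for the last batch, so you can take a quick look at its metrics before moving on. This is a small sketch; the exact metric columns you get back depend on your whylogs version:
# Summarize the last logged profile as a pandas dataframe of per-column metrics
summary = profile.view().to_pandas()
print(summary.head())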