#!/usr/bin/env python
# coding: utf-8

# >### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
# >*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=employee)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=employee) to leverage the power of whylogs and WhyLabs together!*

# # Employee Dataset - Usage Example

# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/datasets/employee.ipynb)

# This is an example demonstrating the usage of the Employee Dataset.
#
# For more information about the dataset itself, check the documentation at:
# https://whylogs.readthedocs.io/en/latest/datasets/employee.html

# ## Installing the datasets module

# In[1]:


# Note: you may need to restart the kernel to use updated packages.
get_ipython().run_line_magic('pip', "install 'whylogs[datasets]'")


# ## Loading the Dataset
#
# You can load the dataset of your choice by calling it from the `datasets` module:

# In[2]:


from whylogs.datasets import Employee

dataset = Employee(version="base")


# If no `version` parameter is passed, the default version is `base`.

# This will create a folder named `whylogs_data` in the current directory, containing the csv files for the Employee Dataset. If the files already exist, the module will not download them again.

# ## Discovering Information

# To see which versions are available for a given dataset, call:

# In[3]:


Employee.describe_versions()


# To get the overall description of the dataset:

# In[4]:


print(Employee.describe()[:1000])


# Note: the output was truncated to the first 1000 characters, as `describe()` returns a rather lengthy description.

# ## Getting Baseline Data
#
# You can access data from two different partitions: the baseline dataset and the production dataset.
#
# The baseline can be accessed as a whole, whereas the production dataset can be accessed in periodic batches, defined by the user.
#
# To get a `baseline` object, just call `dataset.get_baseline()`:

# In[5]:


from whylogs.datasets import Employee

dataset = Employee()

baseline = dataset.get_baseline()


# `baseline` exposes six attributes: one timestamp and five dataframes.
#
# - timestamp: the batch's timestamp (at the start)
# - data: the complete dataframe
# - features: input features
# - target: output feature(s)
# - prediction: output prediction and, possibly, features such as uncertainty, confidence, probability
# - extra: metadata features that do not fit any of the previous categories, but still contain relevant information about the data.
#
# The Employee dataset is a non-ML dataset, so the `prediction` and `target` dataframes will be empty.

# In[6]:


baseline.timestamp


# In[7]:


baseline.features.head()


# ## Setting Parameters
#
# With `set_parameters`, you can specify the timestamps for both baseline and production datasets, as well as the production interval.
#
# By default, the timestamp is set as:
# - Current date for baseline dataset
# - Tomorrow's date for production dataset
#
# These timestamps can be set by the user to any given day, including the dataset's original date.
#
# The `production_interval` defines the interval for each batch: '1d' means that we will have daily batches, while '7d' would mean weekly batches.
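
# As a rough illustration of how an interval string such as '7d' maps to batch start times, here is a minimal sketch that uses only the standard library. `batch_start_times` is a hypothetical helper written for this example. It is not part of whylogs and may not match how the library computes batches internally.

# In[ ]:


from datetime import datetime, timedelta, timezone


def batch_start_times(start, interval, number_batches):
    """Return the start timestamp of each batch for an 'Xd' interval string (hypothetical helper)."""
    # Accept only strings such as '1d' or '7d': digits followed by a trailing 'd'.
    if not interval.endswith("d") or not interval[:-1].isdigit():
        raise ValueError("interval must look like '1d' or '7d', got {!r}".format(interval))
    days = int(interval[:-1])
    if days < 1:
        raise ValueError("interval must span at least one day")
    step = timedelta(days=days)
    # Each batch starts one full interval after the previous one.
    return [start + i * step for i in range(number_batches)]


# Three weekly batches starting on an arbitrary example date.
example_start = datetime(2023, 1, 1, tzinfo=timezone.utc)
for ts in batch_start_times(example_start, "7d", 3):
    print(ts.date())
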
# To set the timestamps to the original dataset's date, set `original` to `True`, like below:

# In[8]:


# Currently, the production interval takes a str in the format "Xd", where X is an integer between 1 and 30
dataset.set_parameters(production_interval="1d", original=True)


# In[9]:


baseline = dataset.get_baseline()
baseline.timestamp


# You can set the timestamps with `baseline_timestamp` and `production_start_timestamp`, along with the production interval, like below:

# In[10]:


from datetime import datetime, timezone

now = datetime.now(timezone.utc)
dataset.set_parameters(baseline_timestamp=now, production_start_timestamp=now, production_interval="1d")


# > Note that we are passing a timezone-aware datetime in UTC. If a naive datetime is passed (no timezone information), the local time zone will be assumed, and the local timestamp will be converted to the corresponding datetime in UTC. Passing a naive datetime will trigger a warning, letting you know of this behavior.

# Note that if both `original` and a timestamp (baseline or production) are passed simultaneously, the defined timestamp will be overwritten by the original dataset timestamp.

# ## Getting Inference Data #1 - By Date

# You can get production data in two different ways. The first is to specify the exact date you want, which will return a single batch:

# In[11]:


batch = dataset.get_production_data(target_date=now)


# You can access the attributes just as shown before:

# In[12]:


batch.timestamp


# In[13]:


batch.data


# ## Getting Inference Data #2 - By Number of Batches
#
# The second way is to specify the number of batches you want and also the date for the first batch.
#
# You can then iterate over the returned object to get the batches and use each one any way you want.
# Here's an example that retrieves daily batches for a period of 5 days and logs each one with __whylogs__, saving the binary profiles to disk:

# In[14]:


import whylogs as why

batches = dataset.get_production_data(number_batches=5)

for batch in batches:
    print("logging batch of size {} for {}".format(len(batch.data), batch.timestamp))
    profile = why.log(batch.data).profile()
    profile.set_dataset_timestamp(batch.timestamp)
    profile.view().write("batch_{}".format(batch.timestamp))
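

# The loop above puts the batch timestamp directly into the file name. Assuming `batch.timestamp` is a `datetime`, its string form contains spaces and colons, and colons are not allowed in file names on Windows. The sketch below shows one possible way to build a more portable name. `profile_filename` is a hypothetical helper written for this example, not part of whylogs, and the `.bin` extension is an arbitrary choice.

# In[ ]:


from datetime import datetime, timezone


def profile_filename(timestamp, prefix="batch"):
    """Build a file name from a datetime using only digits, letters, '_' and '.' (hypothetical helper)."""
    # Normalize to UTC first. Note that a naive datetime is interpreted as local time by astimezone().
    utc_timestamp = timestamp.astimezone(timezone.utc)
    # Compact ISO 8601 form without separators, e.g. 20230115T000000Z.
    return "{}_{}.bin".format(prefix, utc_timestamp.strftime("%Y%m%dT%H%M%SZ"))


# Arbitrary example date, used only to show the resulting name.
example_timestamp = datetime(2023, 1, 15, tzinfo=timezone.utc)
print(profile_filename(example_timestamp))  # prints batch_20230115T000000Z.bin

# Inside the logging loop, the name could then be passed to `write`, for example `profile.view().write(profile_filename(batch.timestamp))`.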