🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

Employee Dataset - Usage Example¶

This is an example demonstrating the usage of the Employee Dataset.

For more information about the dataset itself, see the documentation at https://whylogs.readthedocs.io/en/latest/datasets/employee.html

Installing the datasets module¶

In [1]:

# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[datasets]'

Loading the Dataset¶

You can load the dataset of your choice by calling it from the datasets module:

In [2]:

from whylogs.datasets import Employee

dataset = Employee(version="base")

If no version parameter is passed, the default version is base.

This will create a folder named whylogs_data in the current directory, containing the CSV files for the Employee Dataset. If the files already exist, the module will not download them again.
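If you want to check what has already been downloaded, you can list the CSV files under that folder. The following is a minimal sketch using only the standard library. `list_cached_csvs` is a hypothetical helper, not part of whylogs, and it assumes only what is stated above: the files are CSVs stored somewhere under a whylogs_data folder.

```python
from pathlib import Path


def list_cached_csvs(root="whylogs_data"):
    """Return the sorted CSV paths found under root, or [] if root is missing."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    # rglob also searches subfolders, since the exact layout is not assumed here.
    return sorted(root_path.rglob("*.csv"))


for csv_path in list_cached_csvs():
    print(csv_path)
```

An empty result simply means nothing has been downloaded into the current directory yet.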

Discovering Information¶

To see which versions are available for a given dataset, call:

In [3]:

Employee.describe_versions()

Out[3]:

('base',)

To get an overall description of the dataset:

In [4]:

print(Employee.describe()[:1000])

Employee Dataset
================

The employee dataset contains annual salary information for employees of an american County. It contains features related to each employee, such as employee's department, gender, salary, and hiring date.

The original data was sourced from the `employee_salaries` OpenML dataset, and can be found here: https://www.openml.org/d/42125. From the source data additional transformations were made, such as: data cleaning, feature creation and feature engineering.

License:
CC0: Public Domain

Usage
-----

You can follow this guide to see how to use the ecommerce dataset:

.. toctree::
    :maxdepth: 1

    ../examples/datasets/employee

Versions and Data Partitions
----------------------------

Currently the dataset contains one version: **base**. This dataset has no particular tasks defined, as it is aimed to explore data quality issues that are not necessarily related to ML.
The **base** version contains two partitions: **Baseline** and **Production**

base

Note: the output is truncated to the first 1000 characters because describe() returns a rather lengthy description.

Getting Baseline Data¶

You can access data from two different partitions: the baseline dataset and production dataset.

The baseline can be accessed as a whole, whereas the production dataset can be accessed in periodic batches, defined by the user.

To get a baseline object, just call dataset.get_baseline():

In [5]:

from whylogs.datasets import Employee

dataset = Employee()

baseline = dataset.get_baseline()

baseline exposes six attributes: one timestamp and five dataframes.

timestamp: the batch's timestamp (at the start)
data: the complete dataframe
features: input features
target: output feature(s)
prediction: output prediction and, possibly, features such as uncertainty, confidence, probability
extra: metadata features that are not of any of the previous categories, but still contain relevant information about the data.

The Employee dataset is a non-ML dataset, so the prediction and target dataframes will be empty.

In [6]:

baseline.timestamp

Out[6]:

datetime.datetime(2023, 2, 16, 0, 0, tzinfo=datetime.timezone.utc)

In [7]:

baseline.features.head()

Out[7]:

	employee_id	gender	overtime_pay	department	position_title	date_first_hired	year_first_hired	salary	full_time	part_time	sector
date
2023-02-16 00:00:00+00:00	8894	M	9136.78	POL	Police Sergeant	07/21/2003	2003	103506.00	1	0	Sector 3
2023-02-16 00:00:00+00:00	6920	M	0.00	FRS	Firefighter/Rescuer III	12/12/2016	2016	45261.00	1	0	Sector 1
2023-02-16 00:00:00+00:00	2265	F	0.00	LIB	Library Associate	06/27/1997	1997	25167.75	0	1	Sector 4
2023-02-16 00:00:00+00:00	8790	M	0.00	OHR	Labor Relations Advisor	10/28/2001	2001	112899.00	1	0	Sector 3
2023-02-16 00:00:00+00:00	7728	M	12516.95	DOT	Bus Operator	11/10/2014	2014	42053.42	1	0	Sector 4

Setting Parameters¶

With set_parameters, you can specify the timestamps for both baseline and production datasets, as well as the production interval.

By default, the timestamps are set to:

Current date for baseline dataset
Tomorrow's date for production dataset

You can set these timestamps to any day, including the dataset's original date.

The production_interval defines the interval for each batch: '1d' means that we will have daily batches, while '7d' would mean weekly batches.

To set the timestamps to the original dataset's date, set original to True, as shown below:

In [8]:

# Currently, the production interval takes a str in the format "Xd", where X is an integer from 1 to 30
dataset.set_parameters(production_interval="1d", original=True)
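To make the interval format concrete, here is a minimal sketch of how an "Xd" string maps to batch start times. `parse_interval` and `batch_starts` are hypothetical helpers written for illustration, not whylogs functions, and the library's own parsing may differ. The sketch assumes only the format and the 1 to 30 range described in the code comment above.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_interval(interval):
    """Turn a string such as "7d" into a timedelta; accept 1 to 30 days only."""
    match = re.fullmatch(r"(\d+)d", interval)
    if match is None:
        raise ValueError(f"expected a string like '1d', got {interval!r}")
    days = int(match.group(1))
    if not 1 <= days <= 30:
        raise ValueError(f"interval must be between 1 and 30 days, got {days}")
    return timedelta(days=days)


def batch_starts(first, interval, number_batches):
    """List the start timestamp of each consecutive batch."""
    step = parse_interval(interval)
    return [first + i * step for i in range(number_batches)]


start = datetime(2023, 2, 16, tzinfo=timezone.utc)
for ts in batch_starts(start, "7d", 3):
    print(ts.date())
```

With "7d" the three batches start on 2023-02-16, 2023-02-23 and 2023-03-02; with "1d" they would fall on consecutive days.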

In [9]:

baseline = dataset.get_baseline()
baseline.timestamp

Out[9]:

datetime.datetime(2023, 1, 16, 0, 0, tzinfo=datetime.timezone.utc)

You can also set the timestamps explicitly with baseline_timestamp and production_start_timestamp, together with the production interval, as shown below:

In [10]:

from datetime import datetime, timezone
now = datetime.now(timezone.utc)
dataset.set_parameters(baseline_timestamp=now, production_start_timestamp=now, production_interval="1d")

Note that we are passing a timezone-aware datetime in UTC. If a naive datetime (one with no timezone information) is passed, the local time zone is assumed and the value is converted to the corresponding datetime in UTC. Passing a naive datetime also triggers a warning to let you know about this behavior.
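The conversion described here can be reproduced with the standard library alone. The sketch below is an illustration of the behavior, not the code whylogs uses internally; `to_utc` is a hypothetical helper, and the warning text is made up for the example.

```python
import warnings
from datetime import datetime, timedelta, timezone


def to_utc(dt):
    """Return dt as a timezone-aware datetime in UTC."""
    if dt.tzinfo is None:
        warnings.warn("naive datetime received; assuming the local time zone")
    # astimezone() interprets a naive datetime as local time before converting.
    return dt.astimezone(timezone.utc)


aware = datetime(2023, 2, 16, 12, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_utc(aware))  # 2023-02-16 10:00:00+00:00
```

Creating timestamps with datetime.now(timezone.utc), as in the cell above, avoids the ambiguity altogether.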

Note that if both original and a timestamp (baseline or production) are passed at the same time, the timestamp you defined will be overwritten by the original dataset timestamp.

Getting Inference Data #1 - By Date¶

You can get production data in two different ways. The first is to specify the exact date you want, which will return a single batch:

In [11]:

batch = dataset.get_production_data(target_date=now)

You can access the attributes just as shown before:

In [12]:

batch.timestamp

Out[12]:

datetime.datetime(2023, 2, 16, 0, 0, tzinfo=datetime.timezone.utc)
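Although a full datetime was passed as target_date, the batch timestamp shown above falls at midnight UTC, which matches the earlier description of timestamp as marking the start of the batch. The sketch below reproduces that truncation with the standard library. `start_of_day_utc` is a hypothetical helper, and the assumption that daily batches start at midnight UTC is inferred from the output rather than confirmed by the library's documentation.

```python
from datetime import datetime, timezone


def start_of_day_utc(dt):
    """Convert dt to UTC and truncate it to midnight of that day."""
    return dt.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )


moment = datetime(2023, 2, 16, 14, 35, 12, tzinfo=timezone.utc)
print(start_of_day_utc(moment))  # 2023-02-16 00:00:00+00:00
```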

In [13]:

batch.data

Out[13]:

	employee_id	gender	overtime_pay	department	assignment_category	position_title	date_first_hired	year_first_hired	salary	full_time	part_time	sector
date
2023-02-16 00:00:00+00:00	6309	F	0.00	HHS	Fulltime-Regular	Administrative Specialist I	02/08/2016	2016	59276.91	1	0	Sector 1
2023-02-16 00:00:00+00:00	4078	M	19677.72	POL	Fulltime-Regular	Police Officer III	06/25/1990	1990	92756.70	1	0	Sector 3
2023-02-16 00:00:00+00:00	2445	F	0.00	DEP	Fulltime-Regular	Planning Specialist III	06/30/2014	2014	80499.91	1	0	Sector 4
2023-02-16 00:00:00+00:00	2548	F	0.00	REC	Fulltime-Regular	Recreation Specialist	03/24/2014	2014	69842.16	1	0	Sector 2
2023-02-16 00:00:00+00:00	5949	M	45267.21	DGS	Fulltime-Regular	Property Manager II	05/07/1990	1990	99870.24	1	0	Sector 3
...	...	...	...	...	...	...	...	...	...	...	...	...
2023-02-16 00:00:00+00:00	8594	F	0.00	CCL	Fulltime-Regular	Confidential Aide	05/05/2003	2003	146664.49	1	0	Sector 4
2023-02-16 00:00:00+00:00	3479	M	17711.08	FRS	Fulltime-Regular	Firefighter/Rescuer III	02/27/2012	2012	60618.00	1	0	Sector 1
2023-02-16 00:00:00+00:00	6067	F	0.00	HHS	Parttime-Regular	School Health Room Technician I	08/06/2012	2012	36797.13	0	1	Sector 1
2023-02-16 00:00:00+00:00	5788	M	9526.23	DLC	Fulltime-Regular	Liquor Store Clerk II	04/04/2000	2000	57760.61	1	0	Sector 2
2023-02-16 00:00:00+00:00	4375	M	1020.28	DOT	Fulltime-Regular	Motor Pool Attendant	05/27/2008	2008	36493.52	1	0	Sector 4

916 rows × 12 columns

Getting Inference Data #2 - By Number of Batches¶

The second way is to specify the number of batches you want and also the date for the first batch.

You can then iterate over the returned object and use each batch any way you want. Here's an example that retrieves daily batches for a period of 5 days and logs each one with whylogs, saving the binary profiles to disk:

In [14]:

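import whylogs as why

# Request five consecutive batches, starting at the production start timestamp
batches = dataset.get_production_data(number_batches=5)

for batch in batches:
    print("logging batch of size {} for {}".format(len(batch.data), batch.timestamp))
    # Profile the batch and stamp the profile with the batch's timestamp
    profile = why.log(batch.data).profile()
    profile.set_dataset_timestamp(batch.timestamp)
    # Write the binary profile to disk, one file per batch
    profile.view().write("batch_{}".format(batch.timestamp))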

logging batch of size 916 for 2023-02-16 00:00:00+00:00
logging batch of size 818 for 2023-02-17 00:00:00+00:00
logging batch of size 891 for 2023-02-18 00:00:00+00:00
logging batch of size 935 for 2023-02-19 00:00:00+00:00
logging batch of size 854 for 2023-02-20 00:00:00+00:00
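The loop above names each profile after the full timestamp string, for example "batch_2023-02-16 00:00:00+00:00". That name contains a space and colons, and colons are not allowed in file names on Windows. If portability matters, one option is a date-only name, sketched below. `profile_name` is a hypothetical helper, and the date-only format assumes batches are at least one day apart, as they are with the "Xd" intervals used here.

```python
from datetime import datetime, timezone


def profile_name(timestamp, prefix="batch"):
    """Build a name such as 'batch_2023-02-16' from a batch timestamp."""
    return "{}_{}".format(prefix, timestamp.strftime("%Y-%m-%d"))


ts = datetime(2023, 2, 16, tzinfo=timezone.utc)
print(profile_name(ts))  # batch_2023-02-16
```

Inside the loop, the result can be passed to the same write call: profile.view().write(profile_name(batch.timestamp)).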
