#!/usr/bin/env python
# coding: utf-8

# >### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
# >*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=weather)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=weather) to leverage the power of whylogs and WhyLabs together!*

# # Weather Forecast Dataset - Usage Example
#
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/datasets/weather.ipynb)
#
# This is an example demonstrating the usage of the Weather Forecast Dataset.
#
# For more information about the dataset itself, check the documentation at:
# https://whylogs.readthedocs.io/en/latest/datasets/weather.html

# ## Installing the datasets module
#
# Uncomment the cell below if you don't have the `datasets` module installed:

# In[1]:

# Note: you may need to restart the kernel to use updated packages.
get_ipython().run_line_magic('pip', "install 'whylogs[datasets]'")


# ## Loading the Dataset
#
# You can load the dataset of your choice by calling it from the `datasets` module:

# In[2]:

from whylogs.datasets import Weather

dataset = Weather(version="in_domain")


# This will create a folder in the current directory named `whylogs_data` with the csv files for the Weather Dataset. If the files already exist, the module will not download them again.
#
# Notice we're specifying the version of the dataset. A dataset can have multiple versions that serve different purposes. In this case, the "in_domain" version has data from the same domain in both the baseline and inference subsets (data from the same set of regions - tropical, dry, polar, etc.).
#
# If we're interested in assessing drift issues, the "out_domain" version could be used instead, in which the inference subset contains out-of-domain data when compared to the baseline.
#
# Similarly, datasets could have other versions for other purposes, such as assessing data quality or outlier detection strategies.

# ## Discovering Information
#
# To list the available versions for a given dataset, you can call:

# In[3]:

Weather.describe_versions()


# To get an overall description of the dataset:

# In[4]:

print(Weather.describe()[:1000])

# note: the output was truncated to the first 1000 characters, as `describe()` prints a rather lengthy description.


# ## Getting Baseline Data
#
# You can access data from two different partitions: the baseline dataset and the inference dataset.
#
# The baseline can be accessed as a whole, whereas the inference dataset can be accessed in periodic batches, defined by the user.
#
# To get a `baseline` object, just call `dataset.get_baseline()`:

# In[5]:

from whylogs.datasets import Weather

dataset = Weather(version="out_domain")

baseline = dataset.get_baseline()


# `baseline` contains different attributes - one timestamp and five dataframes (each can be inspected directly, as sketched below):
#
# - timestamp: the batch's timestamp (at the start)
# - data: the complete dataframe
# - features: input features
# - target: output feature(s)
# - prediction: output prediction and, possibly, related features such as uncertainty, confidence, or probability
# - misc: metadata features that don't fall into any of the previous categories, but still contain relevant information about the data.
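# The cell below is a minimal sketch of inspecting these attributes, assuming each one
# (other than the timestamp) is a pandas DataFrame as described above; the exact column
# names depend on the dataset version.

print(baseline.data.shape)         # complete dataframe: (rows, columns)
print(baseline.features.columns)   # input feature names
print(baseline.target.head())      # output feature(s)
print(baseline.prediction.head())  # prediction column(s)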
# In[6]:

baseline.timestamp


# In[7]:

baseline.extra.head()


# ## Setting Parameters
#
# With `set_parameters`, you can specify the timestamps for both the baseline and inference datasets, as well as the inference interval.
#
# By default, the timestamps are set as:
# - Current date for the baseline dataset
# - Tomorrow's date for the inference dataset
#
# These timestamps can be set by the user to any given day, including the dataset's original date.
#
# The `inference_interval` defines the interval for each batch: '1d' means that we will have daily batches, while '7d' would mean weekly batches.

# To set the timestamps to the original dataset's date, set `original` to True, like below:

# In[8]:

# Currently, the inference interval takes a str in the format "Xd", where X is an integer between 1-30
dataset.set_parameters(inference_interval="1d", original=True)


# In[9]:

baseline = dataset.get_baseline()
baseline.timestamp


# You can set the timestamps with `baseline_timestamp` and `inference_start_timestamp`, and the inference interval, like below:

# In[10]:

from datetime import datetime, timezone

now = datetime.now(timezone.utc)
dataset.set_parameters(baseline_timestamp=now, inference_start_timestamp=now, inference_interval="1d")


# > Note that we are passing the datetime converted to the UTC timezone. If a naive datetime is passed (with no timezone information), the local time zone is assumed, but the timestamp will still be converted to the proper datetime in UTC. Passing a naive datetime will trigger a warning, letting you know of this behavior.

# Note that if both `original` and a timestamp (baseline or inference) are passed simultaneously, the given timestamp will be overwritten by the original dataset timestamp.

# ## Getting Inference Data #1 - By Date
#
# You can get inference data in two different ways. The first is to specify the exact date you want, which will return a single batch:

# In[11]:

batch = dataset.get_inference_data(target_date=now)


# You can access the attributes just as shown before:

# In[12]:

batch.timestamp


# In[13]:

batch.data


# In[14]:

batch.prediction.head()


# ## Getting Inference Data #2 - By Number of Batches
#
# The second way is to specify the number of batches you want, along with the date for the first batch.
#
# You can then iterate over the returned object to get the batches and use each one however you want. Here's an example that retrieves daily batches for a period of 5 days and logs each one with __whylogs__, saving the binary profiles to disk:

# In[15]:

import whylogs as why

batches = dataset.get_inference_data(number_batches=5)

for batch in batches:
    print("logging batch of size {} for {}".format(len(batch.data), batch.timestamp))
    profile = why.log(batch.data).profile()
    profile.set_dataset_timestamp(batch.timestamp)
    profile.view().write("batch_{}".format(batch.timestamp))
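# As a quick follow-up, any profile can be summarized as a pandas DataFrame by calling
# `to_pandas()` on its view, which is handy for spot-checking the tracked metrics.
# This is a minimal sketch that profiles the baseline data in memory (nothing is
# written to disk here):

import whylogs as why

baseline_profile_view = why.log(baseline.data).view()
print(baseline_profile_view.to_pandas().head())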