#!/usr/bin/env python
# coding: utf-8

# # Inspecting Profiles
#
# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Inspecting_Profiles.ipynb)
#
# In this notebook, we'll show how you can use whylogs' Profile Viewer (`profile.view()`) to find useful statistics in a dataset.
#
# This includes:
#
# - Counters, such as the number of samples and null values
# - Inferred types, such as integral, fractional, boolean, and string
# - Estimated cardinality
# - Frequent items
# - Distribution metrics: min, max, mean, median, standard deviation, and quantile values
#
# ## Setup
# We'll need the `whylogs` and `pandas` libraries for this example.
#
# We'll also populate a dataframe with some data to inspect.

# In[1]:


# Install whylogs and pandas if needed.
# Note: you may need to restart the kernel to use updated packages.
get_ipython().run_line_magic('pip', 'install whylogs')
get_ipython().run_line_magic('pip', 'install pandas')


# In[2]:


# Import whylogs and pandas
import whylogs as why
import pandas as pd

# Show all columns when displaying a dataframe
pd.set_option("display.max_columns", None)


# In[3]:


# Create a simple test dataset
data = {
    "animal": ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo", "jellyfish", "jellyfish", "fish"],
    "legs": [4, 0, 4, 4.0, None, 2, None, None, "fins"],
    "weight": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],
}

# Create a dataframe from the test dataset
df = pd.DataFrame(data)


# ## Log data with whylogs, create a profile, and view statistics:

# In[4]:


# Log data with whylogs and create a profile
results = why.log(pandas=df)
profile = results.profile()

# Create the profile view and convert it to a dataframe
prof_view = profile.view()
prof_df = prof_view.to_pandas()


# In[4]:


# View the profile dataframe for dataset statistics
prof_df


# The profile dataframe has one row per column of the logged data. Each column of the profile dataframe holds one component of a given **Metric**.
#
# Taking a quick look at the generated statistics:
#
# #### animal
# The animal row shows there are `9` entries (`counts/n`). All of the values are strings. Cardinality estimates that `7` different animal types are in the dataset. Frequent items show `jellyfish` appearing the most.
#
# #### weight
# Our weight data contains `9` entries, all of them `fractional` values. Cardinality estimates that all `9` values are unique. Since all entries are numerical, the distribution statistics are generated.
#
# #### legs
# There are `9` entries for legs, but they span several data types: `3 null`, `4 integral`, `1 fractional` (float), and `1 string`. Cardinality estimates `5` unique values. The most frequent number of legs in the dataset is `4`.
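#
# The type counts reported for `legs` can be cross-checked by hand. The sketch below uses plain Python only and a hypothetical helper, `tally_types`, which is not part of whylogs. It assumes a simplified mapping from Python types to the category names used above and does not reflect how whylogs infers types internally.

```python
from collections import Counter

# Same values as the "legs" column in the test dataset above
legs = [4, 0, 4, 4.0, None, 2, None, None, "fins"]


def tally_types(values):
    """Hypothetical helper: count values per simplified type category."""
    counts = Counter()
    for value in values:
        if value is None:
            counts["null"] += 1
        elif isinstance(value, bool):
            # bool is a subclass of int, so it must be checked before int
            counts["boolean"] += 1
        elif isinstance(value, int):
            counts["integral"] += 1
        elif isinstance(value, float):
            counts["fractional"] += 1
        elif isinstance(value, str):
            counts["string"] += 1
        else:
            counts["object"] += 1
    return counts


legs_types = tally_types(legs)
print(dict(legs_types))
# → {'integral': 4, 'fractional': 1, 'null': 3, 'string': 1}
```

# The tallies line up with the values described for `legs`: 4 integral, 1 fractional, 1 string, and 3 null, for 9 entries in total.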
#
# ### Selecting a single value
# A single cell can be selected to see full results if needed.

# In[64]:


# Select a single statistic by feature (row) and metric (column)
prof_df.loc['animal', 'frequent_items/frequent_strings']


# ## Understanding the whylogs Profile Statistics
#
# By default, whylogs automatically generates these metrics based on data types.
#
# The standard metrics available in whylogs are grouped in namespaces. They are:
#
# ### Counts & Inferred Data Types:
# Counts and inferred data types track how many entries exist and what type of data they contain.
#
# - `counts/n` - the total number of entries in a feature
# - `counts/null` - the number of null values
# - `types/integral` - the number of values consisting of an integral (whole number)
# - `types/fractional` - the number of values consisting of a fractional value (float)
# - `types/boolean` - the number of values consisting of a boolean
# - `types/string` - the number of values consisting of a string
# - `types/object` - the number of values consisting of an object. Data that matches none of the previous types is counted as an object
#
# ### Cardinality
# Cardinality tracks the approximate number of unique values for each feature.
#
# - `cardinality/est` - the estimated number of unique values for each feature
# - `cardinality/upper_1` - the upper bound of the cardinality estimate at one standard deviation. The actual cardinality is expected, though not strictly guaranteed, to fall below this number.
# - `cardinality/lower_1` - the lower bound of the cardinality estimate at one standard deviation. The actual cardinality is expected, though not strictly guaranteed, to fall above this number.
#
# ### Frequent Items:
# Frequent items track which items show up the most.
#
# - `frequent_items/frequent_strings` - the most frequent items
#
# ### Distribution:
# Distribution statistics are generated when a feature contains numerical data.
#
# - `distribution/mean` - the calculated mean of the feature data
# - `distribution/stddev` - the calculated standard deviation of the feature data
# - `distribution/n` - the number of rows belonging to the feature
# - `distribution/max` - the highest (max) value in the feature
# - `distribution/min` - the smallest (min) value in the feature
# - `distribution/median` - the median value of the feature data
# - `distribution/q_xx` - the xx-th percentile of the data's distribution (for example, `distribution/q_25`)
#
# ## Data Types and Metrics
#
# whylogs maps different data types, like numpy arrays, lists, integers, etc., to specific whylogs data types. The three most important whylogs data types are:
#
# - Integral
# - Fractional
# - String
#
# By default, whylogs will track the following metrics according to the column's inferred data type:
#
# - Integral:
#     - `counts`
#     - `types`
#     - `distribution`
#     - `ints`
#     - `cardinality`
#     - `frequent_items`
# - Fractional:
#     - `counts`
#     - `types`
#     - `cardinality`
#     - `distribution`
# - String:
#     - `counts`
#     - `types`
#     - `cardinality`
#     - `frequent_items`
#
# If you want to customize this configuration by selecting metrics according to data type or column name, see the [Schema Configuration example](./Schema_Configuration.ipynb).

# That's it!
#
# If you want to know more about whylogs, check our [documentation](https://whylogs.readthedocs.io/en/latest/).