#!/usr/bin/env python
# coding: utf-8

# >### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
# >*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Merging_Profiles)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Merging_Profiles) to leverage the power of whylogs and WhyLabs together!*

# # Merging Profiles

# [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Merging_Profiles.ipynb)

# Sometimes we may want to profile a dataset in chunks. For example, our dataset may be distributed across multiple files or nodes, or it may be too large to fit in memory. Or we may have already profiled our dataset for several different date ranges and now want a holistic view of our data across the entire range.
#
# In any of these cases, merging profiles is a solution!
#
# ## Installing whylogs

# whylogs is made available as a Python package. You can get the latest version from PyPI with `pip install whylogs`:

# In[1]:


# Note: you may need to restart the kernel to use updated packages.
get_ipython().run_line_magic('pip', 'install whylogs')


# ## Loading a Pandas DataFrame

# Before profiling data, let's create a Pandas DataFrame from a public dataset.

# In[2]:


import pandas as pd

df_full = pd.read_csv("https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv")
print('row count: {}'.format(df_full.shape[0]))
df_full.sample(10)


# This dataset contains 945 rows and a mix of numeric and categorical features. Let's split this DataFrame into 3 chunks of different sizes.
# ## Splitting the DataFrame

# In[3]:


df_subset1 = df_full[0:100]
df_subset2 = df_full[100:400]
df_subset3 = df_full[400:]

print('Row Counts:')
print('Subset 1: {}'.format(df_subset1.shape[0]))
print('Subset 2: {}'.format(df_subset2.shape[0]))
print('Subset 3: {}'.format(df_subset3.shape[0]))


# ## Profiling a Single Dataset

# Let's profile the first subset.

# In[4]:


import whylogs as why

results = why.log(df_subset1)
profile = results.profile()


# The code above generates a *ProfileResultSet* instance and assigns it to the **results** variable. We then call the **profile** method on this object to get a *DatasetProfile* instance, which we assign to the **profile** variable.
#
# We can inspect our profile by generating a Pandas DataFrame from it. Let's view the first few rows.

# In[5]:


subset1_profile_df = profile.view().to_pandas()
subset1_profile_df.head()


# From the **counts/n** column, we can see that our subset of data contained 100 rows, as expected. Before we start merging new profiles, let's grab the mean of the "Item Price" column as another point of reference.

# In[6]:


"Mean Item Price for Subset 1: {}".format(subset1_profile_df['distribution/mean'].loc['Item Price'])


# ## Merging Profiles

# We can call the **track** method on our profile to profile a new dataset and merge it into our existing profile in one step. This can be done successively for multiple subsets of data.

# In[7]:


profile.track(df_subset2)
profile.track(df_subset3)


# Let's now inspect the merged profile as a Pandas DataFrame.

# In[8]:


full_profile_df = profile.view().to_pandas()
full_profile_df.head()


# We now see that each column has a count of 945, as expected. Let's revisit the mean of the "Item Price" column.

# In[9]:


"Mean Item Price from merged profile: {}".format(full_profile_df['distribution/mean'].loc['Item Price'])


# Let's compare this with the mean we get using the **mean** method from Pandas.

# In[10]:


df_full['Item Price'].mean()


# It's nearly an exact match!
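
# The match is no accident: statistics such as a row count and a mean can be rebuilt exactly from per-chunk summaries. The cell below is a toy illustration of that idea in plain Python. `ToySummary` is a hypothetical stand-in written for this walkthrough, not a whylogs class, and it makes no claim about how whylogs computes or stores its metrics internally. The prices are made-up values.

# In[ ]:


from dataclasses import dataclass


@dataclass(frozen=True)
class ToySummary:
    """Hypothetical mergeable summary holding a row count and a running total."""
    n: int = 0
    total: float = 0.0

    @classmethod
    def from_values(cls, values):
        values = list(values)
        return cls(n=len(values), total=float(sum(values)))

    def merge(self, other):
        # Counts and totals simply add, so no raw rows are needed to combine chunks.
        return ToySummary(self.n + other.n, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.n if self.n else float("nan")


toy_prices = [4.0, 10.0, 1.0, 5.0, 20.0, 2.0]  # made-up values, not the tour dataset
toy_chunks = [toy_prices[:1], toy_prices[1:3], toy_prices[3:]]  # chunks of unequal sizes

toy_merged_summary = ToySummary()
for chunk in toy_chunks:
    toy_merged_summary = toy_merged_summary.merge(ToySummary.from_values(chunk))

# The mean from the merged summaries next to the mean over all values at once.
print(toy_merged_summary.n, toy_merged_summary.mean)
print(sum(toy_prices) / len(toy_prices))
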
# Note that in this example, we profiled 3 subsets of unequal sizes and merged the results into a single profile. This merged profile captures telemetry describing our entire dataset.
#
# This property of **mergeability** makes whylogs particularly powerful. It allows us to profile datasets that live in a distributed pipeline, even if our data is never together in one place at any time.
#
# Mergeability also makes it trivial to roll up from hourly to daily, weekly, or monthly views of your data.

# ## Merging Profile Views

# Another option is to merge Profile *Views*.
#
# A ProfileView object can be generated from a DatasetProfile object. It allows for inspection of individual profiles and can be visualized using our visualization module.
#
# This is a good option if users wish to inspect profiles of their entire dataset while maintaining the ability to inspect individual profiles of the subsets of data.

# In[11]:


results = why.log(df_subset1)
profile_view1 = results.profile().view()

results = why.log(df_subset2)
profile_view2 = results.profile().view()

results = why.log(df_subset3)
profile_view3 = results.profile().view()


# Similar to the previous example, we find that the first profile view counted 100 rows in the subset of data it profiled.

# In[12]:


profile_view1.to_pandas().head()


# We can merge these ProfileView objects using the **merge** method. We assign the result to a new variable and view a few rows of the profile's DataFrame.

# In[13]:


merged_profile_view = profile_view1.merge(profile_view2).merge(profile_view3)
merged_profile_view.to_pandas().head()


# As expected, we see 945 rows. Unlike the **track** method, the **merge** method doesn't update the original objects. In other words, we can still inspect the individual profile views from our subsets of data.
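
# When there are more than a few views, chaining **merge** calls by hand gets unwieldy. The cell below is a minimal sketch of folding a list of views with `functools.reduce`. `merge_all` is a hypothetical helper, and `ToyView` is a toy stand-in used only so the cell runs without whylogs. The helper assumes nothing beyond each object exposing a `merge(other)` method that returns a new object, which is the behavior described above for profile views.

# In[ ]:


from functools import reduce


def merge_all(views):
    """Fold a non-empty sequence of mergeable objects into one (hypothetical helper)."""
    views = list(views)
    if not views:
        raise ValueError("merge_all needs at least one view")
    return reduce(lambda left, right: left.merge(right), views)


class ToyView:
    """Toy stand-in for a profile view that tracks only a row count."""

    def __init__(self, rows):
        self.rows = rows

    def merge(self, other):
        # Return a new object and leave both inputs untouched.
        return ToyView(self.rows + other.rows)


toy_views = [ToyView(100), ToyView(300), ToyView(545)]
toy_merged = merge_all(toy_views)

# The merged row count, followed by the unchanged counts of the inputs.
print(toy_merged.rows, [view.rows for view in toy_views])


# Under the same assumption, `merge_all([profile_view1, profile_view2, profile_view3])` should give the same result as the chained call in the cell above.
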
#
# Keep in mind that the **track** method only works on *DatasetProfile* objects, while the **merge** method only operates on *DatasetProfileView* objects.

# In[14]:


profile_view1.to_pandas().head()


# ## Mergeability in WhyLabs

# In WhyLabs, profile merging is done automatically. If you have a WhyLabs dataset with a daily batch frequency, then any profiles uploaded during a given day will automatically be merged into a day-level view of your data.

# ## What's Next?

# There's a lot you can do with the profiles you just created. Take a look at our other examples: https://whylogs.readthedocs.io/en/latest/examples