🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
Sometimes we may want to profile a dataset in chunks. For example, we may have our dataset distributed across multiple files or nodes, or perhaps our dataset is too large to fit in memory. Maybe we already profiled our dataset for several different date ranges and we want to see a holistic view of our data across the entire range.
In any case, merging profiles is a solution!
whylogs is available as a Python package. You can install the latest version from PyPI with pip install whylogs:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Before profiling data, let's create a Pandas DataFrame from a public dataset.
import pandas as pd
df_full = pd.read_csv("https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv")
print('row count: {}'.format(df_full.shape[0]))
df_full.sample(10)
row count: 945
| | Transaction ID | Customer ID | Quantity | Item Price | Total Tax | Total Amount | Store Type | Product Category | Product Subcategory | Gender | Transaction Type | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 682 | T74278458640 | C267835 | 2 | 63.2 | 13.2720 | 139.6720 | TeleShop | Books | DIY | M | Purchase | 33.0 |
| 256 | T54377774372 | C270496 | 5 | 75.5 | 39.6375 | 417.1375 | MBR | Electronics | Audio and video | M | Purchase | 23.0 |
| 67 | T64030190529 | C269524 | 5 | 52.5 | 27.5625 | 290.0625 | e-Shop | Bags | Mens | F | Purchase | 24.0 |
| 762 | T18970114223 | C272730 | 2 | 80.6 | 16.9260 | 178.1260 | e-Shop | Home and kitchen | Kitchen | M | Purchase | 39.0 |
| 94 | T94404065446 | C271648 | 5 | 48.0 | 25.2000 | 265.2000 | e-Shop | Clothing | Kids | F | Purchase | 34.0 |
| 197 | T30540748600 | C269603 | 1 | 127.4 | 13.3770 | 140.7770 | TeleShop | Footwear | Women | M | Purchase | 41.0 |
| 161 | T78998671169 | C270907 | 1 | 104.9 | 11.0145 | 115.9145 | MBR | Books | DIY | F | Purchase | 41.0 |
| 574 | T19424023275 | C270462 | 2 | 127.5 | 26.7750 | 281.7750 | TeleShop | Electronics | Cameras | F | Purchase | 38.0 |
| 583 | T7986658313 | C269047 | 3 | 25.9 | 8.1585 | 85.8585 | e-Shop | Clothing | Women | M | Purchase | 25.0 |
| 805 | T36786634925 | C267437 | 3 | 50.0 | 15.7500 | 165.7500 | TeleShop | Books | Children | M | Purchase | 35.0 |
This dataset has 945 rows with a mix of numeric and categorical features. Let's split the DataFrame into three chunks of different sizes.
df_subset1 = df_full[0:100]
df_subset2 = df_full[100:400]
df_subset3 = df_full[400:]
print('Row Counts:')
print('Subset 1: {}'.format(df_subset1.shape[0]))
print('Subset 2: {}'.format(df_subset2.shape[0]))
print('Subset 3: {}'.format(df_subset3.shape[0]))
Row Counts:
Subset 1: 100
Subset 2: 300
Subset 3: 545
Let's profile the first subset.
import whylogs as why
results = why.log(df_subset1)
profile = results.profile()
The code above generates a ProfileResultSet instance and assigns it to the results variable. We then call the profile method on this object to generate a DatasetProfile instance, which we assign to the profile variable.
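If you'd like to confirm which objects you are working with, a quick optional check of their types:

print(type(results))  # the result set returned by why.log()
print(type(profile))  # the dataset profile extracted from it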
We can inspect our profile by generating a Pandas DataFrame from it. Let's view the first few rows.
subset1_profile_df = profile.view().to_pandas()
subset1_profile_df.head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 2.000000 | 2.000100 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 99.000024 | 99.004967 | 99.0 | ... | -153.816 | 8.619 | 66.521 | 216.359 | 321.555 | 580.788 | 642.5575 | 795.6 | NaN | NaN |
Customer ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 98.000024 | 98.004917 | 98.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 97.000023 | 97.004866 | 97.0 | ... | 10.000 | 25.700 | 40.700 | 76.800 | 111.100 | 135.200 | 139.8000 | 148.9 | NaN | NaN |
Transaction ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 99.000024 | 99.004967 | 99.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
From the counts/n column, we can see that our subset of data contained 100 rows, as expected. Before we start merging new profiles, let's grab the mean of the Item Price column for another point of reference.
"Mean Item Price for Subset 1: {}".format(subset1_profile_df['distribution/mean'].loc['Item Price'])
'Mean Item Price for Subset 1: 74.189'
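If we want to double-check that number against the raw data, we can compute the same mean directly with Pandas; a minimal sketch using the subset we just profiled:

# Mean of the raw column for subset 1; this should closely match the profile's estimate.
df_subset1['Item Price'].mean()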
We can call the track method on our profile to profile a new batch of data and merge the result into our existing profile in a single step. This can be done successively for multiple subsets of data.
profile.track(df_subset2)
profile.track(df_subset3)
Let's now inspect the merged profile as a Pandas DataFrame.
full_profile_df = profile.view().to_pandas()
full_profile_df.head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 945 | 0 | 0 | 0 | 0 | 945 | 0 | 2.000000 | 2.000100 | 2.000000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 945 | 0 | 0 | 945 | 0 | 0 | 0 | 844.069184 | 855.117588 | 833.289540 | ... | -233.376 | 14.365 | 79.0075 | 179.452 | 356.915 | 580.788 | 654.16 | 804.44 | NaN | NaN |
Customer ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | 869.683985 | 881.067672 | 858.577213 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 945 | 0 | 0 | 945 | 0 | 0 | 0 | 705.028228 | 714.256661 | 696.024282 | ... | 13.800 | 22.300 | 45.0000 | 80.600 | 116.600 | 138.200 | 145.10 | 149.00 | NaN | NaN |
Transaction ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | 935.275741 | 947.517988 | 923.331294 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
We now see that each column has a count of 945, as expected. Let's revisit the mean of the Item Price column.
"Mean Item Price from merged profile: {}".format(full_profile_df['distribution/mean'].loc['Item Price'])
'Mean Item Price from merged profile: 79.84814814814818'
Let's compare this with the mean we get using the mean method from Pandas.
df_full['Item Price'].mean()
79.84814814814808
It's nearly an exact match! Note that in this example, we profiled three subsets of unequal size independently and merged the resulting profiles. The merged profile captures telemetry describing our entire dataset.
This property of mergeability makes whylogs particularly powerful. It allows us to profile datasets that live in a distributed pipeline, even if the data is never together in one place at any time.
Mergeability also makes it trivial to roll up hourly profiles into daily, weekly, or monthly views of your data.
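To illustrate the roll-up idea, here is a minimal sketch, assuming a hypothetical list hourly_dfs of Pandas DataFrames (one per hour). It combines the hourly profile views with the merge method, which is covered in more detail below.

from functools import reduce

import whylogs as why

# Profile each hourly batch independently; in practice these could be produced
# on different workers or pipeline stages.
hourly_views = [why.log(df).profile().view() for df in hourly_dfs]

# Roll the hourly views up into a single daily view.
daily_view = reduce(lambda left, right: left.merge(right), hourly_views)

# counts/n in the daily view reflects the total row count across all hours.
print(daily_view.to_pandas()['counts/n'])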
Another option is to merge Profile Views.
A ProfileView object can be generated from a DatasetProfile object, which allows for inspection of individual profiles as well as the ability to visualize them using our visualization module.
This is a good option if users wish to inspect a profile of the entire dataset while maintaining the ability to inspect individual profiles of the subsets of data.
results = why.log(df_subset1)
profile_view1 = results.profile().view()
results = why.log(df_subset2)
profile_view2 = results.profile().view()
results = why.log(df_subset3)
profile_view3 = results.profile().view()
Similar to the previous example, we find that the first profile view counted 100 rows in the subset of data it profiled.
profile_view1.to_pandas().head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 2.000000 | 2.000100 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 99.000024 | 99.004967 | 99.0 | ... | -153.816 | 8.619 | 66.521 | 216.359 | 321.555 | 580.788 | 642.5575 | 795.6 | NaN | NaN |
Customer ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 98.000024 | 98.004917 | 98.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 97.000023 | 97.004866 | 97.0 | ... | 10.000 | 25.700 | 40.700 | 76.800 | 111.100 | 135.200 | 139.8000 | 148.9 | NaN | NaN |
Transaction ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 99.000024 | 99.004967 | 99.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
We can merge these ProfileView objects using the merge method. We assign the result to a new variable and view a few rows of the profile's DataFrame.
merged_profile_view = profile_view1.merge(profile_view2).merge(profile_view3)
merged_profile_view.to_pandas().head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | frequent_items/frequent_strings | cardinality/est | cardinality/upper_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 945 | 0 | 0 | 0 | 0 | 945 | 0 | [FrequentItem(value='M', est=489, upper=489, l... | 2.000000 | 2.000100 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 945 | 0 | 0 | 945 | 0 | 0 | 0 | NaN | 849.185065 | 860.300432 | ... | -233.376 | 14.365 | 78.676 | 178.126 | 357.0255 | 580.346 | 657.475 | 804.44 | NaN | NaN |
Customer ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | [FrequentItem(value='C273096', est=3, upper=2,... | 858.998625 | 873.131713 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 945 | 0 | 0 | 945 | 0 | 0 | 0 | NaN | 701.002487 | 710.178225 | ... | 15.000 | 22.700 | 45.200 | 81.200 | 116.7000 | 138.200 | 145.600 | 149.00 | NaN | NaN |
Transaction ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | [FrequentItem(value='T79960195196', est=3, upp... | 942.466233 | 957.972612 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
As expected, we see 945 rows. Unlike the track method, the merge method doesn't update the original objects; it returns a new merged view. In other words, we can still inspect the individual profile views from our subsets of data.
Keep in mind that the track method only works on DatasetProfile objects, while the merge method only operates on DatasetProfileView objects.
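To make that distinction concrete, here is a minimal sketch of the two patterns side by side, reusing the subsets defined earlier.

import whylogs as why

# Pattern 1: track() mutates an existing DatasetProfile in place.
profile = why.log(df_subset1).profile()
profile.track(df_subset2)  # profile now describes both subsets

# Pattern 2: merge() combines DatasetProfileView objects and returns a new
# view, leaving the inputs unchanged.
view1 = why.log(df_subset1).profile().view()
view2 = why.log(df_subset2).profile().view()
combined_view = view1.merge(view2)

Below, we confirm that profile_view1 itself was not modified by the merge we performed earlier.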
profile_view1.to_pandas().head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 2.000000 | 2.000100 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 99.000024 | 99.004967 | 99.0 | ... | -153.816 | 8.619 | 66.521 | 216.359 | 321.555 | 580.788 | 642.5575 | 795.6 | NaN | NaN |
Customer ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 98.000024 | 98.004917 | 98.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 97.000023 | 97.004866 | 97.0 | ... | 10.000 | 25.700 | 40.700 | 76.800 | 111.100 | 135.200 | 139.8000 | 148.9 | NaN | NaN |
Transaction ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 99.000024 | 99.004967 | 99.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
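Since the individual views are still available alongside the merged one, they can also be passed to the visualization module mentioned earlier. Here is a minimal sketch, assuming the optional viz dependencies are installed (pip install "whylogs[viz]"), comparing the first subset's view against the merged view.

from whylogs.viz import NotebookProfileVisualizer

# Compare the first subset's view (target) against the merged view (reference).
visualization = NotebookProfileVisualizer()
visualization.set_profiles(
    target_profile_view=profile_view1,
    reference_profile_view=merged_profile_view,
)

# Render an interactive drift report inside the notebook.
visualization.summary_drift_report()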
In WhyLabs, profile merging is done automatically. If your WhyLabs dataset has a daily batch frequency, any profiles uploaded during a given day will automatically be merged into a day-level view of your data.
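For reference, here is a minimal sketch of what uploading a profile to WhyLabs might look like, assuming the whylabs writer integration and its WHYLABS_* environment variables; the org ID, dataset ID, and API key below are placeholders to replace with your own values.

import os

import whylogs as why

# Placeholder credentials; replace with the values from your WhyLabs account.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-0"
os.environ["WHYLABS_API_KEY"] = "<your-api-key>"

# Log a batch and send the resulting profile to WhyLabs; profiles uploaded
# within the same day are merged automatically on the platform.
results = why.log(df_subset1)
results.writer("whylabs").write()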
There's a lot you can do with the profiles you just created. You can take a look at our other examples at https://whylogs.readthedocs.io/en/latest/examples!