🚩 Create a free WhyLabs account to get more value out of whylogs!
Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!
Sometimes we may want to profile a dataset in chunks. For example, we may have our dataset distributed across multiple files or nodes, or perhaps our dataset is too large to fit in memory. Maybe we already profiled our dataset for several different date ranges and we want to see a holistic view of our data across the entire range.
In any case, merging profiles is a solution!
whylogs is available as a Python package. You can install the latest version from PyPI with pip install whylogs:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Before profiling data, let's create a Pandas DataFrame from a public dataset.
import pandas as pd
df_full = pd.read_csv("https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv")
print('row count: {}'.format(df_full.shape[0]))
df_full.sample(10)
row count: 945
| | Transaction ID | Customer ID | Quantity | Item Price | Total Tax | Total Amount | Store Type | Product Category | Product Subcategory | Gender | Transaction Type | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 682 | T74278458640 | C267835 | 2 | 63.2 | 13.2720 | 139.6720 | TeleShop | Books | DIY | M | Purchase | 33.0 |
| 256 | T54377774372 | C270496 | 5 | 75.5 | 39.6375 | 417.1375 | MBR | Electronics | Audio and video | M | Purchase | 23.0 |
| 67 | T64030190529 | C269524 | 5 | 52.5 | 27.5625 | 290.0625 | e-Shop | Bags | Mens | F | Purchase | 24.0 |
| 762 | T18970114223 | C272730 | 2 | 80.6 | 16.9260 | 178.1260 | e-Shop | Home and kitchen | Kitchen | M | Purchase | 39.0 |
| 94 | T94404065446 | C271648 | 5 | 48.0 | 25.2000 | 265.2000 | e-Shop | Clothing | Kids | F | Purchase | 34.0 |
| 197 | T30540748600 | C269603 | 1 | 127.4 | 13.3770 | 140.7770 | TeleShop | Footwear | Women | M | Purchase | 41.0 |
| 161 | T78998671169 | C270907 | 1 | 104.9 | 11.0145 | 115.9145 | MBR | Books | DIY | F | Purchase | 41.0 |
| 574 | T19424023275 | C270462 | 2 | 127.5 | 26.7750 | 281.7750 | TeleShop | Electronics | Cameras | F | Purchase | 38.0 |
| 583 | T7986658313 | C269047 | 3 | 25.9 | 8.1585 | 85.8585 | e-Shop | Clothing | Women | M | Purchase | 25.0 |
| 805 | T36786634925 | C267437 | 3 | 50.0 | 15.7500 | 165.7500 | TeleShop | Books | Children | M | Purchase | 35.0 |
This dataset has 945 rows with a mix of numeric and categorical features. Let's split the DataFrame into three chunks of different sizes.
df_subset1 = df_full[0:100]
df_subset2 = df_full[100:400]
df_subset3 = df_full[400:]
print('Row Counts:')
print('Subset 1: {}'.format(df_subset1.shape[0]))
print('Subset 2: {}'.format(df_subset2.shape[0]))
print('Subset 3: {}'.format(df_subset3.shape[0]))
Row Counts:
Subset 1: 100
Subset 2: 300
Subset 3: 545
Let's profile the first subset.
import whylogs as why
results = why.log(df_subset1)
profile = results.profile()
The code above generates a ProfileResultSet instance and assigns it to the results variable. We then call the profile method on this object to generate a DatasetProfile instance, which we assign to the profile variable.
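If you'd like to confirm which objects you are working with, a quick optional check of their types:

print(type(results))  # the result set returned by why.log()
print(type(profile))  # the dataset profile extracted from it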
We can inspect our profile by generating a Pandas DataFrame from it. Let's view the first few rows.
subset1_profile_df = profile.view().to_pandas()
subset1_profile_df.head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 2.000000 | 2.000100 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 99.000024 | 99.004967 | 99.0 | ... | -153.816 | 8.619 | 66.521 | 216.359 | 321.555 | 580.788 | 642.5575 | 795.6 | NaN | NaN |
Customer ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 98.000024 | 98.004917 | 98.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 97.000023 | 97.004866 | 97.0 | ... | 10.000 | 25.700 | 40.700 | 76.800 | 111.100 | 135.200 | 139.8000 | 148.9 | NaN | NaN |
Transaction ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 99.000024 | 99.004967 | 99.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
From the counts/n column, we can see that our subset of data contained 100 rows, as expected. Before we start merging new profiles, let's grab the mean of the Item Price column for another point of reference.
"Mean Item Price for Subset 1: {}".format(subset1_profile_df['distribution/mean'].loc['Item Price'])
'Mean Item Price for Subset 1: 74.189'
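If we want to double-check that number against the raw data, we can compute the same mean directly with Pandas; a minimal sketch using the subset we just profiled:

# Mean of the raw column for subset 1; this should closely match the profile's estimate.
df_subset1['Item Price'].mean()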
We can call the track method on our profile to profile a new batch of data and merge the result into our existing profile in a single step. This can be done successively for multiple subsets of data.
profile.track(df_subset2)
profile.track(df_subset3)
Let's now inspect the merged profile as a Pandas DataFrame.
full_profile_df = profile.view().to_pandas()
full_profile_df.head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 945 | 0 | 0 | 0 | 0 | 945 | 0 | 2.000000 | 2.000100 | 2.000000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 945 | 0 | 0 | 945 | 0 | 0 | 0 | 844.069184 | 855.117588 | 833.289540 | ... | -233.376 | 14.365 | 79.0075 | 179.452 | 356.915 | 580.788 | 654.16 | 804.44 | NaN | NaN |
Customer ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | 869.683985 | 881.067672 | 858.577213 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 945 | 0 | 0 | 945 | 0 | 0 | 0 | 705.028228 | 714.256661 | 696.024282 | ... | 13.800 | 22.300 | 45.0000 | 80.600 | 116.600 | 138.200 | 145.10 | 149.00 | NaN | NaN |
Transaction ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | 935.275741 | 947.517988 | 923.331294 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
We now see that each column has a count of 945, as expected. Let's revisit the mean of the Item Price column.
"Mean Item Price from merged profile: {}".format(full_profile_df['distribution/mean'].loc['Item Price'])
'Mean Item Price from merged profile: 79.84814814814818'
Let's compare this with the mean we get using the mean method from Pandas.
df_full['Item Price'].mean()
79.84814814814808
It's nearly an exact match! Note that in this example, we profiled three subsets of unequal size independently and merged the resulting profiles. The merged profile captures telemetry describing our entire dataset.
This property of mergeability makes whylogs particularly powerful. It allows us to profile datasets that live in a distributed pipeline, even if the data is never together in one place at any time.
Mergeability also makes it trivial to roll up hourly profiles into daily, weekly, or monthly views of your data.
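To illustrate the roll-up idea, here is a minimal sketch, assuming a hypothetical list hourly_dfs of Pandas DataFrames (one per hour). It combines the hourly profile views with the merge method, which is covered in more detail below.

from functools import reduce

import whylogs as why

# Profile each hourly batch independently; in practice these could be produced
# on different workers or pipeline stages.
hourly_views = [why.log(df).profile().view() for df in hourly_dfs]

# Roll the hourly views up into a single daily view.
daily_view = reduce(lambda left, right: left.merge(right), hourly_views)

# counts/n in the daily view reflects the total row count across all hours.
print(daily_view.to_pandas()['counts/n'])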
Another option is to merge Profile Views.
A ProfileView object can be generated from a DatasetProfile object, which allows for inspection of individual profiles as well as the ability to visualize them using our visualization module.
This is a good option if users wish to inspect a profile of the entire dataset while maintaining the ability to inspect individual profiles of the subsets of data.
results = why.log(df_subset1)
profile_view1 = results.profile().view()
results = why.log(df_subset2)
profile_view2 = results.profile().view()
results = why.log(df_subset3)
profile_view3 = results.profile().view()
Similar to the previous example, we find that the first profile view counted 100 rows in the subset of data it profiled.
profile_view1.to_pandas().head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 2.000000 | 2.000100 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 99.000024 | 99.004967 | 99.0 | ... | -153.816 | 8.619 | 66.521 | 216.359 | 321.555 | 580.788 | 642.5575 | 795.6 | NaN | NaN |
Customer ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 98.000024 | 98.004917 | 98.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 97.000023 | 97.004866 | 97.0 | ... | 10.000 | 25.700 | 40.700 | 76.800 | 111.100 | 135.200 | 139.8000 | 148.9 | NaN | NaN |
Transaction ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 99.000024 | 99.004967 | 99.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
We can merge these ProfileView objects using the merge method. We assign the result to a new variable and view a few rows of the profile's DataFrame.
merged_profile_view = profile_view1.merge(profile_view2).merge(profile_view3)
merged_profile_view.to_pandas().head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | frequent_items/frequent_strings | cardinality/est | cardinality/upper_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 945 | 0 | 0 | 0 | 0 | 945 | 0 | [FrequentItem(value='M', est=489, upper=489, l... | 2.000000 | 2.000100 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 945 | 0 | 0 | 945 | 0 | 0 | 0 | NaN | 849.185065 | 860.300432 | ... | -233.376 | 14.365 | 78.676 | 178.126 | 357.0255 | 580.346 | 657.475 | 804.44 | NaN | NaN |
Customer ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | [FrequentItem(value='C273096', est=3, upper=2,... | 858.998625 | 873.131713 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 945 | 0 | 0 | 945 | 0 | 0 | 0 | NaN | 701.002487 | 710.178225 | ... | 15.000 | 22.700 | 45.200 | 81.200 | 116.7000 | 138.200 | 145.600 | 149.00 | NaN | NaN |
Transaction ID | 945 | 0 | 0 | 0 | 0 | 945 | 0 | [FrequentItem(value='T79960195196', est=3, upp... | 942.466233 | 957.972612 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
As expected, we see 945 rows. Unlike the track method, the merge method doesn't update the original objects; it returns a new merged view. In other words, we can still inspect the individual profile views from our subsets of data.
Keep in mind that the track method only works on DatasetProfile objects, while the merge method only operates on DatasetProfileView objects.
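To make that distinction concrete, here is a minimal sketch of the two patterns side by side, reusing the subsets defined earlier.

import whylogs as why

# Pattern 1: track() mutates an existing DatasetProfile in place.
profile = why.log(df_subset1).profile()
profile.track(df_subset2)  # profile now describes both subsets

# Pattern 2: merge() combines DatasetProfileView objects and returns a new
# view, leaving the inputs unchanged.
view1 = why.log(df_subset1).profile().view()
view2 = why.log(df_subset2).profile().view()
combined_view = view1.merge(view2)

Below, we confirm that profile_view1 itself was not modified by the merge we performed earlier.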
profile_view1.to_pandas().head()
column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/q_05 | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | distribution/q_95 | distribution/q_99 | ints/max | ints/min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gender | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 2.000000 | 2.000100 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Total Amount | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 99.000024 | 99.004967 | 99.0 | ... | -153.816 | 8.619 | 66.521 | 216.359 | 321.555 | 580.788 | 642.5575 | 795.6 | NaN | NaN |
Customer ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 98.000024 | 98.004917 | 98.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Item Price | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 97.000023 | 97.004866 | 97.0 | ... | 10.000 | 25.700 | 40.700 | 76.800 | 111.100 | 135.200 | 139.8000 | 148.9 | NaN | NaN |
Transaction ID | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 99.000024 | 99.004967 | 99.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
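Since the individual views are still available alongside the merged one, they can also be passed to the visualization module mentioned earlier. Here is a minimal sketch, assuming the optional viz dependencies are installed (pip install "whylogs[viz]"), comparing the first subset's view against the merged view.

from whylogs.viz import NotebookProfileVisualizer

# Compare the first subset's view (target) against the merged view (reference).
visualization = NotebookProfileVisualizer()
visualization.set_profiles(
    target_profile_view=profile_view1,
    reference_profile_view=merged_profile_view,
)

# Render an interactive drift report inside the notebook.
visualization.summary_drift_report()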
In WhyLabs, profile merging is done automatically. If your WhyLabs dataset has a daily batch frequency, any profiles uploaded during a given day will automatically be merged into a day-level view of your data.
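For reference, here is a minimal sketch of what uploading a profile to WhyLabs might look like, assuming the whylabs writer integration and its WHYLABS_* environment variables; the org ID, dataset ID, and API key below are placeholders to replace with your own values.

import os

import whylogs as why

# Placeholder credentials; replace with the values from your WhyLabs account.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-0"
os.environ["WHYLABS_API_KEY"] = "<your-api-key>"

# Log a batch and send the resulting profile to WhyLabs; profiles uploaded
# within the same day are merged automatically on the platform.
results = why.log(df_subset1)
results.writer("whylabs").write()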
There's a lot you can do with the profiles you just created. You can take a look at our other examples at https://whylogs.readthedocs.io/en/latest/examples!