Hello there! If you've come to this tutorial, perhaps you are wondering what you can do after generating your first (or maybe not your first) profile. A good practice is to store these profiles as lightweight files, which is one of the cool features whylogs brings to the table.
Here we will check out different flavors of writing profiles, so you can decide which one meets your current needs. Shall we?
To get started, let's take a very simple example dataset and profile it.
import pandas as pd
data = {
"col_1": [1.0, 2.2, 0.1, 1.2],
"col_2": ["some", "text", "column", "example"],
"col_3": [4, 2, 3, 5]
}
df = pd.DataFrame(data)
df.head()
| | col_1 | col_2 | col_3 |
|---|---|---|---|
| 0 | 1.0 | some | 4 |
| 1 | 2.2 | text | 2 |
| 2 | 0.1 | column | 3 |
| 3 | 1.2 | example | 5 |
import whylogs as why
profile_results = why.log(df)
type(profile_results)
whylogs.api.logger.result_set.ProfileResultSet
And now we can check its collected metrics by transforming it into a DatasetProfileView.
profile_view = profile_results.view()
profile_view.to_pandas()
| column | counts/n | counts/null | types/integral | types/fractional | types/boolean | types/string | types/object | cardinality/est | cardinality/upper_1 | cardinality/lower_1 | ... | distribution/n | distribution/max | distribution/min | distribution/q_10 | distribution/q_25 | distribution/median | distribution/q_75 | distribution/q_90 | ints/max | ints/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| col_2 | 4 | 0 | 0 | 0 | 0 | 4 | 0 | 4.0 | 4.0002 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| col_1 | 4 | 0 | 0 | 4 | 0 | 0 | 0 | 4.0 | 4.0002 | 4.0 | ... | 4.0 | 2.2 | 0.1 | 0.1 | 1.0 | 1.2 | 2.2 | 2.2 | NaN | NaN |
| col_3 | 4 | 0 | 4 | 0 | 0 | 0 | 0 | 4.0 | 4.0002 | 4.0 | ... | 4.0 | 5.0 | 2.0 | 2.0 | 3.0 | 4.0 | 5.0 | 5.0 | 5.0 | 2.0 |
3 rows × 24 columns
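Since to_pandas() returns a regular DataFrame, you can slice it like any other pandas frame to pull out a single metric. Here is a minimal sketch using a small stand-in frame (so it runs without whylogs); the column and metric names simply mirror the table above:

```python
import pandas as pd

# Stand-in for profile_view.to_pandas(): metric names as columns,
# profiled column names as the index.
metrics = pd.DataFrame(
    {"counts/n": [4, 4, 4], "distribution/max": [float("nan"), 2.2, 5.0]},
    index=pd.Index(["col_2", "col_1", "col_3"], name="column"),
)

# Look up one metric for one profiled column.
print(metrics.loc["col_1", "distribution/max"])
```

The same .loc pattern works on the real to_pandas() output, since it is just a pandas DataFrame indexed by column name.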
Cool! Now that we have a proper profile created, let's see how we can persist it as a file.
The first and most straightforward way of persisting a whylogs profile as a file is to write it directly to your disk. Our API makes this possible with the following commands. You can either write it from the ProfileResultSet:
profile_results.writer("local").write(dest="my_profile.bin")
If you want, you can also skip dest and pass an optional base_dir instead, which will write your profile, named with its timestamp, into the base directory you choose. Let's see how:
profile_results.writer("local").option(base_dir="my_directory").write()
Or from the DatasetProfileView directly, with a path:
profile_view.write(path="my_profile.bin")
And, to make it even more convenient, you can use the same entry point you use for logging to write your profile, like:
why.write(profile=profile_view, base_dir="my_directory/my_profile.bin")
import os
os.listdir("./my_directory")
['profile_2022-05-26 20:04:07.484778.bin', 'my_profile.bin']
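As the listing shows, when you don't pass a dest the writer names the file after the profile's creation timestamp. If you want predictable paths under your own control, you can build a similar timestamped name yourself. This is a minimal stdlib sketch that just mirrors the profile_<timestamp>.bin pattern seen above, not an API guarantee:

```python
import os
from datetime import datetime, timezone
from typing import Optional

def timestamped_profile_path(base_dir: str, when: Optional[datetime] = None) -> str:
    # Mirror the "profile_<timestamp>.bin" naming seen in the directory listing above.
    when = when or datetime.now(timezone.utc)
    filename = f"profile_{when.strftime('%Y-%m-%d %H:%M:%S.%f')}.bin"
    return os.path.join(base_dir, filename)

print(timestamped_profile_path("my_directory"))
```

You could then pass the resulting path as dest (or path) in the writing calls above.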
And that's it! Now you can decide where and how to store these profiles for further inspection, making sure your data and ML pipelines are delivering useful, high-quality data to your end users.
From an enterprise perspective, it can be interesting to store your profiles in s3 buckets instead of deciding manually what to do with them on your local machine. And that is why we have created an integration to do just that!
To keep this example simple, we won't use actual cloud-based storage; instead, we will mock one with the moto library. This way, you can test this anywhere without worrying too much about credentials :) To keep whylogs as light as possible, and to allow users to extend it as they need, we have made s3 an extra dependency.
So let's get started by creating this mocked s3 bucket.
P.S.: if you haven't installed the whylogs[s3] extra dependency already, uncomment and run the cell below.
# ! pip install "whylogs[s3]"
import boto3
from moto import mock_s3
from moto.s3.responses import DEFAULT_REGION_NAME
BUCKET_NAME = "my_great_bucket"
mocks3 = mock_s3()
mocks3.start()
resource = boto3.resource("s3", region_name=DEFAULT_REGION_NAME)
resource.create_bucket(Bucket=BUCKET_NAME)
s3.Bucket(name='my_great_bucket')
Now that we have created our s3 bucket, we are able to communicate with the mocked storage object. A good practice here is to declare your access credentials as environment variables. In a production setting, these wouldn't be hardcoded, but this will give you a sense of how to safely use our s3 writer.
import os
os.environ["AWS_ACCESS_KEY_ID"] = "my_key_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "my_access_key"
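In a real deployment these values would come from the environment or a secrets manager rather than being set in code. A small stdlib sketch of a fail-fast lookup (require_env is a hypothetical helper, not part of whylogs; the setdefault calls just mirror the cell above so the sketch runs standalone):

```python
import os

# Mirror the cell above; in production these would already be set in the environment.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "my_key_id")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "my_access_key")

def require_env(name: str) -> str:
    # Fail fast with a clear message instead of a confusing auth error downstream.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

key_id = require_env("AWS_ACCESS_KEY_ID")
secret = require_env("AWS_SECRET_ACCESS_KEY")
```

boto3 picks these variables up automatically, so checking them early gives a much clearer error than a failed request later.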
profile_results.writer("s3").option(bucket_name=BUCKET_NAME).write()
And you've done it! Seems too good to be true. How would you know the profiles are there? 🤔 Well, let's investigate.
s3_client = boto3.client("s3")
objects = s3_client.list_objects(Bucket=BUCKET_NAME)
objects.get("Name", [])
'my_great_bucket'
objects.get("Contents", [])
[{'Key': 'profile_2022-05-26 20:04:07.484778.bin', 'LastModified': datetime.datetime(2022, 5, 26, 20, 4, 8, tzinfo=tzutc()), 'ETag': '"ae976af8f52532e3ab58bfe089d2bc44"', 'Size': 1115, 'StorageClass': 'STANDARD', 'Owner': {'DisplayName': 'webfile', 'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'}}]
And there we have it: our mocked s3 bucket has our profile written!
If we want to put our profile into a special "directory" - often referred to as a prefix - we can do the following instead:
profile_results.writer("s3").option(
bucket_name=BUCKET_NAME,
object_name=f"my_prefix/somewhere/profile_{profile_view.creation_timestamp}.bin"
).write()
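S3 object keys always use forward slashes, regardless of your operating system, so when building a prefixed object_name it is safer to join parts with posixpath than with os.path. A minimal sketch (build_object_name and the prefix names are just illustrative placeholders):

```python
import posixpath

def build_object_name(prefix: str, filename: str) -> str:
    # S3 keys are plain strings; "directories" are just slash-separated prefixes,
    # so join with posixpath (always "/") rather than os.path.
    return posixpath.join(prefix, filename)

print(build_object_name("my_prefix/somewhere", "my_profile.bin"))
```

The resulting string can then be passed as the object_name option, as in the cell above.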
objects = s3_client.list_objects(Bucket=BUCKET_NAME)
objects.get("Contents", [])
[{'Key': 'my_prefix/somewhere/profile_2022-05-26 20:04:07.484778.bin', 'LastModified': datetime.datetime(2022, 5, 26, 20, 4, 8, tzinfo=tzutc()), 'ETag': '"ae976af8f52532e3ab58bfe089d2bc44"', 'Size': 1115, 'StorageClass': 'STANDARD', 'Owner': {'DisplayName': 'webfile', 'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'}}, {'Key': 'profile_2022-05-26 20:04:07.484778.bin', 'LastModified': datetime.datetime(2022, 5, 26, 20, 4, 8, tzinfo=tzutc()), 'ETag': '"ae976af8f52532e3ab58bfe089d2bc44"', 'Size': 1115, 'StorageClass': 'STANDARD', 'Owner': {'DisplayName': 'webfile', 'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'}}]
Now let's close the connection to our mocked s3 object.
mocks3.stop()
And that's it, you have just written a profile to an s3 bucket! If you want to check out the other integrations we've made, please make sure to visit our other examples page.
Hopefully this tutorial will help you get started saving your profiles, and keep your data and ML pipelines always robust and responsible :)