from whylogs import get_or_create_session
import pandas as pd
%load_ext autoreload
%autoreload 2
For this example we will create a fake S3 server using the moto library. You should remove this section if you have your own bucket set up on AWS. Make sure your AWS configuration is set. By default this mock server runs in region "us-east-1".
BUCKET="super_awesome_bucket"
from moto import mock_s3
from moto.s3.responses import DEFAULT_REGION_NAME
import boto3
mocks3 = mock_s3()
mocks3.start()
res = boto3.resource('s3', region_name=DEFAULT_REGION_NAME)
res.create_bucket(Bucket=BUCKET)
s3.Bucket(name='super_awesome_bucket')
We can proceed in the usual way and load an example CSV dataset:
df = pd.read_csv("lending_club_1000.csv")
Setting up whylogs to save your data to S3 can be done in several ways. The simplest is to create a config file in which each data format is saved to a specific location, as shown below:
CONFIG = """
project: s3_example_project
pipeline: latest_results
verbose: false
writers:
- formats:
  - protobuf
  output_path: s3://super_awesome_bucket/
  path_template: $name/dataset_summary
  filename_template: dataset_summary
  type: s3
- formats:
  - flat
  output_path: s3://super_awesome_bucket/
  path_template: $name/dataset_summary
  filename_template: dataset_summary
  type: s3
- formats:
  - json
  output_path: s3://super_awesome_bucket/
  path_template: $name/dataset_summary
  filename_template: dataset_summary
  type: s3
"""
config_path = ".whylogs.yaml"
with open(config_path, "w") as file:
    file.write(CONFIG)
Checking the content:
%cat .whylogs.yaml
project: s3_example_project
pipeline: latest_results
verbose: false
writers:
- formats:
  - protobuf
  output_path: s3://super_awesome_bucket/
  path_template: $name/dataset_summary
  filename_template: dataset_summary
  type: s3
- formats:
  - flat
  output_path: s3://super_awesome_bucket/
  path_template: $name/dataset_summary
  filename_template: dataset_summary
  type: s3
- formats:
  - json
  output_path: s3://super_awesome_bucket/
  path_template: $name/dataset_summary
  filename_template: dataset_summary
  type: s3
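The `$name` placeholder in `path_template` is expanded with the dataset name passed to the logger. The `<string.Template object ...>` reprs in the session config printed later in this example suggest whylogs uses Python's `string.Template` for this; the sketch below reproduces that expansion with the stdlib (an illustration of the substitution, not the actual whylogs code):

```python
from string import Template

# path_template value from the config above; $name is replaced with the
# logger/dataset name at write time
path = Template("$name/dataset_summary").substitute(name="dataset_test_s3")
print(path)  # dataset_test_s3/dataset_summary
```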
If you use a custom name for your config file, or place it in a special location, you can load it with the helper functions `load_config` and `session_from_config`:
from whylogs.app.session import load_config, session_from_config
config = load_config(".whylogs.yaml")
session = session_from_config(config)
print(session.get_config().to_yaml())
cache: 1
pipeline: latest_results
project: s3_example_project
verbose: false
with_rotation_time: null
writers:
- filename_template: <string.Template object at 0x7fd4e78cfc40>
  formats:
  - OutputFormat.protobuf
  output_path: s3://super_awesome_bucket/
  path_template: <string.Template object at 0x7fd4e6564a90>
- filename_template: <string.Template object at 0x7fd4e78c6b20>
  formats:
  - OutputFormat.flat
  output_path: s3://super_awesome_bucket/
  path_template: <string.Template object at 0x7fd4e78c6580>
- filename_template: <string.Template object at 0x7fd4e7883550>
  formats:
  - OutputFormat.json
  output_path: s3://super_awesome_bucket/
  path_template: <string.Template object at 0x7fd4e78c8ac0>
Otherwise, if the file is located in your home directory or in the directory you are running from, you can simply call get_or_create_session():
session = get_or_create_session()
print(session.get_config().to_yaml())
cache: 1
pipeline: latest_results
project: s3_example_project
verbose: false
with_rotation_time: null
writers:
- filename_template: <string.Template object at 0x7fd4e7812c40>
  formats:
  - OutputFormat.protobuf
  output_path: s3://super_awesome_bucket/
  path_template: <string.Template object at 0x7fd4e4a80520>
- filename_template: <string.Template object at 0x7fd4e7812130>
  formats:
  - OutputFormat.flat
  output_path: s3://super_awesome_bucket/
  path_template: <string.Template object at 0x7fd4e7812250>
- filename_template: <string.Template object at 0x7fd4e7812bb0>
  formats:
  - OutputFormat.json
  output_path: s3://super_awesome_bucket/
  path_template: <string.Template object at 0x7fd4e78122e0>
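Roughly, this fallback amounts to searching a couple of standard locations for the config file. The sketch below illustrates that idea (the exact search order and file names whylogs uses are an assumption here; this is not the library's implementation):

```python
from pathlib import Path


def find_whylogs_config(filename: str = ".whylogs.yaml"):
    """Illustrative sketch: look for the config in the current working
    directory first, then in the home directory. This mirrors the behaviour
    described above but is not the actual whylogs code."""
    for directory in (Path.cwd(), Path.home()):
        candidate = directory / filename
        if candidate.exists():
            return candidate
    return None
```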
with session.logger("dataset_test_s3") as logger:
    logger.log_dataframe(df)
client = boto3.client('s3')
objects = client.list_objects(Bucket=BUCKET)
[obj["Key"] for obj in objects["Contents"]]
['dataset_test_s3/dataset_summary/flat_table/dataset_summary.csv',
 'dataset_test_s3/dataset_summary/freq_numbers/dataset_summary.json',
 'dataset_test_s3/dataset_summary/frequent_strings/dataset_summary.json',
 'dataset_test_s3/dataset_summary/histogram/dataset_summary.json',
 'dataset_test_s3/dataset_summary/json/dataset_summary.json',
 'dataset_test_s3/dataset_summary/protobuf/dataset_summary.bin']
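Each writer format lands under its own subfolder of the templated path. The sketch below reconstructs the keys from the listing using the templates in the config; the subfolder-to-extension mapping is inferred from the bucket listing above, not taken from the whylogs source:

```python
from string import Template

# Template values from the .whylogs.yaml config above
path_template = Template("$name/dataset_summary")
filename_template = Template("dataset_summary")
dataset_name = "dataset_test_s3"

# Subfolder/extension pairs inferred from the bucket listing (assumption)
formats = {
    "flat_table": ".csv",
    "freq_numbers": ".json",
    "frequent_strings": ".json",
    "histogram": ".json",
    "json": ".json",
    "protobuf": ".bin",
}

keys = [
    f"{path_template.substitute(name=dataset_name)}/{fmt}/"
    f"{filename_template.substitute(name=dataset_name)}{ext}"
    for fmt, ext in formats.items()
]
```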
You can configure where the data is saved either through a configuration file, as shown here, or by creating a custom writer.
mocks3.stop()