Artifacts is a new tool that provides Machine Learning Teams with a convenient way to log, version, and access data from all parts of the experimentation pipeline.
We’ve built Comet Artifacts to help Machine Learning teams solve the challenges of iterating on datasets and tracking pipelines where the data generated from one experiment is fed into another experiment.
An Artifact is composed of Artifact versions. Each Artifact has a name, a type, a description, tags, and metadata.
An Artifact version is a snapshot of files and assets, arranged in a folder-like logical structure. This snapshot can be tracked using metadata, a version number, tags, and aliases. A version tracks which experiments consumed it, and which experiment produced it.
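To make the relationship between an Artifact and its versions concrete, here is a minimal sketch of the structure described above, using plain dataclasses. The class names, field names, version string, and experiment names are hypothetical and chosen for illustration only; this is not how Comet stores Artifacts internally.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactVersion:
    """Hypothetical stand-in for one snapshot of an Artifact's files."""
    version: str                                   # illustrative version number
    files: dict                                    # logical path -> per-asset metadata
    aliases: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)
    produced_by: str = ""                          # experiment that logged this version
    consumed_by: list = field(default_factory=list)  # experiments that fetched it


@dataclass
class Artifact:
    """Hypothetical stand-in for an Artifact: shared attributes plus its versions."""
    name: str
    artifact_type: str
    description: str = ""
    tags: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)
    versions: list = field(default_factory=list)


# One Artifact holding a single version with two files in its snapshot
housing = Artifact(
    name="california", artifact_type="dataset", metadata={"task": "regression"}
)
housing.versions.append(
    ArtifactVersion(
        version="1.0.0",
        files={
            "train.csv": {"dataset_split": "train"},
            "test.csv": {"dataset_split": "test"},
        },
        aliases={"raw"},
        produced_by="upload-experiment",
    )
)

# A later experiment that downloads the version is recorded as a consumer
housing.versions[0].consumed_by.append("train-experiment")
```

The point of the sketch is the split of responsibilities: the name and type live on the Artifact, while files, aliases, and lineage live on each version.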
For a more complete overview, check out our full announcement.
Install Comet and initialize a Project to try out Artifacts
%pip install comet_ml pandas scikit-learn joblib
import comet_ml
comet_ml.init(project_name="guide-artifacts-demo")
For this example, we will use the California Housing Prices Dataset. Let's load the data and create a training and test set.
import pandas as pd
from sklearn.datasets import fetch_california_housing as load_data
from sklearn.model_selection import train_test_split
dataset = load_data()
X, y = dataset.data, dataset.target
featurecols = dataset.feature_names
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_df = pd.DataFrame(X_train, columns=featurecols)
test_df = pd.DataFrame(X_test, columns=featurecols)
train_df["target"] = y_train
test_df["target"] = y_test
train_df.head()
import os
os.makedirs("./datasets", exist_ok=True)
train_df.to_csv("./datasets/train.csv", index=False)
test_df.to_csv("./datasets/test.csv", index=False)
Let's track our dataset with an Artifact. To create an Artifact, you must provide a name for it. You can optionally provide more information about it, such as a type string that identifies what kind of Artifact you are uploading (a model, a dataset, etc.).
You can add alias identifiers to the Artifact, such as "test data" or "staging model". Later in this tutorial we will show you how Artifacts can be retrieved based on these aliases.
Finally, you can attach a metadata dictionary both to the individual data assets uploaded to an Artifact and to the Artifact itself. This dictionary can hold any additional information about your Artifact.
# Create a Comet Artifact
artifact = comet_ml.Artifact(
    name="california",
    artifact_type="dataset",
    aliases=["raw"],
    metadata={"task": "regression"},
)
# Add files to the Artifact
for split, asset in zip(
    ["train", "test"], ["./datasets/train.csv", "./datasets/test.csv"]
):
    artifact.add(asset, metadata={"dataset_stage": "raw", "dataset_split": split})
experiment = comet_ml.Experiment()
experiment.add_tag("upload")
experiment.log_artifact(artifact)
experiment.end()
In your Workspace, you will see an Artifacts tab where you can view the data that has been uploaded.
Clicking on the Artifact will bring up the Version information and associated Metadata.
Now that we have an Artifact tracking our dataset, let's move on to using it to train a model!
First, let's make a directory to save our Artifacts.
! mkdir ./artifacts
We can fetch the Artifact we need using its name, and either the version or alias.
experiment = comet_ml.Experiment()
experiment.add_tag("train")
# Fetch the Artifact object from Comet
name = "california"
version_or_alias = "raw"
artifact = experiment.get_artifact(name, version_or_alias=version_or_alias)
# Download Artifact
output_path = "./artifacts"
artifact.download(output_path, overwrite_strategy="PRESERVE")
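The lookup by name plus version or alias used above can be pictured with a small, dependency-free sketch. The `resolve` helper, the record layout, and the version numbers are hypothetical illustrations of the idea, not Comet's actual lookup logic.

```python
def resolve(versions, version_or_alias):
    """Return the record whose version string or alias matches the query.

    Exact version strings are checked first, then aliases (an assumption
    made for this sketch).
    """
    for record in versions:
        if record["version"] == version_or_alias:
            return record
    for record in versions:
        if version_or_alias in record["aliases"]:
            return record
    raise KeyError(f"no version or alias named {version_or_alias!r}")


# Illustrative versions of one Artifact, each carrying its own alias
versions = [
    {"version": "1.0.0", "aliases": {"raw"}},
    {"version": "2.0.0", "aliases": {"standard-scaled"}},
]

by_alias = resolve(versions, "raw")        # matched through the alias
by_version = resolve(versions, "2.0.0")    # matched through the version string
```

Fetching by alias keeps training code stable: the code asks for "raw" and does not need to know which version number currently carries that alias.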
from sklearn.linear_model import LinearRegression
from joblib import dump
# Load Data from Artifact
train_df = pd.read_csv("./artifacts/train.csv")
test_df = pd.read_csv("./artifacts/test.csv")
y_train = train_df.pop("target").values
X_train = train_df.values
y_test = test_df.pop("target").values
X_test = test_df.values
# Initialize Model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate Model
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
experiment.log_metric("train-score", train_score)
experiment.log_metric("test-score", test_score)
# Save Model
model_path = "./linear-model.pkl"
dump(model, model_path)
Let's log the model we just trained as an Artifact.
# Log Model as an Artifact
model_artifact = comet_ml.Artifact(
    "housing-model", artifact_type="model", aliases=["baseline"]
)
model_artifact.add(model_path)
experiment.log_artifact(model_artifact)
You can view the Artifacts produced and consumed by an Experiment in the "Assets and Artifacts" tab, under Artifacts. Toggle the direction selector to filter by Input, which refers to Artifacts that were consumed, or Output, which refers to Artifacts that were produced.
experiment.display(tab="assets")
experiment.end()
Our scores on the raw dataset were not that great. Why don't we scale the data and update our Artifact to reflect this?
# Scores aren't great, so let's scale the features
from sklearn.preprocessing import StandardScaler as Scaler
experiment = comet_ml.Experiment()
experiment.add_tag("upload")
X_scaler = Scaler().fit(X_train)
y_scaler = Scaler().fit(y_train.reshape(-1, 1))
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.reshape(-1, 1))
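# Build the DataFrames from the scaled arrays, not the raw ones
train_scaled_df = pd.DataFrame(X_train_scaled, columns=featurecols)
test_scaled_df = pd.DataFrame(X_test_scaled, columns=featurecols)
train_scaled_df["target"] = y_train_scaled.ravel()
test_scaled_df["target"] = y_test_scaled.ravel()
# index=False keeps the row index from being read back as an extra feature column
train_scaled_df.to_csv("./datasets/train-scaled.csv", index=False)
test_scaled_df.to_csv("./datasets/test-scaled.csv", index=False)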
# Update Artifact with Scaled Data
scaled_dataset_artifact = comet_ml.Artifact(
    "california",
    artifact_type="dataset",
    aliases=["standard-scaled"],
    metadata={"task": "regression"},
)
# Add files to the Artifact
for split, asset in zip(
    ["train", "test"], ["./datasets/train-scaled.csv", "./datasets/test-scaled.csv"]
):
    scaled_dataset_artifact.add(
        asset, metadata={"dataset_stage": "standard-scaled", "dataset_split": split}
    )
experiment.log_artifact(scaled_dataset_artifact)
experiment.end()
experiment = comet_ml.Experiment()
experiment.add_tag("train")
# Fetch the Artifact object from Comet
name = "california"
version_or_alias = "standard-scaled"
artifact = experiment.get_artifact(name, version_or_alias=version_or_alias)
# Download Artifact
output_path = "./artifacts"
artifact.download(output_path, overwrite_strategy="PRESERVE")
# Load Data from Artifact
train_df = pd.read_csv("./artifacts/train-scaled.csv")
test_df = pd.read_csv("./artifacts/test-scaled.csv")
y_train = train_df.pop("target").values
X_train = train_df.values
y_test = test_df.pop("target").values
X_test = test_df.values
# Initialize Model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate Model
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
experiment.log_metric("train-score", train_score)
experiment.log_metric("test-score", test_score)
experiment.end()
Doesn't look like the scaling helped :( Back to the drawing board!
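One likely reason the scores did not move: the score reported by `LinearRegression` is R², and for ordinary least squares with an intercept, R² is unchanged when the features and target are standardized. The sketch below checks this on a made-up, single-feature toy dataset using only the standard library; the `fit_r2` and `standardize` helpers and the data values are hypothetical and are not part of the tutorial's pipeline.

```python
from statistics import mean, pstdev


def fit_r2(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return R^2 on the same data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up toy data: one feature, roughly linear target
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 5.7]

raw_r2 = fit_r2(xs, ys)
scaled_r2 = fit_r2(standardize(xs), standardize(ys))

# The two scores agree up to floating-point error
print(abs(raw_r2 - scaled_r2) < 1e-9)
```

Scaling tends to matter for models that are sensitive to feature magnitude, such as regularized or distance-based models, so a different model or new features would be a more promising next step than rescaling.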