In this example, we train a simple XGBoost model and log the training results to Comet ML. We also save the resulting model checkpoints as artifacts.
Let's start with installing our dependencies:
!pip install -qU "ray[tune]" scikit-learn xgboost_ray comet_ml
Then we need some imports:
import ray
from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.train.xgboost import XGBoostTrainer
from ray.air.integrations.comet import CometLoggerCallback
We define a simple function that returns our training dataset as a Dataset:
def get_train_dataset() -> ray.data.Dataset:
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    return dataset
Now we define a simple training function. All the magic happens within the CometLoggerCallback:
CometLoggerCallback(
    project_name=comet_project,
    save_checkpoints=True,
)
It automatically logs all results to Comet ML and uploads the checkpoints as artifacts. It assumes you're logged in to Comet via an API key or your ~/.comet.config file.
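Since the callback relies on credentials being discoverable, a quick pre-flight check before launching the run can save a failed trial. The sketch below is a hypothetical helper, not part of Ray or Comet: it only checks the two places mentioned here (the `COMET_API_KEY` environment variable and a `.comet.config` file in the home directory), and Comet may look in other locations as well.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def comet_credentials_available(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Return True if an API key or a home-directory config file is present.

    Hypothetical helper for this example; not part of Ray or Comet.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    # A non-empty COMET_API_KEY counts as a configured key.
    has_api_key = bool(env.get("COMET_API_KEY"))
    # Only the file's existence is checked, not its contents.
    has_config_file = (home / ".comet.config").is_file()
    return has_api_key or has_config_file


if not comet_credentials_available():
    print("No Comet credentials found; set COMET_API_KEY or create ~/.comet.config.")
```

The `env` and `home` parameters exist only to make the check easy to exercise without touching the real environment.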
def train_model(train_dataset: ray.data.Dataset, comet_project: str) -> Result:
    """Train a simple XGBoost model and return the result."""
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=2),
        params={"tree_method": "auto"},
        label_column="target",
        datasets={"train": train_dataset},
        num_boost_round=10,
        run_config=RunConfig(
            callbacks=[
                # This is the part needed to enable logging to Comet ML.
                # It assumes Comet ML can find a valid API key (e.g. by setting
                # the ``COMET_API_KEY`` environment variable).
                CometLoggerCallback(
                    project_name=comet_project,
                    save_checkpoints=True,
                )
            ]
        ),
    )
    result = trainer.fit()
    return result
Let's kick off a run:
comet_project = "ray_air_example"
train_dataset = get_train_dataset()
result = train_model(train_dataset=train_dataset, comet_project=comet_project)
2022-05-19 15:19:17,237 INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265
Trial name | status | loc | iter | total time (s) | train-rmse |
---|---|---|---|---|---|
XGBoostTrainer_ac544_00000 | TERMINATED | 127.0.0.1:19852 | 10 | 9.7203 | 0.030717 |
COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting.
(raylet) 2022-05-19 15:19:21,584 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/krfricke/ray-air-example/ecd3726ca127497ba7386003a249fad6
COMET WARNING: Failed to add tag(s) None to the experiment
COMET WARNING: Empty mapping given to log_params({}); ignoring
(GBDTTrainable pid=19852) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.
(raylet) 2022-05-19 15:19:24,628 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331069
(GBDTTrainable pid=19852) 2022-05-19 15:19:25,961 INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
(raylet) 2022-05-19 15:19:26,830 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331069
(raylet) 2022-05-19 15:19:26,918 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=20 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:19:26,922 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=21 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:19:26,922 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=22 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:19:26,923 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=19 --runtime-env-hash=-2010331134
(GBDTTrainable pid=19852) 2022-05-19 15:19:29,272 INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=19876) [15:19:29] task [xgboost.ray]:4505889744 got new rank 1
(_RemoteRayXGBoostActor pid=19875) [15:19:29] task [xgboost.ray]:6941849424 got new rank 0
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 1.0.0 created
Result for XGBoostTrainer_ac544_00000:
  date: 2022-05-19_15-19-30
  done: false
  experiment_id: d3007bd6a2734b328fd90385485c5a8d
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  pid: 19852
  should_checkpoint: true
  time_since_restore: 6.529659032821655
  time_this_iter_s: 6.529659032821655
  time_total_s: 6.529659032821655
  timestamp: 1652969970
  timesteps_since_restore: 0
  train-rmse: 0.357284
  training_iteration: 1
  trial_id: ac544_00000
  warmup_time: 0.003961086273193359
COMET INFO: Scheduling the upload of 3 assets for a size of 2.48 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:1.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 2.0.0 created (previous was: 1.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 3.86 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:2.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 3.0.0 created (previous was: 2.0.0)
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:1.0.0' has been fully uploaded successfully
COMET INFO: Scheduling the upload of 3 assets for a size of 5.31 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:3.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 4.0.0 created (previous was: 3.0.0)
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:2.0.0' has been fully uploaded successfully
COMET INFO: Scheduling the upload of 3 assets for a size of 6.76 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:4.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 5.0.0 created (previous was: 4.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 8.21 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:3.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:5.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:4.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 6.0.0 created (previous was: 5.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 9.87 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:6.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:5.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 7.0.0 created (previous was: 6.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 11.46 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:7.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:6.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 8.0.0 created (previous was: 7.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 12.84 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:8.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:7.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 9.0.0 created (previous was: 8.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 14.36 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:9.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:8.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 10.0.0 created (previous was: 9.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 16.37 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:10.0.0' has started uploading asynchronously
(GBDTTrainable pid=19852) 2022-05-19 15:19:33,890 INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.96 seconds (4.61 pure XGBoost training time).
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:9.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 11.0.0 created (previous was: 10.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 16.39 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:11.0.0' has started uploading asynchronously
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: url : https://www.comet.ml/krfricke/ray-air-example/ecd3726ca127497ba7386003a249fad6
COMET INFO: Metrics [count] (min, max):
COMET INFO: iterations_since_restore [10] : (1, 10)
COMET INFO: time_since_restore [10] : (6.529659032821655, 9.720295906066895)
COMET INFO: time_this_iter_s [10] : (0.3124058246612549, 6.529659032821655)
COMET INFO: time_total_s [10] : (6.529659032821655, 9.720295906066895)
COMET INFO: timestamp [10] : (1652969970, 1652969973)
COMET INFO: timesteps_since_restore : 0
COMET INFO: train-rmse [10] : (0.030717, 0.357284)
COMET INFO: training_iteration [10] : (1, 10)
COMET INFO: warmup_time : 0.003961086273193359
COMET INFO: Others:
COMET INFO: Created from : Ray
COMET INFO: Name : XGBoostTrainer_ac544_00000
COMET INFO: experiment_id : d3007bd6a2734b328fd90385485c5a8d
COMET INFO: trial_id : ac544_00000
COMET INFO: System Information:
COMET INFO: date : 2022-05-19_15-19-33
COMET INFO: hostname : Kais-MacBook-Pro.local
COMET INFO: node_ip : 127.0.0.1
COMET INFO: pid : 19852
COMET INFO: Uploads:
COMET INFO: artifact assets : 33 (107.92 KB)
COMET INFO: artifacts : 11
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: notebook : 1
COMET INFO: source_code : 1
COMET INFO: ---------------------------
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: The Python SDK has 10800 seconds to finish before aborting...
COMET INFO: Still uploading 6 file(s), remaining 21.05 KB/116.69 KB
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:10.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:11.0.0' has been fully uploaded successfully
Result for XGBoostTrainer_ac544_00000:
  date: 2022-05-19_15-19-33
  done: true
  experiment_id: d3007bd6a2734b328fd90385485c5a8d
  experiment_tag: '0'
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  pid: 19852
  should_checkpoint: true
  time_since_restore: 9.720295906066895
  time_this_iter_s: 0.39761900901794434
  time_total_s: 9.720295906066895
  timestamp: 1652969973
  timesteps_since_restore: 0
  train-rmse: 0.030717
  training_iteration: 10
  trial_id: ac544_00000
  warmup_time: 0.003961086273193359
2022-05-19 15:19:35,621 INFO tune.py:753 -- Total run time: 15.75 seconds (14.94 seconds for the tuning loop).
Check out your Comet ML project to see the results!
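The same final metrics are also available locally, without opening the Comet UI. As a minimal sketch, the hypothetical helper `summarize_metrics` below formats a metrics dictionary for a quick sanity check. The sample values are copied from the final "Result for ..." block in the output above. In the notebook, the assumption is that you would pass `result.metrics` (the metrics dictionary on the returned `Result`) instead of the hard-coded dictionary.

```python
from typing import Any, Dict, Iterable


def summarize_metrics(metrics: Dict[str, Any], keys: Iterable[str]) -> str:
    """Format the selected metrics as 'name=value' pairs, skipping missing keys."""
    return ", ".join(f"{key}={metrics[key]}" for key in keys if key in metrics)


# Values copied from the final "Result for ..." block in the run output.
final_metrics = {
    "training_iteration": 10,
    "train-rmse": 0.030717,
    "time_total_s": 9.720295906066895,
}

print(summarize_metrics(final_metrics, ["training_iteration", "train-rmse"]))
```

Keys that are absent from the dictionary are skipped rather than raising, so the helper keeps working if a metric name differs between runs.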