(tune-mlflow-ref)=
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It currently offers four components, one of which is MLflow Tracking, used to record and query experiments (code, data, config, and results).
{image}
:align: center
:alt: MLflow
:height: 80px
:target: https://www.mlflow.org/
Ray Tune currently offers two lightweight integrations for MLflow Tracking.
One is the {ref}`MLflowLoggerCallback <tune-mlflow-logger>`, which automatically logs
metrics reported to Tune to the MLflow Tracking API.
The other one is the {ref}`setup_mlflow <tune-mlflow-setup>` function, which can be
used with the function API. It automatically
initializes the MLflow API with Tune's training information and creates a run for each Tune trial.
Then, within your training function, you can use
MLflow as you normally would, e.g. by calling `mlflow.log_metrics()`
or even `mlflow.autolog()` to log your training process.
{contents}
:backlinks: none
:local: true
In the following example we're going to use both of the above methods, namely the `MLflowLoggerCallback` and
the `setup_mlflow` function, to log metrics.
Let's start with a few crucial imports:
import os
import tempfile
import time
import mlflow
from ray import train, tune
from ray.air.integrations.mlflow import MLflowLoggerCallback, setup_mlflow
Next, let's define a simple training function (a Tune `Trainable`) that iteratively computes steps and evaluates
intermediate scores that we report to Tune.
def evaluation_fn(step, width, height):
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


def train_function(config):
    width, height = config["width"], config["height"]
    for step in range(config.get("steps", 100)):
        # Iterative training function - can be any arbitrary training procedure
        intermediate_score = evaluation_fn(step, width, height)
        # Feed the score back to Tune.
        train.report({"iterations": step, "mean_loss": intermediate_score})
        time.sleep(0.1)
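To get a feel for the scores this function reports, the following sketch evaluates `evaluation_fn` outside of Tune for one fixed configuration. The values `width=36` and `height=66` are taken from the first trial in the sample output further down this page; the final score should come out at roughly 7.24935, the loss shown for that trial.

```python
def evaluation_fn(step, width, height):
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


# Hyperparameters of the first trial in the sample output (width=36, height=66).
width, height = 36, 66

# The smoke test runs 5 steps, so evaluate steps 0 through 4.
scores = [evaluation_fn(step, width, height) for step in range(5)]

for step, score in enumerate(scores):
    print(f"step={step} mean_loss={score:.5f}")
```

The score shrinks with every step because the first term is the reciprocal of a value that grows with `step`, while the `height` term stays constant.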
Given an MLflow tracking URI, you can now simply pass the `MLflowLoggerCallback` in the `callbacks` argument of
your `RunConfig()`:
def tune_with_callback(mlflow_tracking_uri, finish_fast=False):
    tuner = tune.Tuner(
        train_function,
        tune_config=tune.TuneConfig(num_samples=5),
        run_config=train.RunConfig(
            name="mlflow",
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri=mlflow_tracking_uri,
                    experiment_name="mlflow_callback_example",
                    save_artifact=True,
                )
            ],
        ),
        param_space={
            "width": tune.randint(10, 100),
            "height": tune.randint(0, 100),
            "steps": 5 if finish_fast else 100,
        },
    )
    results = tuner.fit()
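Conceptually, a logger callback sits between Tune and the tracking backend: every result a trial reports is handed to the callback, which forwards the metrics to the store. The sketch below illustrates only that idea with plain Python. `InMemoryTracker`, `ToyLoggerCallback`, and `on_result` are hypothetical names invented for this illustration, and the reported values are made up; this is not how `MLflowLoggerCallback` is implemented.

```python
class InMemoryTracker:
    """Hypothetical stand-in for a tracking backend (not an MLflow class)."""

    def __init__(self):
        self.runs = {}  # run name -> list of (step, metrics) tuples

    def log_metrics(self, run_name, metrics, step):
        self.runs.setdefault(run_name, []).append((step, dict(metrics)))


class ToyLoggerCallback:
    """Hypothetical callback that forwards each reported result to a tracker."""

    def __init__(self, tracker):
        self.tracker = tracker

    def on_result(self, trial_name, result):
        # Keep only numeric entries; booleans such as "done" are skipped.
        metrics = {
            key: value
            for key, value in result.items()
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
        self.tracker.log_metrics(trial_name, metrics, step=result["iterations"])


tracker = InMemoryTracker()
callback = ToyLoggerCallback(tracker)

# Simulate two results reported by a single trial (illustrative values).
callback.on_result("trial_0", {"iterations": 0, "mean_loss": 16.6, "done": False})
callback.on_result("trial_0", {"iterations": 1, "mean_loss": 8.77, "done": False})

print(tracker.runs["trial_0"])
```

The point of the pattern is that the training function itself needs no MLflow code at all: it only reports to Tune, and the callback takes care of the rest.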
To use the `setup_mlflow` utility, you simply call this function in your training function.
Note that we also use `mlflow.log_metrics(...)` to log metrics to MLflow.
Otherwise, this version of our training function is identical to its original.
def train_function_mlflow(config):
    tracking_uri = config.pop("tracking_uri", None)
    setup_mlflow(
        config,
        experiment_name="setup_mlflow_example",
        tracking_uri=tracking_uri,
    )

    # Hyperparameters
    width, height = config["width"], config["height"]

    for step in range(config.get("steps", 100)):
        # Iterative training function - can be any arbitrary training procedure
        intermediate_score = evaluation_fn(step, width, height)
        # Log the metrics to mlflow
        mlflow.log_metrics(dict(mean_loss=intermediate_score), step=step)
        # Feed the score back to Tune.
        train.report({"iterations": step, "mean_loss": intermediate_score})
        time.sleep(0.1)
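One detail in the function above is worth a closer look: `config.pop("tracking_uri", None)` removes the URI from the config before the config is handed to `setup_mlflow`. A plausible reading, offered here as an interpretation rather than a documented guarantee, is that this keeps the URI out of the hyperparameters recorded for the run. The sketch below shows only the dictionary behavior, using a made-up config and a hypothetical path.

```python
# Made-up trial config; "/tmp/mlruns" is a hypothetical path for illustration.
config = {"width": 36, "height": 66, "steps": 5, "tracking_uri": "/tmp/mlruns"}

# pop() returns the value and removes the key from the dict in place.
tracking_uri = config.pop("tracking_uri", None)
print(tracking_uri)
print(sorted(config))

# With the default of None, popping a missing key does not raise KeyError.
missing = config.pop("tracking_uri", None)
print(missing)
```

Because of the `None` default, the same training function also works when no `tracking_uri` entry is present in `param_space`.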
With this new objective function ready, you can now create a Tune run with it as follows:
def tune_with_setup(mlflow_tracking_uri, finish_fast=False):
    # Set the experiment, or create a new one if it does not exist yet.
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    mlflow.set_experiment(experiment_name="setup_mlflow_example")

    tuner = tune.Tuner(
        train_function_mlflow,
        tune_config=tune.TuneConfig(num_samples=5),
        run_config=train.RunConfig(
            name="mlflow",
        ),
        param_space={
            "width": tune.randint(10, 100),
            "height": tune.randint(0, 100),
            "steps": 5 if finish_fast else 100,
            "tracking_uri": mlflow.get_tracking_uri(),
        },
    )
    results = tuner.fit()
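If you happen to have an MLflow tracking URI, you can set it below in the `mlflow_tracking_uri`
variable and set `smoke_test=False`.
Otherwise, you can just run a quick test of the `tune_with_callback`
and `tune_with_setup` functions without an MLflow tracking server; in that case the runs are written to a local `mlruns` directory inside your system's temporary directory.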
smoke_test = True

if smoke_test:
    mlflow_tracking_uri = os.path.join(tempfile.gettempdir(), "mlruns")
else:
    mlflow_tracking_uri = "<MLFLOW_TRACKING_URI>"

tune_with_callback(mlflow_tracking_uri, finish_fast=smoke_test)
if not smoke_test:
    df = mlflow.search_runs(
        [mlflow.get_experiment_by_name("mlflow_callback_example").experiment_id]
    )
    print(df)

tune_with_setup(mlflow_tracking_uri, finish_fast=smoke_test)
if not smoke_test:
    df = mlflow.search_runs(
        [mlflow.get_experiment_by_name("setup_mlflow_example").experiment_id]
    )
    print(df)
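The URI selection above can be wrapped in a small helper that also guards against leaving the `<MLFLOW_TRACKING_URI>` placeholder in place after switching off the smoke test. This is a minimal sketch; `resolve_tracking_uri` and `PLACEHOLDER` are hypothetical names introduced here, not part of Ray or MLflow.

```python
import os
import tempfile

# The placeholder string used in the example above.
PLACEHOLDER = "<MLFLOW_TRACKING_URI>"


def resolve_tracking_uri(smoke_test, remote_uri=PLACEHOLDER):
    """Hypothetical helper: choose the tracking location for a run."""
    if smoke_test:
        # Same local directory the example uses for quick tests.
        return os.path.join(tempfile.gettempdir(), "mlruns")
    if remote_uri == PLACEHOLDER:
        raise ValueError("Set a real MLflow tracking URI before disabling smoke_test.")
    return remote_uri


print(resolve_tracking_uri(smoke_test=True))
```

Failing early with a clear message is usually preferable to letting the placeholder reach MLflow, where the resulting error would be harder to trace back to its cause.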
2022-12-22 10:37:53,580 INFO worker.py:1542 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
Current time: | 2022-12-22 10:38:04 |
Running for: | 00:00:06.73 |
Memory: | 10.4/16.0 GiB |
Trial name | status | loc | height | width | loss | iter | total time (s) | iterations | neg_mean_loss |
---|---|---|---|---|---|---|---|---|---|
train_function_b275b_00000 | TERMINATED | 127.0.0.1:801 | 66 | 36 | 7.24935 | 5 | 0.587302 | 4 | -7.24935 |
train_function_b275b_00001 | TERMINATED | 127.0.0.1:813 | 33 | 35 | 3.96667 | 5 | 0.507423 | 4 | -3.96667 |
train_function_b275b_00002 | TERMINATED | 127.0.0.1:814 | 75 | 29 | 8.29365 | 5 | 0.518995 | 4 | -8.29365 |
train_function_b275b_00003 | TERMINATED | 127.0.0.1:815 | 28 | 63 | 3.18168 | 5 | 0.567739 | 4 | -3.18168 |
train_function_b275b_00004 | TERMINATED | 127.0.0.1:816 | 20 | 18 | 3.21951 | 5 | 0.526536 | 4 | -3.21951 |
Trial name | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations | iterations_since_restore | mean_loss | neg_mean_loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train_function_b275b_00000 | 2022-12-22_10-38-01 | True | | 28feaa4dd8ab4edab810e8109e77502e | 0_height=66,width=36 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 7.24935 | -7.24935 | 127.0.0.1 | 801 | 0.587302 | 0.126818 | 0.587302 | 1671705481 | 0 | | 5 | b275b_00000 | 0.00293493 |
train_function_b275b_00001 | 2022-12-22_10-38-04 | True | | 245010d0c3d0439ebfb664764ae9db3c | 1_height=33,width=35 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 3.96667 | -3.96667 | 127.0.0.1 | 813 | 0.507423 | 0.122086 | 0.507423 | 1671705484 | 0 | | 5 | b275b_00001 | 0.00553799 |
train_function_b275b_00002 | 2022-12-22_10-38-04 | True | | 898afbf9b906448c980f399c72a2324c | 2_height=75,width=29 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 8.29365 | -8.29365 | 127.0.0.1 | 814 | 0.518995 | 0.123554 | 0.518995 | 1671705484 | 0 | | 5 | b275b_00002 | 0.0040431 |
train_function_b275b_00003 | 2022-12-22_10-38-04 | True | | 03a4476f82734642b6ab0a5040ca58f8 | 3_height=28,width=63 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 3.18168 | -3.18168 | 127.0.0.1 | 815 | 0.567739 | 0.125471 | 0.567739 | 1671705484 | 0 | | 5 | b275b_00003 | 0.00406194 |
train_function_b275b_00004 | 2022-12-22_10-38-04 | True | | ff8c7c55ce6e404f9b0552c17f7a0c40 | 4_height=20,width=18 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 3.21951 | -3.21951 | 127.0.0.1 | 816 | 0.526536 | 0.123327 | 0.526536 | 1671705484 | 0 | | 5 | b275b_00004 | 0.00332022 |
2022-12-22 10:38:04,477 INFO tune.py:772 -- Total run time: 7.99 seconds (6.71 seconds for the tuning loop).
Current time: | 2022-12-22 10:38:11 |
Running for: | 00:00:07.00 |
Memory: | 10.7/16.0 GiB |
Trial name | status | loc | height | width | loss | iter | total time (s) | iterations | neg_mean_loss |
---|---|---|---|---|---|---|---|---|---|
train_function_mlflow_b73bd_00000 | TERMINATED | 127.0.0.1:842 | 37 | 68 | 4.05461 | 5 | 0.750435 | 4 | -4.05461 |
train_function_mlflow_b73bd_00001 | TERMINATED | 127.0.0.1:853 | 50 | 20 | 6.11111 | 5 | 0.652748 | 4 | -6.11111 |
train_function_mlflow_b73bd_00002 | TERMINATED | 127.0.0.1:854 | 38 | 83 | 4.0924 | 5 | 0.6513 | 4 | -4.0924 |
train_function_mlflow_b73bd_00003 | TERMINATED | 127.0.0.1:855 | 15 | 93 | 1.76178 | 5 | 0.650586 | 4 | -1.76178 |
train_function_mlflow_b73bd_00004 | TERMINATED | 127.0.0.1:856 | 75 | 43 | 8.04945 | 5 | 0.656046 | 4 | -8.04945 |
Trial name | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations | iterations_since_restore | mean_loss | neg_mean_loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train_function_mlflow_b73bd_00000 | 2022-12-22_10-38-08 | True | | 62703cfe82e54d74972377fbb525b000 | 0_height=37,width=68 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 4.05461 | -4.05461 | 127.0.0.1 | 842 | 0.750435 | 0.108625 | 0.750435 | 1671705488 | 0 | | 5 | b73bd_00000 | 0.0030272 |
train_function_mlflow_b73bd_00001 | 2022-12-22_10-38-11 | True | | 03ea89852115465392ed318db8021614 | 1_height=50,width=20 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 6.11111 | -6.11111 | 127.0.0.1 | 853 | 0.652748 | 0.110796 | 0.652748 | 1671705491 | 0 | | 5 | b73bd_00001 | 0.00303078 |
train_function_mlflow_b73bd_00002 | 2022-12-22_10-38-11 | True | | 3731fc2966f9453ba58c650d89035ab4 | 2_height=38,width=83 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 4.0924 | -4.0924 | 127.0.0.1 | 854 | 0.6513 | 0.108578 | 0.6513 | 1671705491 | 0 | | 5 | b73bd_00002 | 0.00310016 |
train_function_mlflow_b73bd_00003 | 2022-12-22_10-38-11 | True | | fb35841742b348b9912d10203c730f1e | 3_height=15,width=93 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 1.76178 | -1.76178 | 127.0.0.1 | 855 | 0.650586 | 0.109097 | 0.650586 | 1671705491 | 0 | | 5 | b73bd_00003 | 0.0576491 |
train_function_mlflow_b73bd_00004 | 2022-12-22_10-38-11 | True | | 6d3cbf9ecc3446369e607ff78c67bc29 | 4_height=75,width=43 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 8.04945 | -8.04945 | 127.0.0.1 | 856 | 0.656046 | 0.109869 | 0.656046 | 1671705491 | 0 | | 5 | b73bd_00004 | 0.00265694 |
2022-12-22 10:38:11,514 INFO tune.py:772 -- Total run time: 7.01 seconds (6.98 seconds for the tuning loop).
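As a quick way to read the second status table above, the following sketch picks the trial with the lowest final `mean_loss`. The numbers are copied by hand from that table; in a real run you would query the object returned by `tuner.fit()` or the MLflow runs instead of hard-coding values.

```python
# Final mean_loss per trial, copied from the status table of the setup_mlflow run.
final_losses = {
    "train_function_mlflow_b73bd_00000": 4.05461,
    "train_function_mlflow_b73bd_00001": 6.11111,
    "train_function_mlflow_b73bd_00002": 4.0924,
    "train_function_mlflow_b73bd_00003": 1.76178,
    "train_function_mlflow_b73bd_00004": 8.04945,
}

# Lower loss is better, so take the minimum over the recorded values.
best_trial = min(final_losses, key=final_losses.get)
print(best_trial, final_losses[best_trial])
```

The winner is the trial with `height=15` and `width=93`, which fits the shape of `evaluation_fn`: a small `height` lowers the constant term and a large `width` makes the reciprocal term decay faster.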
This completes our Tune and MLflow walk-through. In the following sections you can find more details on the API of the Tune-MLflow integration.
You can also check out {doc}`here </tune/examples/includes/mlflow_ptl_example>` for an example of how you can
leverage MLflow auto-logging, in this case with PyTorch Lightning.
(tune-mlflow-logger)=
{eval-rst}
.. autoclass:: ray.air.integrations.mlflow.MLflowLoggerCallback
:noindex:
(tune-mlflow-setup)=
{eval-rst}
.. autofunction:: ray.air.integrations.mlflow.setup_mlflow
:noindex:
/tune/examples/includes/mlflow_ptl_example
: Example for using MLflow and PyTorch Lightning with Ray Tune.