# Using MLflow with Tune

(tune-mlflow-ref)=

MLflow is an open source platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It currently offers four components, one of which is MLflow Tracking, used to record and query experiments: code, data, config, and results.

```{image}
:align: center
:alt: MLflow
:height: 80px
:target: https://www.mlflow.org/
```

Ray Tune currently offers two lightweight integrations for MLflow Tracking. One is the {ref}`MLflowLoggerCallback <tune-mlflow-logger>`, which automatically logs the metrics reported to Tune to the MLflow Tracking API.

The other one is the {ref}`setup_mlflow <tune-mlflow-setup>` function, which can be used with the function API. It automatically initializes the MLflow API with Tune's training information and creates a run for each Tune trial. Within your training function, you can then use MLflow as you normally would, e.g. calling `mlflow.log_metrics()` or even `mlflow.autolog()` to log your training process.

```{contents}
:backlinks: none
:local: true
```

## Running an MLflow Example

In the following example we're going to use both of the above methods, namely the `MLflowLoggerCallback` and the `setup_mlflow` function, to log metrics. Let's start with a few necessary imports:

```python
import os
import tempfile
import time

import mlflow

from ray import train, tune
from ray.air.integrations.mlflow import MLflowLoggerCallback, setup_mlflow
```

Next, let's define a simple training function (a Tune `Trainable`) that iterates over a number of steps, evaluates an intermediate score at each step, and reports it to Tune.

```python
def evaluation_fn(step, width, height):
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


def train_function(config):
    width, height = config["width"], config["height"]

    for step in range(config.get("steps", 100)):
        # Iterative training function - can be any arbitrary training procedure
        intermediate_score = evaluation_fn(step, width, height)
        # Feed the score back to Tune.
        train.report({"iterations": step, "mean_loss": intermediate_score})
        time.sleep(0.1)
```
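Because `train_function` only talks to Tune through `train.report`, its loop can be exercised without Ray by handing it a different reporting function. The following is a minimal sketch under that assumption: `train_function_with_reporter` and `run_without_tune` are hypothetical helpers (not part of Ray), and the `time.sleep` call is left out. It shows what a single trial reports when `steps=5`.

```python
from typing import Callable, Dict, List


def evaluation_fn(step, width, height):
    # Same toy objective as above.
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


def train_function_with_reporter(config: Dict, report: Callable[[Dict], None]) -> None:
    # Hypothetical variant of `train_function` that receives the reporting
    # function as a parameter instead of calling `train.report` directly.
    width, height = config["width"], config["height"]
    for step in range(config.get("steps", 100)):
        intermediate_score = evaluation_fn(step, width, height)
        report({"iterations": step, "mean_loss": intermediate_score})


def run_without_tune(config: Dict) -> List[Dict]:
    # Collect every reported dict in a list instead of sending it to Tune.
    reports: List[Dict] = []
    train_function_with_reporter(config, reports.append)
    return reports


reports = run_without_tune({"width": 50, "height": 10, "steps": 5})
for r in reports:
    print(r)
print(len(reports))  # 5: one report per step
```

Tune counts each `train.report` call as one training iteration, so a trial with `steps=5` ends with `training_iteration` (the `iter` column in the trial tables further down) at 5, while the user-defined `iterations` metric holds the last `step` index, 4.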

Given an MLflow tracking URI, you can now simply pass the `MLflowLoggerCallback` in the `callbacks` argument of your `RunConfig()`:

```python
def tune_with_callback(mlflow_tracking_uri, finish_fast=False):
    tuner = tune.Tuner(
        train_function,
        tune_config=tune.TuneConfig(num_samples=5),
        run_config=train.RunConfig(
            name="mlflow",
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri=mlflow_tracking_uri,
                    experiment_name="mlflow_callback_example",
                    save_artifact=True,
                )
            ],
        ),
        param_space={
            "width": tune.randint(10, 100),
            "height": tune.randint(0, 100),
            "steps": 5 if finish_fast else 100,
        },
    )
    results = tuner.fit()
```
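Each of the `num_samples=5` trials gets its own config drawn from `param_space`: `tune.randint(lower, upper)` samples an integer uniformly with the lower bound inclusive and the upper bound exclusive, while plain values such as `steps` are passed through unchanged. The snippet below is a standard-library stand-in for that sampling, written only to illustrate the shape of the resulting configs; `sample_configs` is a hypothetical helper, and Tune's actual search algorithm and random streams differ.

```python
import random
from typing import Dict, List


def sample_configs(num_samples: int, finish_fast: bool, seed: int = 0) -> List[Dict]:
    """Hypothetical stand-in for how Tune expands `param_space` into trial configs."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    configs = []
    for _ in range(num_samples):
        configs.append(
            {
                # random.randrange(lo, hi) mirrors tune.randint(lo, hi): lo <= x < hi
                "width": rng.randrange(10, 100),
                "height": rng.randrange(0, 100),
                # Values that are not search spaces are copied into every config as-is.
                "steps": 5 if finish_fast else 100,
            }
        )
    return configs


for config in sample_configs(num_samples=5, finish_fast=True):
    print(config)
```

The widths and heights in the trial tables further down all fall inside these ranges.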

To use the `setup_mlflow` utility, you simply call this function in your training function. Note that we also use `mlflow.log_metrics(...)` to log metrics to MLflow. Otherwise, this version of the training function is identical to the original.

```python
def train_function_mlflow(config):
    tracking_uri = config.pop("tracking_uri", None)
    setup_mlflow(
        config,
        experiment_name="setup_mlflow_example",
        tracking_uri=tracking_uri,
    )

    # Hyperparameters
    width, height = config["width"], config["height"]

    for step in range(config.get("steps", 100)):
        # Iterative training function - can be any arbitrary training procedure
        intermediate_score = evaluation_fn(step, width, height)
        # Log the metrics to mlflow
        mlflow.log_metrics(dict(mean_loss=intermediate_score), step=step)
        # Feed the score back to Tune.
        train.report({"iterations": step, "mean_loss": intermediate_score})
        time.sleep(0.1)
```
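`train_function_mlflow` touches MLflow and Tune only through `setup_mlflow`, `mlflow.log_metrics`, and `train.report`, so its call pattern can be checked without either library installed by substituting `unittest.mock.MagicMock` objects for those three names. This test-style sketch is a stand-in, not how the integration actually runs. It shows that `tracking_uri` is removed from `config` before `setup_mlflow` receives it (presumably so the URI is not recorded alongside the hyperparameters; see the `setup_mlflow` API reference below for how `config` is used), and that every step is logged to both MLflow and Tune.

```python
from unittest.mock import MagicMock

# Assumption: MagicMock stand-ins replace the real `mlflow` module, `ray.train`,
# and `setup_mlflow`, so this sketch runs without Ray or MLflow installed.
mlflow = MagicMock()
train = MagicMock()
setup_mlflow = MagicMock()


def evaluation_fn(step, width, height):
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


def train_function_mlflow(config):
    # Same body as above, minus the `time.sleep` call.
    tracking_uri = config.pop("tracking_uri", None)
    setup_mlflow(
        config,
        experiment_name="setup_mlflow_example",
        tracking_uri=tracking_uri,
    )
    width, height = config["width"], config["height"]
    for step in range(config.get("steps", 100)):
        intermediate_score = evaluation_fn(step, width, height)
        mlflow.log_metrics(dict(mean_loss=intermediate_score), step=step)
        train.report({"iterations": step, "mean_loss": intermediate_score})


train_function_mlflow({"width": 36, "height": 66, "steps": 3, "tracking_uri": "/tmp/mlruns"})

args, kwargs = setup_mlflow.call_args
print("tracking_uri" in args[0])  # False: popped before setup_mlflow is called
print(kwargs["tracking_uri"])  # /tmp/mlruns
print(mlflow.log_metrics.call_count, train.report.call_count)  # 3 3
```

Passing `step=step` to `mlflow.log_metrics` keeps MLflow's metric history indexed by the same loop step that appears as `iterations` in Tune.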

With this new objective function ready, you can now create a Tune run with it as follows:

```python
def tune_with_setup(mlflow_tracking_uri, finish_fast=False):
    # Set the experiment, or create a new one if it does not exist yet.
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    mlflow.set_experiment(experiment_name="setup_mlflow_example")

    tuner = tune.Tuner(
        train_function_mlflow,
        tune_config=tune.TuneConfig(num_samples=5),
        run_config=train.RunConfig(
            name="mlflow",
        ),
        param_space={
            "width": tune.randint(10, 100),
            "height": tune.randint(0, 100),
            "steps": 5 if finish_fast else 100,
            "tracking_uri": mlflow.get_tracking_uri(),
        },
    )
    results = tuner.fit()
```

If you happen to have an MLflow tracking URI, you can set it below in the `mlflow_tracking_uri` variable and set `smoke_test=False`. Otherwise, you can just run a quick test of the `tune_with_callback` and `tune_with_setup` functions, which logs to a local `mlruns` directory in your system's temporary folder instead of your own tracking server.

```python
smoke_test = True

if smoke_test:
    mlflow_tracking_uri = os.path.join(tempfile.gettempdir(), "mlruns")
else:
    mlflow_tracking_uri = "<MLFLOW_TRACKING_URI>"

tune_with_callback(mlflow_tracking_uri, finish_fast=smoke_test)
if not smoke_test:
    df = mlflow.search_runs(
        [mlflow.get_experiment_by_name("mlflow_callback_example").experiment_id]
    )
    print(df)

tune_with_setup(mlflow_tracking_uri, finish_fast=smoke_test)
if not smoke_test:
    df = mlflow.search_runs(
        [mlflow.get_experiment_by_name("setup_mlflow_example").experiment_id]
    )
    print(df)
```
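In the smoke test, `mlflow_tracking_uri` is a plain local directory path. MLflow's tracking documentation also describes local directories written with the `file:` scheme; if you prefer that explicit form, the standard library can build it. This is an optional variation, not something the example requires.

```python
import os
import tempfile
from pathlib import Path

# Same location the smoke test uses: <system temp dir>/mlruns
mlflow_tracking_uri = os.path.join(tempfile.gettempdir(), "mlruns")

# Path.as_uri() requires an absolute path; resolve() guarantees one
# (and also follows symlinks, e.g. /tmp -> /private/tmp on macOS).
file_uri = Path(mlflow_tracking_uri).resolve().as_uri()
print(file_uri)
```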

2022-12-22 10:37:53,580	INFO worker.py:1542 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

Tune Status

Current time:	2022-12-22 10:38:04
Running for:	00:00:06.73
Memory:	10.4/16.0 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.03 GiB heap, 0.0/2.0 GiB objects

Trial Status

| Trial name | status | loc | height | width | loss | iter | total time (s) | iterations | neg_mean_loss |
|---|---|---|---|---|---|---|---|---|---|
| train_function_b275b_00000 | TERMINATED | 127.0.0.1:801 | 66 | 36 | 7.24935 | 5 | 0.587302 | 4 | -7.24935 |
| train_function_b275b_00001 | TERMINATED | 127.0.0.1:813 | 33 | 35 | 3.96667 | 5 | 0.507423 | 4 | -3.96667 |
| train_function_b275b_00002 | TERMINATED | 127.0.0.1:814 | 75 | 29 | 8.29365 | 5 | 0.518995 | 4 | -8.29365 |
| train_function_b275b_00003 | TERMINATED | 127.0.0.1:815 | 28 | 63 | 3.18168 | 5 | 0.567739 | 4 | -3.18168 |
| train_function_b275b_00004 | TERMINATED | 127.0.0.1:816 | 20 | 18 | 3.21951 | 5 | 0.526536 | 4 | -3.21951 |

Trial Progress

| Trial name | date | done | experiment_id | experiment_tag | hostname | iterations | iterations_since_restore | mean_loss | neg_mean_loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | training_iteration | trial_id | warmup_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train_function_b275b_00000 | 2022-12-22_10-38-01 | True | 28feaa4dd8ab4edab810e8109e77502e | 0_height=66,width=36 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 7.24935 | -7.24935 | 127.0.0.1 | 801 | 0.587302 | 0.126818 | 0.587302 | 1671705481 | 5 | b275b_00000 | 0.00293493 |
| train_function_b275b_00001 | 2022-12-22_10-38-04 | True | 245010d0c3d0439ebfb664764ae9db3c | 1_height=33,width=35 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 3.96667 | -3.96667 | 127.0.0.1 | 813 | 0.507423 | 0.122086 | 0.507423 | 1671705484 | 5 | b275b_00001 | 0.00553799 |
| train_function_b275b_00002 | 2022-12-22_10-38-04 | True | 898afbf9b906448c980f399c72a2324c | 2_height=75,width=29 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 8.29365 | -8.29365 | 127.0.0.1 | 814 | 0.518995 | 0.123554 | 0.518995 | 1671705484 | 5 | b275b_00002 | 0.0040431 |
| train_function_b275b_00003 | 2022-12-22_10-38-04 | True | 03a4476f82734642b6ab0a5040ca58f8 | 3_height=28,width=63 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 3.18168 | -3.18168 | 127.0.0.1 | 815 | 0.567739 | 0.125471 | 0.567739 | 1671705484 | 5 | b275b_00003 | 0.00406194 |
| train_function_b275b_00004 | 2022-12-22_10-38-04 | True | ff8c7c55ce6e404f9b0552c17f7a0c40 | 4_height=20,width=18 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 3.21951 | -3.21951 | 127.0.0.1 | 816 | 0.526536 | 0.123327 | 0.526536 | 1671705484 | 5 | b275b_00004 | 0.00332022 |

2022-12-22 10:38:04,477	INFO tune.py:772 -- Total run time: 7.99 seconds (6.71 seconds for the tuning loop).

Tune Status

Current time:	2022-12-22 10:38:11
Running for:	00:00:07.00
Memory:	10.7/16.0 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.03 GiB heap, 0.0/2.0 GiB objects

Trial Status

| Trial name | status | loc | height | width | loss | iter | total time (s) | iterations | neg_mean_loss |
|---|---|---|---|---|---|---|---|---|---|
| train_function_mlflow_b73bd_00000 | TERMINATED | 127.0.0.1:842 | 37 | 68 | 4.05461 | 5 | 0.750435 | 4 | -4.05461 |
| train_function_mlflow_b73bd_00001 | TERMINATED | 127.0.0.1:853 | 50 | 20 | 6.11111 | 5 | 0.652748 | 4 | -6.11111 |
| train_function_mlflow_b73bd_00002 | TERMINATED | 127.0.0.1:854 | 38 | 83 | 4.0924 | 5 | 0.6513 | 4 | -4.0924 |
| train_function_mlflow_b73bd_00003 | TERMINATED | 127.0.0.1:855 | 15 | 93 | 1.76178 | 5 | 0.650586 | 4 | -1.76178 |
| train_function_mlflow_b73bd_00004 | TERMINATED | 127.0.0.1:856 | 75 | 43 | 8.04945 | 5 | 0.656046 | 4 | -8.04945 |

Trial Progress

| Trial name | date | done | experiment_id | experiment_tag | hostname | iterations | iterations_since_restore | mean_loss | neg_mean_loss | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | training_iteration | trial_id | warmup_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train_function_mlflow_b73bd_00000 | 2022-12-22_10-38-08 | True | 62703cfe82e54d74972377fbb525b000 | 0_height=37,width=68 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 4.05461 | -4.05461 | 127.0.0.1 | 842 | 0.750435 | 0.108625 | 0.750435 | 1671705488 | 5 | b73bd_00000 | 0.0030272 |
| train_function_mlflow_b73bd_00001 | 2022-12-22_10-38-11 | True | 03ea89852115465392ed318db8021614 | 1_height=50,width=20 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 6.11111 | -6.11111 | 127.0.0.1 | 853 | 0.652748 | 0.110796 | 0.652748 | 1671705491 | 5 | b73bd_00001 | 0.00303078 |
| train_function_mlflow_b73bd_00002 | 2022-12-22_10-38-11 | True | 3731fc2966f9453ba58c650d89035ab4 | 2_height=38,width=83 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 4.0924 | -4.0924 | 127.0.0.1 | 854 | 0.6513 | 0.108578 | 0.6513 | 1671705491 | 5 | b73bd_00002 | 0.00310016 |
| train_function_mlflow_b73bd_00003 | 2022-12-22_10-38-11 | True | fb35841742b348b9912d10203c730f1e | 3_height=15,width=93 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 1.76178 | -1.76178 | 127.0.0.1 | 855 | 0.650586 | 0.109097 | 0.650586 | 1671705491 | 5 | b73bd_00003 | 0.0576491 |
| train_function_mlflow_b73bd_00004 | 2022-12-22_10-38-11 | True | 6d3cbf9ecc3446369e607ff78c67bc29 | 4_height=75,width=43 | kais-macbook-pro.anyscale.com.beta.tailscale.net | 4 | 5 | 8.04945 | -8.04945 | 127.0.0.1 | 856 | 0.656046 | 0.109869 | 0.656046 | 1671705491 | 5 | b73bd_00004 | 0.00265694 |

2022-12-22 10:38:11,514	INFO tune.py:772 -- Total run time: 7.01 seconds (6.98 seconds for the tuning loop).
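Since the smoke test runs with `steps=5`, the `loss` column in the trial tables above is the `mean_loss` of each trial's final report, at `step=4`. As a quick sanity check, the sketch below recomputes those values from each trial's `height` and `width` with the same `evaluation_fn`; the tuples are copied from the two Trial Status tables.

```python
def evaluation_fn(step, width, height):
    # Same toy objective as in the training functions above.
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


# (trial name, height, width, loss) copied from the Trial Status tables above.
reported = [
    ("train_function_b275b_00000", 66, 36, 7.24935),
    ("train_function_b275b_00001", 33, 35, 3.96667),
    ("train_function_b275b_00002", 75, 29, 8.29365),
    ("train_function_b275b_00003", 28, 63, 3.18168),
    ("train_function_b275b_00004", 20, 18, 3.21951),
    ("train_function_mlflow_b73bd_00000", 37, 68, 4.05461),
    ("train_function_mlflow_b73bd_00001", 50, 20, 6.11111),
    ("train_function_mlflow_b73bd_00002", 38, 83, 4.0924),
    ("train_function_mlflow_b73bd_00003", 15, 93, 1.76178),
    ("train_function_mlflow_b73bd_00004", 75, 43, 8.04945),
]

LAST_STEP = 5 - 1  # steps=5 in the smoke test, so the final step index is 4

for name, height, width, loss in reported:
    recomputed = evaluation_fn(LAST_STEP, width, height)
    # The tables show about six significant digits, so compare with a tolerance.
    match = abs(recomputed - loss) < 1e-5
    print(f"{name}: recomputed={recomputed:.5f} table={loss} match={match}")
```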

This completes our Tune and MLflow walk-through. In the following sections you can find more details on the API of the Tune-MLflow integration.

## MLflow AutoLogging

See {doc}`this example </tune/examples/includes/mlflow_ptl_example>` for how to leverage MLflow autologging, in this case with PyTorch Lightning.

## MLflow Logger API

(tune-mlflow-logger)=

```{eval-rst}
.. autoclass:: ray.air.integrations.mlflow.MLflowLoggerCallback
   :noindex:
```

## MLflow Setup API

(tune-mlflow-setup)=

```{eval-rst}
.. autofunction:: ray.air.integrations.mlflow.setup_mlflow
   :noindex:
```

## More MLflow Examples

- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using MLflow and PyTorch Lightning with Ray Tune.