Online reinforcement learning with Ray AIR

In this example, we'll train a reinforcement learning agent using online training.

Online training means that data is sampled from the environment while the algorithm is running. In contrast, offline training uses data that was collected earlier and stored on disk.
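
To make the distinction concrete, here is a minimal, library-free sketch. The `ToyEnv` class, the random behavior policy, and the `stored_transitions` list standing in for data on disk are all hypothetical stand-ins for illustration; they are not part of Ray or RLlib.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: an episode lasts `horizon` steps."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int):
        self.t += 1
        reward = 1.0
        done = self.t >= self.horizon
        return self.t, reward, done


def collect_online(env: ToyEnv, num_steps: int, rng: random.Random) -> list:
    """Online: transitions are generated by stepping the environment now."""
    transitions = []
    obs = env.reset()
    for _ in range(num_steps):
        action = rng.choice([0, 1])
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        # Start a new episode when the current one has ended.
        obs = env.reset() if done else next_obs
    return transitions


def read_offline(stored_transitions: list, num_steps: int) -> list:
    """Offline: transitions were recorded earlier; no environment is needed."""
    return stored_transitions[:num_steps]


rng = random.Random(0)
online_batch = collect_online(ToyEnv(), num_steps=12, rng=rng)
# Pretend this list was written to disk by an earlier run.
stored_transitions = list(online_batch)
offline_batch = read_offline(stored_transitions, num_steps=8)
```

In the online case the learner needs access to the environment; in the offline case it only needs the recorded transitions.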

Let's start by installing our dependencies:

In [1]:

!pip install -qU "ray[rllib]" gymnasium

Now we can run some imports:

In [2]:

import gymnasium as gym
import numpy as np
import ray
from ray.air import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.tune.tuner import Tuner

2022-05-19 13:54:16,520	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 13:54:16,531	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.marwil` has been deprecated. Use `ray.rllib.algorithms.marwil` instead. This will raise an error in the future!

Next, we define the training function. It creates an RLTrainer that uses the PPO algorithm and trains on the CartPole-v1 environment for five iterations:

In [3]:

def train_rl_ppo_online(num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting online training")
    trainer = RLTrainer(
        # Stop after five training iterations.
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "tf",
        },
    )
    # TODO (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    # Until then, wrap the trainer in a Tuner so that a checkpoint is
    # written when training ends.
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    # Only one trial is run here, so take the first result.
    result = tuner.fit()[0]
    return result
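
The `stop={"training_iteration": 5}` criterion ends the trial once the reported result reaches that value. The sketch below mimics that check in plain Python. The `should_stop` helper and the fake result dictionaries are hypothetical, and the comparison is a simplified model of dictionary-style stopping criteria, not Ray's implementation.

```python
def should_stop(stop: dict, result: dict) -> bool:
    """Return True once any listed metric has reached its threshold."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"training_iteration": 5}
iterations_run = 0
for iteration in range(1, 100):
    # Stand-in for the result reported after each training iteration.
    result = {"training_iteration": iteration}
    iterations_run = iteration
    if should_stop(stop, result):
        break
```

Under this model the loop ends after the fifth reported result, which matches the `iter` column of the status table in the training output.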

Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we define a utility function:

In [4]:

def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:
    # Restore the trained policy from the checkpoint.
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v1")

    rewards = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        reward = 0.0
        terminated = truncated = False
        # An episode ends when the environment terminates or is truncated.
        while not terminated and not truncated:
            # Pass a batch containing one observation; take the single action.
            action = predictor.predict(np.array([obs]))
            obs, r, terminated, truncated, _ = env.step(action[0])
            reward += r
        rewards.append(reward)

    return rewards
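
The loop wraps each observation in a batch of one before calling `predict` and takes element `0` of the returned actions. It also ends an episode as soon as either `terminated` or `truncated` is set. The sketch below reproduces that control flow with hypothetical stand-ins (`StubEnv`, `StubPredictor`) so that it runs without Ray or Gymnasium; only the shape of the loop is taken from `evaluate_using_checkpoint`.

```python
class StubEnv:
    """Hypothetical environment following the five-value step convention."""

    def __init__(self, fail_at: int, max_steps: int):
        self.fail_at = fail_at      # step at which the episode terminates
        self.max_steps = max_steps  # step at which the episode is truncated
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.fail_at
        truncated = self.t >= self.max_steps
        return [float(self.t)], 1.0, terminated, truncated, {}


class StubPredictor:
    """Hypothetical predictor: maps a batch of observations to a batch of actions."""

    def predict(self, batch):
        return [0 for _ in batch]


def run_episode(env, predictor) -> float:
    obs, _ = env.reset()
    total = 0.0
    terminated = truncated = False
    while not terminated and not truncated:
        actions = predictor.predict([obs])  # batch of one observation
        obs, r, terminated, truncated, _ = env.step(actions[0])
        total += r
    return total


ended_by_termination = run_episode(StubEnv(fail_at=3, max_steps=10), StubPredictor())
ended_by_truncation = run_episode(StubEnv(fail_at=50, max_steps=10), StubPredictor())
```

The first episode stops because the stub environment terminates, the second because it hits the step limit; the loop handles both cases the same way.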

Let's put it all together. First, we run training:

In [5]:

result = train_rl_ppo_online(num_workers=2, use_gpu=False)

2022-05-19 13:54:16,582	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.agents.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!

Starting online training

2022-05-19 13:54:19,326	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8267

== Status ==
Current time: 2022-05-19 13:54:57 (running for 00:00:35.99)
Memory usage on this node: 9.6/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.54 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16
Number of trials: 1/1 (1 TERMINATED)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
AIRPPOTrainer_cd8d6_00000	TERMINATED	127.0.0.1:14174	5	16.7029	20000	124.79	200	9	124.79

(raylet) 2022-05-19 13:54:23,061	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
(pid=14174) 2022-05-19 13:54:30,271	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,749	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750	INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(raylet) 2022-05-19 13:54:31,857	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 13:54:31,857	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331134
(RolloutWorker pid=14179) 2022-05-19 13:54:39,442	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=14180) 2022-05-19 13:54:39,492	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:40,836	INFO trainable.py:163 -- Trainable.setup took 10.087 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:40,836	WARNING util.py:65 -- Install gputil for GPU system monitoring.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:42,569	WARNING deprecation.py:47 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!

Result for AIRPPOTrainer_cd8d6_00000:
  agent_timesteps_total: 4000
  counters:
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_env_steps_sampled: 4000
    num_env_steps_trained: 4000
  custom_metrics: {}
  date: 2022-05-19_13-54-44
  done: false
  episode_len_mean: 22.11731843575419
  episode_media: {}
  episode_reward_max: 87.0
  episode_reward_mean: 22.11731843575419
  episode_reward_min: 8.0
  episodes_this_iter: 179
  episodes_total: 179
  experiment_id: 158c57d8b6e142ad85b393db57c8bdff
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6653298139572144
          entropy_coeff: 0.0
          kl: 0.02798665314912796
          model: {}
          policy_loss: -0.0422092080116272
          total_loss: 8.986403465270996
          vf_explained_var: -0.06533512473106384
          vf_loss: 9.023015022277832
        num_agent_steps_trained: 128.0
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_env_steps_sampled: 4000
    num_env_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 4000
  num_agent_steps_trained: 4000
  num_env_steps_sampled: 4000
  num_env_steps_sampled_this_iter: 4000
  num_env_steps_trained: 4000
  num_env_steps_trained_this_iter: 4000
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.849999999999998
    ram_util_percent: 61.199999999999996
  pid: 14174
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.06886580197141673
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.05465748139159193
    mean_inference_ms: 0.6132523881103351
    mean_raw_obs_processing_ms: 0.10609273714105154
  sampler_results:
    custom_metrics: {}
    episode_len_mean: 22.11731843575419
    episode_media: {}
    episode_reward_max: 87.0
    episode_reward_mean: 22.11731843575419
    episode_reward_min: 8.0
    episodes_this_iter: 179
    hist_stats:
      episode_lengths:
      - 28
      - 9
      - 12
      - 23
      - 13
      - 21
      - 15
      - 16
      - 19
      - 44
      - 14
      - 19
      - 19
      - 17
      - 17
      - 12
      - 9
      - 48
      - 43
      - 15
      - 21
      - 25
      - 16
      - 14
      - 22
      - 21
      - 24
      - 53
      - 21
      - 16
      - 17
      - 14
      - 20
      - 22
      - 18
      - 17
      - 14
      - 11
      - 46
      - 12
      - 18
      - 21
      - 13
      - 58
      - 10
      - 20
      - 14
      - 25
      - 22
      - 33
      - 23
      - 10
      - 25
      - 11
      - 32
      - 48
      - 12
      - 12
      - 10
      - 24
      - 15
      - 28
      - 14
      - 16
      - 14
      - 21
      - 12
      - 13
      - 8
      - 12
      - 13
      - 10
      - 10
      - 14
      - 30
      - 16
      - 23
      - 47
      - 14
      - 22
      - 11
      - 18
      - 12
      - 21
      - 21
      - 20
      - 18
      - 29
      - 18
      - 24
      - 50
      - 87
      - 21
      - 41
      - 21
      - 34
      - 47
      - 20
      - 26
      - 14
      - 9
      - 24
      - 16
      - 18
      - 44
      - 28
      - 37
      - 10
      - 19
      - 11
      - 56
      - 11
      - 28
      - 16
      - 14
      - 19
      - 23
      - 11
      - 22
      - 63
      - 22
      - 13
      - 29
      - 11
      - 64
      - 44
      - 45
      - 38
      - 17
      - 18
      - 21
      - 13
      - 12
      - 13
      - 10
      - 17
      - 14
      - 16
      - 10
      - 19
      - 25
      - 15
      - 50
      - 13
      - 10
      - 15
      - 12
      - 15
      - 11
      - 14
      - 17
      - 17
      - 14
      - 49
      - 18
      - 13
      - 28
      - 31
      - 19
      - 26
      - 31
      - 29
      - 21
      - 23
      - 17
      - 23
      - 32
      - 35
      - 10
      - 11
      - 30
      - 21
      - 16
      - 15
      - 23
      - 40
      - 24
      - 24
      - 14
      episode_reward:
      - 28.0
      - 9.0
      - 12.0
      - 23.0
      - 13.0
      - 21.0
      - 15.0
      - 16.0
      - 19.0
      - 44.0
      - 14.0
      - 19.0
      - 19.0
      - 17.0
      - 17.0
      - 12.0
      - 9.0
      - 48.0
      - 43.0
      - 15.0
      - 21.0
      - 25.0
      - 16.0
      - 14.0
      - 22.0
      - 21.0
      - 24.0
      - 53.0
      - 21.0
      - 16.0
      - 17.0
      - 14.0
      - 20.0
      - 22.0
      - 18.0
      - 17.0
      - 14.0
      - 11.0
      - 46.0
      - 12.0
      - 18.0
      - 21.0
      - 13.0
      - 58.0
      - 10.0
      - 20.0
      - 14.0
      - 25.0
      - 22.0
      - 33.0
      - 23.0
      - 10.0
      - 25.0
      - 11.0
      - 32.0
      - 48.0
      - 12.0
      - 12.0
      - 10.0
      - 24.0
      - 15.0
      - 28.0
      - 14.0
      - 16.0
      - 14.0
      - 21.0
      - 12.0
      - 13.0
      - 8.0
      - 12.0
      - 13.0
      - 10.0
      - 10.0
      - 14.0
      - 30.0
      - 16.0
      - 23.0
      - 47.0
      - 14.0
      - 22.0
      - 11.0
      - 18.0
      - 12.0
      - 21.0
      - 21.0
      - 20.0
      - 18.0
      - 29.0
      - 18.0
      - 24.0
      - 50.0
      - 87.0
      - 21.0
      - 41.0
      - 21.0
      - 34.0
      - 47.0
      - 20.0
      - 26.0
      - 14.0
      - 9.0
      - 24.0
      - 16.0
      - 18.0
      - 44.0
      - 28.0
      - 37.0
      - 10.0
      - 19.0
      - 11.0
      - 56.0
      - 11.0
      - 28.0
      - 16.0
      - 14.0
      - 19.0
      - 23.0
      - 11.0
      - 22.0
      - 63.0
      - 22.0
      - 13.0
      - 29.0
      - 11.0
      - 64.0
      - 44.0
      - 45.0
      - 38.0
      - 17.0
      - 18.0
      - 21.0
      - 13.0
      - 12.0
      - 13.0
      - 10.0
      - 17.0
      - 14.0
      - 16.0
      - 10.0
      - 19.0
      - 25.0
      - 15.0
      - 50.0
      - 13.0
      - 10.0
      - 15.0
      - 12.0
      - 15.0
      - 11.0
      - 14.0
      - 17.0
      - 17.0
      - 14.0
      - 49.0
      - 18.0
      - 13.0
      - 28.0
      - 31.0
      - 19.0
      - 26.0
      - 31.0
      - 29.0
      - 21.0
      - 23.0
      - 17.0
      - 23.0
      - 32.0
      - 35.0
      - 10.0
      - 11.0
      - 30.0
      - 21.0
      - 16.0
      - 15.0
      - 23.0
      - 40.0
      - 24.0
      - 24.0
      - 14.0
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf:
      mean_action_processing_ms: 0.06886580197141673
      mean_env_render_ms: 0.0
      mean_env_wait_ms: 0.05465748139159193
      mean_inference_ms: 0.6132523881103351
      mean_raw_obs_processing_ms: 0.10609273714105154
  time_since_restore: 3.7304069995880127
  time_this_iter_s: 3.7304069995880127
  time_total_s: 3.7304069995880127
  timers:
    learn_throughput: 2006.2
    learn_time_ms: 1993.819
    load_throughput: 24708712.813
    load_time_ms: 0.162
    training_iteration_time_ms: 3726.731
    update_time_ms: 1.95
  timestamp: 1652964884
  timesteps_since_restore: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: cd8d6_00000
  warmup_time: 10.095139741897583
  
Result for AIRPPOTrainer_cd8d6_00000:
  agent_timesteps_total: 12000
  counters:
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_env_steps_sampled: 12000
    num_env_steps_trained: 12000
  custom_metrics: {}
  date: 2022-05-19_13-54-51
  done: false
  episode_len_mean: 65.15
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 65.15
  episode_reward_min: 9.0
  episodes_this_iter: 44
  episodes_total: 311
  experiment_id: 158c57d8b6e142ad85b393db57c8bdff
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.5750519633293152
          entropy_coeff: 0.0
          kl: 0.012749233283102512
          model: {}
          policy_loss: -0.026830431073904037
          total_loss: 9.414541244506836
          vf_explained_var: 0.046859823167324066
          vf_loss: 9.43754768371582
        num_agent_steps_trained: 128.0
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_env_steps_sampled: 12000
    num_env_steps_trained: 12000
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 12000
  num_agent_steps_trained: 12000
  num_env_steps_sampled: 12000
  num_env_steps_sampled_this_iter: 4000
  num_env_steps_trained: 12000
  num_env_steps_trained_this_iter: 4000
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 20.9
    ram_util_percent: 61.379999999999995
  pid: 14174
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.06834399059626647
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.05423359203664157
    mean_inference_ms: 0.5997818239241897
    mean_raw_obs_processing_ms: 0.0982917359628421
  sampler_results:
    custom_metrics: {}
    episode_len_mean: 65.15
    episode_media: {}
    episode_reward_max: 200.0
    episode_reward_mean: 65.15
    episode_reward_min: 9.0
    episodes_this_iter: 44
    hist_stats:
      episode_lengths:
      - 34
      - 37
      - 38
      - 23
      - 29
      - 56
      - 38
      - 13
      - 10
      - 18
      - 40
      - 23
      - 46
      - 84
      - 29
      - 44
      - 54
      - 32
      - 30
      - 100
      - 28
      - 67
      - 47
      - 40
      - 74
      - 133
      - 32
      - 28
      - 86
      - 133
      - 46
      - 60
      - 17
      - 43
      - 12
      - 51
      - 57
      - 70
      - 54
      - 73
      - 16
      - 29
      - 113
      - 45
      - 31
      - 44
      - 103
      - 62
      - 72
      - 20
      - 15
      - 35
      - 12
      - 9
      - 24
      - 10
      - 102
      - 93
      - 73
      - 27
      - 52
      - 144
      - 19
      - 140
      - 91
      - 133
      - 147
      - 140
      - 90
      - 14
      - 73
      - 71
      - 200
      - 55
      - 184
      - 103
      - 196
      - 168
      - 177
      - 38
      - 33
      - 50
      - 149
      - 67
      - 87
      - 25
      - 134
      - 42
      - 26
      - 24
      - 121
      - 61
      - 109
      - 19
      - 200
      - 60
      - 40
      - 51
      - 88
      - 30
      episode_reward:
      - 34.0
      - 37.0
      - 38.0
      - 23.0
      - 29.0
      - 56.0
      - 38.0
      - 13.0
      - 10.0
      - 18.0
      - 40.0
      - 23.0
      - 46.0
      - 84.0
      - 29.0
      - 44.0
      - 54.0
      - 32.0
      - 30.0
      - 100.0
      - 28.0
      - 67.0
      - 47.0
      - 40.0
      - 74.0
      - 133.0
      - 32.0
      - 28.0
      - 86.0
      - 133.0
      - 46.0
      - 60.0
      - 17.0
      - 43.0
      - 12.0
      - 51.0
      - 57.0
      - 70.0
      - 54.0
      - 73.0
      - 16.0
      - 29.0
      - 113.0
      - 45.0
      - 31.0
      - 44.0
      - 103.0
      - 62.0
      - 72.0
      - 20.0
      - 15.0
      - 35.0
      - 12.0
      - 9.0
      - 24.0
      - 10.0
      - 102.0
      - 93.0
      - 73.0
      - 27.0
      - 52.0
      - 144.0
      - 19.0
      - 140.0
      - 91.0
      - 133.0
      - 147.0
      - 140.0
      - 90.0
      - 14.0
      - 73.0
      - 71.0
      - 200.0
      - 55.0
      - 184.0
      - 103.0
      - 196.0
      - 168.0
      - 177.0
      - 38.0
      - 33.0
      - 50.0
      - 149.0
      - 67.0
      - 87.0
      - 25.0
      - 134.0
      - 42.0
      - 26.0
      - 24.0
      - 121.0
      - 61.0
      - 109.0
      - 19.0
      - 200.0
      - 60.0
      - 40.0
      - 51.0
      - 88.0
      - 30.0
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf:
      mean_action_processing_ms: 0.06834399059626647
      mean_env_render_ms: 0.0
      mean_env_wait_ms: 0.05423359203664157
      mean_inference_ms: 0.5997818239241897
      mean_raw_obs_processing_ms: 0.0982917359628421
  time_since_restore: 10.289561986923218
  time_this_iter_s: 3.3495230674743652
  time_total_s: 10.289561986923218
  timers:
    learn_throughput: 2276.977
    learn_time_ms: 1756.715
    load_throughput: 20798201.653
    load_time_ms: 0.192
    training_iteration_time_ms: 3425.704
    update_time_ms: 1.814
  timestamp: 1652964891
  timesteps_since_restore: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: cd8d6_00000
  warmup_time: 10.095139741897583
  
Result for AIRPPOTrainer_cd8d6_00000:
  agent_timesteps_total: 20000
  counters:
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_env_steps_sampled: 20000
    num_env_steps_trained: 20000
  custom_metrics: {}
  date: 2022-05-19_13-54-57
  done: true
  episode_len_mean: 124.79
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 124.79
  episode_reward_min: 9.0
  episodes_this_iter: 20
  episodes_total: 354
  experiment_id: 158c57d8b6e142ad85b393db57c8bdff
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.5436986684799194
          entropy_coeff: 0.0
          kl: 0.0034858626313507557
          model: {}
          policy_loss: -0.012989979237318039
          total_loss: 9.49295425415039
          vf_explained_var: 0.025460055097937584
          vf_loss: 9.504897117614746
        num_agent_steps_trained: 128.0
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_env_steps_sampled: 20000
    num_env_steps_trained: 20000
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 20000
  num_agent_steps_trained: 20000
  num_env_steps_sampled: 20000
  num_env_steps_sampled_this_iter: 4000
  num_env_steps_trained: 20000
  num_env_steps_trained_this_iter: 4000
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.599999999999998
    ram_util_percent: 59.775
  pid: 14174
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.06817872750804764
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.05424549075766555
    mean_inference_ms: 0.5976919122059019
    mean_raw_obs_processing_ms: 0.09603803519062176
  sampler_results:
    custom_metrics: {}
    episode_len_mean: 124.79
    episode_media: {}
    episode_reward_max: 200.0
    episode_reward_mean: 124.79
    episode_reward_min: 9.0
    episodes_this_iter: 20
    hist_stats:
      episode_lengths:
      - 45
      - 31
      - 44
      - 103
      - 62
      - 72
      - 20
      - 15
      - 35
      - 12
      - 9
      - 24
      - 10
      - 102
      - 93
      - 73
      - 27
      - 52
      - 144
      - 19
      - 140
      - 91
      - 133
      - 147
      - 140
      - 90
      - 14
      - 73
      - 71
      - 200
      - 55
      - 184
      - 103
      - 196
      - 168
      - 177
      - 38
      - 33
      - 50
      - 149
      - 67
      - 87
      - 25
      - 134
      - 42
      - 26
      - 24
      - 121
      - 61
      - 109
      - 19
      - 200
      - 60
      - 40
      - 51
      - 88
      - 30
      - 200
      - 186
      - 200
      - 182
      - 196
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 43
      - 200
      - 109
      - 156
      - 200
      - 183
      - 200
      - 200
      - 200
      - 200
      - 200
      - 107
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 89
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      - 200
      episode_reward:
      - 45.0
      - 31.0
      - 44.0
      - 103.0
      - 62.0
      - 72.0
      - 20.0
      - 15.0
      - 35.0
      - 12.0
      - 9.0
      - 24.0
      - 10.0
      - 102.0
      - 93.0
      - 73.0
      - 27.0
      - 52.0
      - 144.0
      - 19.0
      - 140.0
      - 91.0
      - 133.0
      - 147.0
      - 140.0
      - 90.0
      - 14.0
      - 73.0
      - 71.0
      - 200.0
      - 55.0
      - 184.0
      - 103.0
      - 196.0
      - 168.0
      - 177.0
      - 38.0
      - 33.0
      - 50.0
      - 149.0
      - 67.0
      - 87.0
      - 25.0
      - 134.0
      - 42.0
      - 26.0
      - 24.0
      - 121.0
      - 61.0
      - 109.0
      - 19.0
      - 200.0
      - 60.0
      - 40.0
      - 51.0
      - 88.0
      - 30.0
      - 200.0
      - 186.0
      - 200.0
      - 182.0
      - 196.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 43.0
      - 200.0
      - 109.0
      - 156.0
      - 200.0
      - 183.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 107.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 89.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
      - 200.0
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf:
      mean_action_processing_ms: 0.06817872750804764
      mean_env_render_ms: 0.0
      mean_env_wait_ms: 0.05424549075766555
      mean_inference_ms: 0.5976919122059019
      mean_raw_obs_processing_ms: 0.09603803519062176
  time_since_restore: 16.702913284301758
  time_this_iter_s: 3.1872010231018066
  time_total_s: 16.702913284301758
  timers:
    learn_throughput: 2378.661
    learn_time_ms: 1681.619
    load_throughput: 16503261.853
    load_time_ms: 0.242
    training_iteration_time_ms: 3336.7
    update_time_ms: 1.759
  timestamp: 1652964897
  timesteps_since_restore: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: cd8d6_00000
  warmup_time: 10.095139741897583

2022-05-19 13:54:58,548	INFO tune.py:753 -- Total run time: 36.92 seconds (35.95 seconds for the tuning loop).
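
The numbers in the final result can be cross-checked with simple arithmetic. This sketch copies `num_env_steps_sampled_this_iter`, `training_iteration`, and `time_total_s` from the log above. The variable names and the derived steps-per-second figure are illustrative only, and `time_total_s` is assumed here to cover both sampling and learning.

```python
steps_per_iteration = 4000          # num_env_steps_sampled_this_iter
iterations = 5                      # training_iteration at the end of the run
time_total_s = 16.702913284301758   # time_total_s in the final result

# Total environment steps should match timesteps_total in the final result.
total_steps = steps_per_iteration * iterations
steps_per_second = total_steps / time_total_s

print(f"total env steps: {total_steps}")
print(f"average env steps per second: {steps_per_second:.0f}")
```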

And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:

In [6]:

num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: {np.mean(rewards)}")

2022-05-19 13:54:58,589	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-19 13:54:58,590	WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-05-19 13:54:58,591	INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-05-19 13:54:58,591	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=14191) 2022-05-19 13:55:06,622	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=14192) 2022-05-19 13:55:06,622	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 13:55:07,968	WARNING util.py:65 -- Install gputil for GPU system monitoring.
2022-05-19 13:55:08,021	INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16/AIRPPOTrainer_cd8d6_00000_0_2022-05-19_13-54-22/checkpoint_000005/checkpoint-5
2022-05-19 13:55:08,021	INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 16.702913284301758, '_episodes_total': 354}

Average reward over 3 episodes: 200.0