In this example, we'll train a reinforcement learning agent using online training.
Online training means that data is sampled from the environment while the algorithm is running. In contrast, offline training uses data that was collected earlier and stored on disk.
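To make the distinction concrete, here is a minimal sketch in plain Python. It does not use Ray; `CoinFlipEnv`, `collect_online`, and `load_offline` are hypothetical names invented for this illustration, and the "recorded" list simply stands in for data that would be read from disk.

```python
import random


class CoinFlipEnv:
    """Hypothetical one-step environment: reward 1.0 if the action matches a hidden coin."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)

    def step(self, action: int) -> float:
        coin = self._rng.randint(0, 1)
        return 1.0 if action == coin else 0.0


def collect_online(env: CoinFlipEnv, num_steps: int) -> list:
    """Online: query the environment for fresh samples while the algorithm runs."""
    batch = []
    for i in range(num_steps):
        action = i % 2  # trivial placeholder policy: alternate between 0 and 1
        batch.append((action, env.step(action)))
    return batch


def load_offline(stored: list) -> list:
    """Offline: reuse transitions recorded earlier; the environment is never touched."""
    return list(stored)


env = CoinFlipEnv(seed=0)
online_batch = collect_online(env, num_steps=4)      # sampled right now
recorded = [(0, 1.0), (1, 0.0), (0, 0.0), (1, 1.0)]  # stands in for data on disk
offline_batch = load_offline(recorded)
print(len(online_batch), len(offline_batch))
```

In both cases the learner receives a batch of (action, reward) pairs; the only difference is whether the environment is stepped during training.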
Let's start with installing our dependencies:
!pip install -qU "ray[rllib]" gymnasium
Now we can run some imports:
import argparse
import gymnasium as gym
import os
import numpy as np
import ray
from ray.air import Checkpoint
from ray.air.config import RunConfig
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.air.config import ScalingConfig
from ray.air.result import Result
from ray.rllib.algorithms.bc import BC
from ray.tune.tuner import Tuner
2022-05-19 13:54:16,520 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 13:54:16,531 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.marwil` has been deprecated. Use `ray.rllib.algorithms.marwil` instead. This will raise an error in the future!
Here we define the training function. It will create an RLTrainer using the PPO algorithm and kick off training on the CartPole-v1 environment:
def train_rl_ppo_online(num_workers: int, use_gpu: bool = False) -> Result:
    """Train PPO on CartPole-v1 for five iterations and return the result."""
    print("Starting online training")
    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "tf",
        },
    )
    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    # Until then, run the trainer through a Tuner so a checkpoint is saved at the end.
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    # Only one trial is run, so take the first (and only) result.
    result = tuner.fit()[0]
    return result
Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we also define a utility function:
def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:
    """Run the checkpointed policy and return the total reward of each episode."""
    predictor = RLPredictor.from_checkpoint(checkpoint)
    env = gym.make("CartPole-v1")
    rewards = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        reward = 0.0
        terminated = truncated = False
        # An episode ends when the env terminates it or the time limit truncates it.
        while not terminated and not truncated:
            # The predictor expects a batch, so wrap the single observation.
            action = predictor.predict(np.array([obs]))
            obs, r, terminated, truncated, _ = env.step(action[0])
            reward += r
        rewards.append(reward)
    return rewards
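The episode loop in this function can be tried out without Ray or a trained checkpoint. The following is a minimal sketch under stated assumptions: `CountdownEnv` and `constant_policy` are hypothetical stand-ins for the Gymnasium environment and the predictor, and they only mimic the `reset()` and five-value `step()` return signatures used above.

```python
class CountdownEnv:
    """Hypothetical stand-in for a Gymnasium env: episodes are cut off after `horizon` steps."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self._t = 0

    def reset(self):
        self._t = 0
        return [0.0], {}  # (observation, info)

    def step(self, action):
        self._t += 1
        obs = [float(self._t)]
        reward = 1.0
        terminated = False                   # this toy env has no failure state
        truncated = self._t >= self.horizon  # time limit reached
        return obs, reward, terminated, truncated, {}


def constant_policy(obs):
    """Hypothetical policy that always picks action 0."""
    return 0


def rollout(env, policy, num_episodes: int) -> list:
    """Run `num_episodes` episodes and collect the summed reward of each."""
    returns = []
    for _ in range(num_episodes):
        obs, _info = env.reset()
        episode_return = 0.0
        terminated = truncated = False
        while not (terminated or truncated):
            obs, reward, terminated, truncated, _info = env.step(policy(obs))
            episode_return += reward
        returns.append(episode_return)
    return returns


episode_returns = rollout(CountdownEnv(horizon=5), constant_policy, num_episodes=3)
print(episode_returns)  # [5.0, 5.0, 5.0]
```

Checking both `terminated` and `truncated` matters: an environment with a time limit ends an episode through truncation even when the agent never fails.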
Let's put it all together. First, we run training:
result = train_rl_ppo_online(num_workers=2, use_gpu=False)
2022-05-19 13:54:16,582 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.agents.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
Starting online training
2022-05-19 13:54:19,326 INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8267
Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
---|---|---|---|---|---|---|---|---|---|
AIRPPOTrainer_cd8d6_00000 | TERMINATED | 127.0.0.1:14174 | 5 | 16.7029 | 20000 | 124.79 | 200 | 9 | 124.79 |
(raylet) 2022-05-19 13:54:23,061 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
(pid=14174) 2022-05-19 13:54:30,271 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,749 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750 INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(raylet) 2022-05-19 13:54:31,857 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 13:54:31,857 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331134
(RolloutWorker pid=14179) 2022-05-19 13:54:39,442 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=14180) 2022-05-19 13:54:39,492 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:40,836 INFO trainable.py:163 -- Trainable.setup took 10.087 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:40,836 WARNING util.py:65 -- Install gputil for GPU system monitoring.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:42,569 WARNING deprecation.py:47 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
Result for AIRPPOTrainer_cd8d6_00000: agent_timesteps_total: 4000 counters: num_agent_steps_sampled: 4000 num_agent_steps_trained: 4000 num_env_steps_sampled: 4000 num_env_steps_trained: 4000 custom_metrics: {} date: 2022-05-19_13-54-44 done: false episode_len_mean: 22.11731843575419 episode_media: {} episode_reward_max: 87.0 episode_reward_mean: 22.11731843575419 episode_reward_min: 8.0 episodes_this_iter: 179 episodes_total: 179 experiment_id: 158c57d8b6e142ad85b393db57c8bdff hostname: Kais-MacBook-Pro.local info: learner: default_policy: custom_metrics: {} learner_stats: cur_kl_coeff: 0.20000000298023224 cur_lr: 4.999999873689376e-05 entropy: 0.6653298139572144 entropy_coeff: 0.0 kl: 0.02798665314912796 model: {} policy_loss: -0.0422092080116272 total_loss: 8.986403465270996 vf_explained_var: -0.06533512473106384 vf_loss: 9.023015022277832 num_agent_steps_trained: 128.0 num_agent_steps_sampled: 4000 num_agent_steps_trained: 4000 num_env_steps_sampled: 4000 num_env_steps_trained: 4000 iterations_since_restore: 1 node_ip: 127.0.0.1 num_agent_steps_sampled: 4000 num_agent_steps_trained: 4000 num_env_steps_sampled: 4000 num_env_steps_sampled_this_iter: 4000 num_env_steps_trained: 4000 num_env_steps_trained_this_iter: 4000 num_healthy_workers: 2 off_policy_estimator: {} perf: cpu_util_percent: 24.849999999999998 ram_util_percent: 61.199999999999996 pid: 14174 policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.06886580197141673 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.05465748139159193 mean_inference_ms: 0.6132523881103351 mean_raw_obs_processing_ms: 0.10609273714105154 sampler_results: custom_metrics: {} episode_len_mean: 22.11731843575419 episode_media: {} episode_reward_max: 87.0 episode_reward_mean: 22.11731843575419 episode_reward_min: 8.0 episodes_this_iter: 179 hist_stats: episode_lengths: - 28 - 9 - 12 - 23 - 13 - 21 - 15 - 16 - 19 - 44 - 14 - 19 - 19 - 17 - 17 - 12 - 9 - 48 - 43 - 15 - 21 - 25 - 16 
- 14 - 22 - 21 - 24 - 53 - 21 - 16 - 17 - 14 - 20 - 22 - 18 - 17 - 14 - 11 - 46 - 12 - 18 - 21 - 13 - 58 - 10 - 20 - 14 - 25 - 22 - 33 - 23 - 10 - 25 - 11 - 32 - 48 - 12 - 12 - 10 - 24 - 15 - 28 - 14 - 16 - 14 - 21 - 12 - 13 - 8 - 12 - 13 - 10 - 10 - 14 - 30 - 16 - 23 - 47 - 14 - 22 - 11 - 18 - 12 - 21 - 21 - 20 - 18 - 29 - 18 - 24 - 50 - 87 - 21 - 41 - 21 - 34 - 47 - 20 - 26 - 14 - 9 - 24 - 16 - 18 - 44 - 28 - 37 - 10 - 19 - 11 - 56 - 11 - 28 - 16 - 14 - 19 - 23 - 11 - 22 - 63 - 22 - 13 - 29 - 11 - 64 - 44 - 45 - 38 - 17 - 18 - 21 - 13 - 12 - 13 - 10 - 17 - 14 - 16 - 10 - 19 - 25 - 15 - 50 - 13 - 10 - 15 - 12 - 15 - 11 - 14 - 17 - 17 - 14 - 49 - 18 - 13 - 28 - 31 - 19 - 26 - 31 - 29 - 21 - 23 - 17 - 23 - 32 - 35 - 10 - 11 - 30 - 21 - 16 - 15 - 23 - 40 - 24 - 24 - 14 episode_reward: - 28.0 - 9.0 - 12.0 - 23.0 - 13.0 - 21.0 - 15.0 - 16.0 - 19.0 - 44.0 - 14.0 - 19.0 - 19.0 - 17.0 - 17.0 - 12.0 - 9.0 - 48.0 - 43.0 - 15.0 - 21.0 - 25.0 - 16.0 - 14.0 - 22.0 - 21.0 - 24.0 - 53.0 - 21.0 - 16.0 - 17.0 - 14.0 - 20.0 - 22.0 - 18.0 - 17.0 - 14.0 - 11.0 - 46.0 - 12.0 - 18.0 - 21.0 - 13.0 - 58.0 - 10.0 - 20.0 - 14.0 - 25.0 - 22.0 - 33.0 - 23.0 - 10.0 - 25.0 - 11.0 - 32.0 - 48.0 - 12.0 - 12.0 - 10.0 - 24.0 - 15.0 - 28.0 - 14.0 - 16.0 - 14.0 - 21.0 - 12.0 - 13.0 - 8.0 - 12.0 - 13.0 - 10.0 - 10.0 - 14.0 - 30.0 - 16.0 - 23.0 - 47.0 - 14.0 - 22.0 - 11.0 - 18.0 - 12.0 - 21.0 - 21.0 - 20.0 - 18.0 - 29.0 - 18.0 - 24.0 - 50.0 - 87.0 - 21.0 - 41.0 - 21.0 - 34.0 - 47.0 - 20.0 - 26.0 - 14.0 - 9.0 - 24.0 - 16.0 - 18.0 - 44.0 - 28.0 - 37.0 - 10.0 - 19.0 - 11.0 - 56.0 - 11.0 - 28.0 - 16.0 - 14.0 - 19.0 - 23.0 - 11.0 - 22.0 - 63.0 - 22.0 - 13.0 - 29.0 - 11.0 - 64.0 - 44.0 - 45.0 - 38.0 - 17.0 - 18.0 - 21.0 - 13.0 - 12.0 - 13.0 - 10.0 - 17.0 - 14.0 - 16.0 - 10.0 - 19.0 - 25.0 - 15.0 - 50.0 - 13.0 - 10.0 - 15.0 - 12.0 - 15.0 - 11.0 - 14.0 - 17.0 - 17.0 - 14.0 - 49.0 - 18.0 - 13.0 - 28.0 - 31.0 - 19.0 - 26.0 - 31.0 - 29.0 - 21.0 - 23.0 - 17.0 - 23.0 - 32.0 - 35.0 - 10.0 - 11.0 - 30.0 - 21.0 - 
16.0 - 15.0 - 23.0 - 40.0 - 24.0 - 24.0 - 14.0 off_policy_estimator: {} policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.06886580197141673 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.05465748139159193 mean_inference_ms: 0.6132523881103351 mean_raw_obs_processing_ms: 0.10609273714105154 time_since_restore: 3.7304069995880127 time_this_iter_s: 3.7304069995880127 time_total_s: 3.7304069995880127 timers: learn_throughput: 2006.2 learn_time_ms: 1993.819 load_throughput: 24708712.813 load_time_ms: 0.162 training_iteration_time_ms: 3726.731 update_time_ms: 1.95 timestamp: 1652964884 timesteps_since_restore: 0 timesteps_total: 4000 training_iteration: 1 trial_id: cd8d6_00000 warmup_time: 10.095139741897583 Result for AIRPPOTrainer_cd8d6_00000: agent_timesteps_total: 12000 counters: num_agent_steps_sampled: 12000 num_agent_steps_trained: 12000 num_env_steps_sampled: 12000 num_env_steps_trained: 12000 custom_metrics: {} date: 2022-05-19_13-54-51 done: false episode_len_mean: 65.15 episode_media: {} episode_reward_max: 200.0 episode_reward_mean: 65.15 episode_reward_min: 9.0 episodes_this_iter: 44 episodes_total: 311 experiment_id: 158c57d8b6e142ad85b393db57c8bdff hostname: Kais-MacBook-Pro.local info: learner: default_policy: custom_metrics: {} learner_stats: cur_kl_coeff: 0.30000001192092896 cur_lr: 4.999999873689376e-05 entropy: 0.5750519633293152 entropy_coeff: 0.0 kl: 0.012749233283102512 model: {} policy_loss: -0.026830431073904037 total_loss: 9.414541244506836 vf_explained_var: 0.046859823167324066 vf_loss: 9.43754768371582 num_agent_steps_trained: 128.0 num_agent_steps_sampled: 12000 num_agent_steps_trained: 12000 num_env_steps_sampled: 12000 num_env_steps_trained: 12000 iterations_since_restore: 3 node_ip: 127.0.0.1 num_agent_steps_sampled: 12000 num_agent_steps_trained: 12000 num_env_steps_sampled: 12000 num_env_steps_sampled_this_iter: 4000 num_env_steps_trained: 12000 num_env_steps_trained_this_iter: 
4000 num_healthy_workers: 2 off_policy_estimator: {} perf: cpu_util_percent: 20.9 ram_util_percent: 61.379999999999995 pid: 14174 policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.06834399059626647 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.05423359203664157 mean_inference_ms: 0.5997818239241897 mean_raw_obs_processing_ms: 0.0982917359628421 sampler_results: custom_metrics: {} episode_len_mean: 65.15 episode_media: {} episode_reward_max: 200.0 episode_reward_mean: 65.15 episode_reward_min: 9.0 episodes_this_iter: 44 hist_stats: episode_lengths: - 34 - 37 - 38 - 23 - 29 - 56 - 38 - 13 - 10 - 18 - 40 - 23 - 46 - 84 - 29 - 44 - 54 - 32 - 30 - 100 - 28 - 67 - 47 - 40 - 74 - 133 - 32 - 28 - 86 - 133 - 46 - 60 - 17 - 43 - 12 - 51 - 57 - 70 - 54 - 73 - 16 - 29 - 113 - 45 - 31 - 44 - 103 - 62 - 72 - 20 - 15 - 35 - 12 - 9 - 24 - 10 - 102 - 93 - 73 - 27 - 52 - 144 - 19 - 140 - 91 - 133 - 147 - 140 - 90 - 14 - 73 - 71 - 200 - 55 - 184 - 103 - 196 - 168 - 177 - 38 - 33 - 50 - 149 - 67 - 87 - 25 - 134 - 42 - 26 - 24 - 121 - 61 - 109 - 19 - 200 - 60 - 40 - 51 - 88 - 30 episode_reward: - 34.0 - 37.0 - 38.0 - 23.0 - 29.0 - 56.0 - 38.0 - 13.0 - 10.0 - 18.0 - 40.0 - 23.0 - 46.0 - 84.0 - 29.0 - 44.0 - 54.0 - 32.0 - 30.0 - 100.0 - 28.0 - 67.0 - 47.0 - 40.0 - 74.0 - 133.0 - 32.0 - 28.0 - 86.0 - 133.0 - 46.0 - 60.0 - 17.0 - 43.0 - 12.0 - 51.0 - 57.0 - 70.0 - 54.0 - 73.0 - 16.0 - 29.0 - 113.0 - 45.0 - 31.0 - 44.0 - 103.0 - 62.0 - 72.0 - 20.0 - 15.0 - 35.0 - 12.0 - 9.0 - 24.0 - 10.0 - 102.0 - 93.0 - 73.0 - 27.0 - 52.0 - 144.0 - 19.0 - 140.0 - 91.0 - 133.0 - 147.0 - 140.0 - 90.0 - 14.0 - 73.0 - 71.0 - 200.0 - 55.0 - 184.0 - 103.0 - 196.0 - 168.0 - 177.0 - 38.0 - 33.0 - 50.0 - 149.0 - 67.0 - 87.0 - 25.0 - 134.0 - 42.0 - 26.0 - 24.0 - 121.0 - 61.0 - 109.0 - 19.0 - 200.0 - 60.0 - 40.0 - 51.0 - 88.0 - 30.0 off_policy_estimator: {} policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: 
mean_action_processing_ms: 0.06834399059626647 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.05423359203664157 mean_inference_ms: 0.5997818239241897 mean_raw_obs_processing_ms: 0.0982917359628421 time_since_restore: 10.289561986923218 time_this_iter_s: 3.3495230674743652 time_total_s: 10.289561986923218 timers: learn_throughput: 2276.977 learn_time_ms: 1756.715 load_throughput: 20798201.653 load_time_ms: 0.192 training_iteration_time_ms: 3425.704 update_time_ms: 1.814 timestamp: 1652964891 timesteps_since_restore: 0 timesteps_total: 12000 training_iteration: 3 trial_id: cd8d6_00000 warmup_time: 10.095139741897583 Result for AIRPPOTrainer_cd8d6_00000: agent_timesteps_total: 20000 counters: num_agent_steps_sampled: 20000 num_agent_steps_trained: 20000 num_env_steps_sampled: 20000 num_env_steps_trained: 20000 custom_metrics: {} date: 2022-05-19_13-54-57 done: true episode_len_mean: 124.79 episode_media: {} episode_reward_max: 200.0 episode_reward_mean: 124.79 episode_reward_min: 9.0 episodes_this_iter: 20 episodes_total: 354 experiment_id: 158c57d8b6e142ad85b393db57c8bdff hostname: Kais-MacBook-Pro.local info: learner: default_policy: custom_metrics: {} learner_stats: cur_kl_coeff: 0.30000001192092896 cur_lr: 4.999999873689376e-05 entropy: 0.5436986684799194 entropy_coeff: 0.0 kl: 0.0034858626313507557 model: {} policy_loss: -0.012989979237318039 total_loss: 9.49295425415039 vf_explained_var: 0.025460055097937584 vf_loss: 9.504897117614746 num_agent_steps_trained: 128.0 num_agent_steps_sampled: 20000 num_agent_steps_trained: 20000 num_env_steps_sampled: 20000 num_env_steps_trained: 20000 iterations_since_restore: 5 node_ip: 127.0.0.1 num_agent_steps_sampled: 20000 num_agent_steps_trained: 20000 num_env_steps_sampled: 20000 num_env_steps_sampled_this_iter: 4000 num_env_steps_trained: 20000 num_env_steps_trained_this_iter: 4000 num_healthy_workers: 2 off_policy_estimator: {} perf: cpu_util_percent: 24.599999999999998 ram_util_percent: 59.775 pid: 14174 policy_reward_max: 
{} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.06817872750804764 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.05424549075766555 mean_inference_ms: 0.5976919122059019 mean_raw_obs_processing_ms: 0.09603803519062176 sampler_results: custom_metrics: {} episode_len_mean: 124.79 episode_media: {} episode_reward_max: 200.0 episode_reward_mean: 124.79 episode_reward_min: 9.0 episodes_this_iter: 20 hist_stats: episode_lengths: - 45 - 31 - 44 - 103 - 62 - 72 - 20 - 15 - 35 - 12 - 9 - 24 - 10 - 102 - 93 - 73 - 27 - 52 - 144 - 19 - 140 - 91 - 133 - 147 - 140 - 90 - 14 - 73 - 71 - 200 - 55 - 184 - 103 - 196 - 168 - 177 - 38 - 33 - 50 - 149 - 67 - 87 - 25 - 134 - 42 - 26 - 24 - 121 - 61 - 109 - 19 - 200 - 60 - 40 - 51 - 88 - 30 - 200 - 186 - 200 - 182 - 196 - 200 - 200 - 200 - 200 - 200 - 200 - 43 - 200 - 109 - 156 - 200 - 183 - 200 - 200 - 200 - 200 - 200 - 107 - 200 - 200 - 200 - 200 - 200 - 200 - 200 - 200 - 200 - 200 - 200 - 89 - 200 - 200 - 200 - 200 - 200 - 200 - 200 - 200 episode_reward: - 45.0 - 31.0 - 44.0 - 103.0 - 62.0 - 72.0 - 20.0 - 15.0 - 35.0 - 12.0 - 9.0 - 24.0 - 10.0 - 102.0 - 93.0 - 73.0 - 27.0 - 52.0 - 144.0 - 19.0 - 140.0 - 91.0 - 133.0 - 147.0 - 140.0 - 90.0 - 14.0 - 73.0 - 71.0 - 200.0 - 55.0 - 184.0 - 103.0 - 196.0 - 168.0 - 177.0 - 38.0 - 33.0 - 50.0 - 149.0 - 67.0 - 87.0 - 25.0 - 134.0 - 42.0 - 26.0 - 24.0 - 121.0 - 61.0 - 109.0 - 19.0 - 200.0 - 60.0 - 40.0 - 51.0 - 88.0 - 30.0 - 200.0 - 186.0 - 200.0 - 182.0 - 196.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 43.0 - 200.0 - 109.0 - 156.0 - 200.0 - 183.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 107.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 89.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 - 200.0 off_policy_estimator: {} policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.06817872750804764 mean_env_render_ms: 0.0 mean_env_wait_ms: 
0.05424549075766555 mean_inference_ms: 0.5976919122059019 mean_raw_obs_processing_ms: 0.09603803519062176 time_since_restore: 16.702913284301758 time_this_iter_s: 3.1872010231018066 time_total_s: 16.702913284301758 timers: learn_throughput: 2378.661 learn_time_ms: 1681.619 load_throughput: 16503261.853 load_time_ms: 0.242 training_iteration_time_ms: 3336.7 update_time_ms: 1.759 timestamp: 1652964897 timesteps_since_restore: 0 timesteps_total: 20000 training_iteration: 5 trial_id: cd8d6_00000 warmup_time: 10.095139741897583
2022-05-19 13:54:58,548 INFO tune.py:753 -- Total run time: 36.92 seconds (35.95 seconds for the tuning loop).
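The figures in the progress table can be cross-checked with a little arithmetic. This is a small sketch that only uses numbers copied from the output above; the throughput it derives is a rough estimate based on the reported training time, not a value that Ray reports itself.

```python
steps_per_iteration = 4000  # num_env_steps_sampled_this_iter in the result log
iterations = 5              # from stop={"training_iteration": 5}
total_time_s = 16.7029      # "total time (s)" column of the progress table

# Total environment steps sampled over the whole run ("ts" column).
total_steps = steps_per_iteration * iterations
print(total_steps)  # 20000

# Rough steps-per-second estimate over the training time (excludes setup).
steps_per_second = total_steps / total_time_s
print(round(steps_per_second))
```

The total matches the `ts` value of 20000 shown for the terminated trial.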
And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:
num_eval_episodes = 3
rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: {np.mean(rewards)}")
2022-05-19 13:54:58,589 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-19 13:54:58,590 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-05-19 13:54:58,591 INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-05-19 13:54:58,591 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=14191) 2022-05-19 13:55:06,622 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=14192) 2022-05-19 13:55:06,622 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 13:55:07,968 WARNING util.py:65 -- Install gputil for GPU system monitoring.
2022-05-19 13:55:08,021 INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16/AIRPPOTrainer_cd8d6_00000_0_2022-05-19_13-54-22/checkpoint_000005/checkpoint-5
2022-05-19 13:55:08,021 INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 16.702913284301758, '_episodes_total': 354}
Average reward over 3 episodes: 200.0