%%capture
! pip install "ray[rllib, serve, tune]==2.2.0"
! pip install "pyarrow==10.0.0"
! pip install "tensorflow>=2.9.0"
! pip install "transformers>=4.24.0"
! pip install "pygame==2.1.2" "gym==0.25.0"
import ray
ray.init()
2023-03-18 08:06:40,198 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Python version:  3.9.16
Ray version:     2.2.0
Dashboard:       http://127.0.0.1:8265
The following simple example creates a distributed Dataset on your local Ray Cluster from a Python data structure. Specifically, you’ll create a dataset from Python dictionaries, each containing a string name and an integer-valued data field, for 10,000 entries:
items = [{"name": str(i), "data": i} for i in range(10000)]
ds = ray.data.from_items(items)
ds.show(5)
{'name': '0', 'data': 0} {'name': '1', 'data': 1} {'name': '2', 'data': 2} {'name': '3', 'data': 3} {'name': '4', 'data': 4}
Great, now you have some rows, but what can you do with that data? The Dataset API bets heavily on functional programming, as this paradigm is well suited for data transformations. Even though Python 3 made a point of hiding some of its functional programming capabilities, you’re probably familiar with functionality such as map, filter, flat_map, and others. If not, it’s easy enough to pick up: map takes each element of your dataset and transforms it into something else, in parallel; filter removes data points according to a Boolean filter function; and the slightly more elaborate flat_map first maps values similarly to map, but then it also “flattens” the result. For instance, if map produced a list of lists, flat_map would flatten out the nested lists and give you just a list. Equipped with these three functional API calls, let’s see how easily you can transform your dataset ds:
# We map each row of ds to keep only the square of its data entry.
squares = ds.map(lambda x: x["data"] ** 2)

# Then we filter the squares to keep only even numbers (a total of five thousand elements).
evens = squares.filter(lambda x: x % 2 == 0)
evens.count()

# We then use flat_map to augment the remaining values with their respective cubes.
cubes = evens.flat_map(lambda x: [x, x**3])

# Taking a total of 10 values means leaving Ray and returning a Python list
# with these values that we can print.
sample = cubes.take(10)
print(sample)
2023-03-18 08:17:43,549 WARNING dataset.py:4233 -- The `map`, `flat_map`, and `filter` operations are unvectorized and can be very slow. Consider using `.map_batches()` instead. Map: 100%|██████████| 200/200 [00:02<00:00, 78.04it/s] Filter: 100%|██████████| 200/200 [00:00<00:00, 403.82it/s] Flat_Map: 100%|██████████| 200/200 [00:00<00:00, 329.68it/s]
[0, 0, 4, 64, 16, 4096, 36, 46656, 64, 262144]
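Note the warning in the output above: row-by-row calls to map, filter, and flat_map are not vectorized and can be slow. As a hedged sketch of the map_batches alternative the warning suggests (the small integer dataset below is made up purely for illustration, not part of the original example), you can transform an entire batch of items in one call:

# Build a tiny dataset of plain integers; with simple Python items, each batch
# arrives as a regular Python list that we can process in one go.
nums = ray.data.from_items(list(range(10)))
squared = nums.map_batches(lambda batch: [x ** 2 for x in batch])
squared.show(3)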
The drawback of Dataset transformations is that each step gets executed synchronously. In this example that is a nonissue, but for complex tasks that, for example, mix reading files and processing data, you would want an execution that can overlap individual tasks. DatasetPipeline does exactly that. Let’s rewrite the previous example into a pipeline:
# You can turn a Dataset into a pipeline by calling .window() on it.
pipe = ds.window()

# Pipeline steps can be chained to yield the same result as before.
result = pipe\
    .map(lambda x: x["data"] ** 2)\
    .filter(lambda x: x % 2 == 0)\
    .flat_map(lambda x: [x, x**3])
result.show(10)
2023-03-18 08:20:49,252 INFO dataset.py:3693 -- Created DatasetPipeline with 20 windows: 7390b min, 8000b max, 7944b mean 2023-03-18 08:20:49,255 INFO dataset.py:3703 -- Blocks per window: 10 min, 10 max, 10 mean 2023-03-18 08:20:49,262 INFO dataset.py:3725 -- ✔️ This pipeline's per-window parallelism is high enough to fully utilize the cluster. 2023-03-18 08:20:49,266 INFO dataset.py:3742 -- ✔️ This pipeline's windows likely fit in object store memory without spilling. Stage 0: 0%| | 0/20 [00:00<?, ?it/s] 0%| | 0/20 [00:00<?, ?it/s] Stage 1: 0%| | 0/20 [00:00<?, ?it/s] Stage 1: 5%|▌ | 1/20 [00:00<00:03, 5.80it/s] Stage 0: 10%|█ | 2/20 [00:00<00:01, 10.96it/s]
0 0 4 64 16 4096 36 46656 64 262144
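Pipelines pay off most when reading and processing can overlap. As a minimal sketch, assuming a directory of Parquet files with a data column (both the path and the column name are placeholders, not part of this example), you could start the pipeline straight from storage:

# Hypothetical: stream Parquet files through the same squaring step,
# processing ten blocks per window while later files are still being read.
pipe = (
    ray.data.read_parquet("path/to/parquet/dir")  # placeholder path
    .window(blocks_per_window=10)
    .map(lambda row: row["data"] ** 2)  # assumes a "data" column
)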
Moving on to the next set of libraries, let’s look at the distributed training capabilities of Ray. For that, you have access to two libraries. One is dedicated to reinforcement learning specifically; the other one has a different scope and is aimed primarily at supervised learning tasks.
Let’s start with Ray RLlib for reinforcement learning (RL). This library is powered by the modern ML frameworks TensorFlow and PyTorch, and you can choose which one to use. Both frameworks seem to converge more and more conceptually, so you can pick the one you like most without losing much in the process.
One of the easiest ways to run examples with RLlib is to use the command-line tool rllib, which we already installed implicitly when we ran pip install "ray[rllib]".
We’ll look at a fairly classic control problem of balancing a pole on a cart. Imagine you have a pole like the one in the figure below, fixed at a joint of a cart and subject to gravity. The cart is free to move along a frictionless track, and you can manipulate the cart by giving it a push from the left or the right with a fixed force. If you do this well enough, the pole will remain in an upright position. For each time step the pole doesn’t fall over, we get a reward of 1. Collecting a high reward is our goal, and the question is whether we can teach a reinforcement learning algorithm to do this for us.
Specifically, we want to train a reinforcement learning agent that can carry out two actions, namely, push to the left or to the right, observe what happens when interacting with the environment in that way, and learn from the experience by maximizing the reward.
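To make the environment concrete before handing it to RLlib, here is a minimal sketch that uses the gym package installed earlier to run one episode of CartPole-v1 with random pushes; it is only an illustration and not part of the RLlib training below.

import gym

# Create the cart-pole environment and run a single episode with random actions.
env = gym.make("CartPole-v1")
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # 0 pushes the cart left, 1 pushes it right
    obs, reward, done, info = env.step(action)
    total_reward += reward  # +1 for every step the pole stays up
print(f"Episode reward: {total_reward}")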
To tackle this problem with Ray RLlib, we can use a so-called tuned example, which is a preconfigured algorithm that runs well for a given problem. You can run a tuned example with a single command. RLlib comes with many such examples, and you can list them all with rllib example list.
! rllib example list
RLlib Examples ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Example ID ┃ Description ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ atari-a2c │ Runs grid search over several Atari games │ │ │ on A2C. │ │ atari-dqn │ Run grid search on Atari environments with │ │ │ DQN. │ │ atari-duel-ddqn │ Run grid search on Atari environments with │ │ │ duelling double DQN. │ │ atari-impala │ Run grid search over several atari games │ │ │ with IMPALA. │ │ atari-ppo │ Run grid search over several atari games │ │ │ with PPO. │ │ atari-sac │ Run grid search on several atari games │ │ │ with SAC. │ │ breakout-apex-dqn │ Runs Apex DQN on BreakoutNoFrameskip-v4. │ │ breakout-ddppo │ Runs DDPPO on BreakoutNoFrameskip-v4. │ │ cartpole-a2c │ Runs A2C on the CartPole-v1 environment. │ │ cartpole-a2c-micro │ Runs A2C on the CartPole-v1 environment, │ │ │ using micro-batches. │ │ cartpole-a3c │ Runs A3C on the CartPole-v1 environment. │ │ cartpole-alpha-zero │ Runs AlphaZero on a Cartpole with sparse │ │ │ rewards. │ │ cartpole-apex-dqn │ Runs Apex DQN on CartPole-v1. │ │ cartpole-appo │ Runs APPO on CartPole-v1. │ │ cartpole-ars │ Runs ARS on CartPole-v1. │ │ cartpole-bc │ Runs BC on CartPole-v1. │ │ cartpole-crr │ Run CRR on CartPole-v1. │ │ cartpole-ddppo │ Runs DDPPO on CartPole-v1 │ │ cartpole-dqn │ Run DQN on CartPole-v1. │ │ cartpole-dt │ Run DT on CartPole-v1. │ │ cartpole-es │ Run ES on CartPole-v1. │ │ cartpole-impala │ Run IMPALA on CartPole-v1. │ │ cartpole-maml │ Run MAML on CartPole-v1. │ │ cartpole-marwil │ Run MARWIL on CartPole-v1. │ │ cartpole-mbmpo │ Run MBMPO on a CartPole environment │ │ │ wrapper. │ │ cartpole-pg │ Run PG on CartPole-v1 │ │ cartpole-ppo │ Run PPO on CartPole-v1. │ │ cartpole-sac │ Run SAC on CartPole-v1 │ │ cartpole-simpleq │ Run SimpleQ on CartPole-v1 │ │ dm-control-dreamer │ Run DREAMER on a suite of control problems │ │ │ by Deepmind. │ │ frozenlake-appo │ Runs APPO on FrozenLake-v1. │ │ halfcheetah-appo │ Runs APPO on HalfCheetah-v2. │ │ halfcheetah-bullet-ddpg │ Runs DDPG on HalfCheetahBulletEnv-v0. │ │ halfcheetah-cql │ Runs grid search on HalfCheetah │ │ │ environments with CQL. │ │ halfcheetah-ddpg │ Runs DDPG on HalfCheetah-v2. │ │ halfcheetah-maml │ Run MAML on a custom HalfCheetah │ │ │ environment. │ │ halfcheetah-mbmpo │ Run MBMPO on a HalfCheetah environment │ │ │ wrapper. │ │ halfcheetah-ppo │ Run PPO on HalfCheetah-v2. │ │ halfcheetah-sac │ Run SAC on HalfCheetah-v3. │ │ hopper-bullet-ddpg │ Runs DDPG on HopperBulletEnv-v0. │ │ hopper-cql │ Runs grid search on Hopper environments │ │ │ with CQL. │ │ hopper-mbmpo │ Run MBMPO on a Hopper environment wrapper. │ │ hopper-ppo │ Run PPO on Hopper-v1. │ │ humanoid-es │ Run ES on Humanoid-v2. │ │ humanoid-ppo │ Run PPO on Humanoid-v1. │ │ inverted-pendulum-td3 │ Run TD3 on InvertedPendulum-v2. │ │ mountaincar-apex-ddpg │ Runs Apex DDPG on │ │ │ MountainCarContinuous-v0. │ │ mountaincar-ddpg │ Runs DDPG on MountainCarContinuous-v0. │ │ mujoco-td3 │ Run TD3 against four of the hardest MuJoCo │ │ │ tasks. │ │ multi-agent-cartpole-alpha-star │ Runs AlphaStar on 4 CartPole agents. │ │ multi-agent-cartpole-appo │ Runs APPO on RLlib's MultiAgentCartPole │ │ multi-agent-cartpole-impala │ Run IMPALA on RLlib's MultiAgentCartPole │ │ pacman-sac │ Run SAC on MsPacmanNoFrameskip-v4. │ │ pendulum-apex-ddpg │ Runs Apex DDPG on Pendulum-v1. │ │ pendulum-appo │ Runs APPO on Pendulum-v1. │ │ pendulum-cql │ Runs CQL on Pendulum-v1. 
│ │ pendulum-crr │ Run CRR on Pendulum-v1. │ │ pendulum-ddpg │ Runs DDPG on Pendulum-v1. │ │ pendulum-ddppo │ Runs DDPPO on Pendulum-v1. │ │ pendulum-dt │ Run DT on Pendulum-v1. │ │ pendulum-impala │ Run IMPALA on Pendulum-v1. │ │ pendulum-maml │ Run MAML on a custom Pendulum environment. │ │ pendulum-mbmpo │ Run MBMPO on a Pendulum environment │ │ │ wrapper. │ │ pendulum-ppo │ Run PPO on Pendulum-v1. │ │ pendulum-sac │ Run SAC on Pendulum-v1. │ │ pendulum-td3 │ Run TD3 on Pendulum-v1. │ │ pong-a3c │ Runs A3C on the PongDeterministic-v4 │ │ │ environment. │ │ pong-apex-dqn │ Runs Apex DQN on PongNoFrameskip-v4. │ │ pong-appo │ Runs APPO on PongNoFrameskip-v4. │ │ pong-dqn │ Run DQN on PongDeterministic-v4. │ │ pong-impala │ Run IMPALA on PongNoFrameskip-v4. │ │ pong-ppo │ Run PPO on PongNoFrameskip-v4. │ │ pong-rainbow │ Run Rainbow on PongDeterministic-v4. │ │ recsys-bandits │ Runs BanditLinUCB on a Recommendation │ │ │ Simulation environment. │ │ recsys-long-term-slateq │ Run SlateQ on a recommendation system │ │ │ aimed at long-term satisfaction. │ │ recsys-parametric-slateq │ SlateQ run on a recommendation system. │ │ recsys-ppo │ Run PPO on a recommender system example │ │ │ from RLlib. │ │ recsys-slateq │ SlateQ run on a recommendation system. │ │ repeatafterme-ppo │ Run PPO on RLlib's RepeatAfterMe │ │ │ environment. │ │ stateless-cartpole-r2d2 │ Run R2D2 on a stateless cart pole │ │ │ environment. │ │ swimmer-ars │ Runs ARS on Swimmer-v2. │ │ two-step-game-maddpg │ Run RLlib's Two-step game with multi-agent │ │ │ DDPG. │ │ two-step-game-qmix │ Run QMIX on RLlib's two-step game. │ │ walker2d-ppo │ Run PPO on the Walker2d-v1 environment. │ └─────────────────────────────────┴────────────────────────────────────────────┘ Run any RLlib example as using 'rllib example run <Example ID>'.See 'rllib example run --help' for more information.
One of the available examples is cartpole-ppo, a tuned example that uses the PPO algorithm to solve the cart–pole problem, specifically, the CartPole-v1 environment from OpenAI Gym.
cartpole-ppo:
    env: CartPole-v1            # [1]
    run: PPO                    # [2]
    stop:
        episode_reward_mean: 150    # [3]
        timesteps_total: 100000
    config:                     # [4]
        framework: tf
        gamma: 0.99
        lr: 0.0003
        num_workers: 1
        observation_filter: MeanStdFilter
        num_sgd_iter: 6
        vf_loss_coeff: 0.01
        model:
            fcnet_hiddens: [32]
            fcnet_activation: linear
            vf_share_layers: true
        enable_connectors: True
The CartPole-v1 environment simulates the problem we just described. The details of this configuration file don’t matter much at this point, so don’t get distracted by them. The important part is that you specify the CartPole-v1 environment and sufficient RL-specific configuration to ensure the training procedure works. Running this configuration doesn’t require any special hardware and finishes in a matter of minutes.
! rllib example run cartpole-ppo
== Status == Current time: 2023-03-18 08:34:37 (running for 00:03:28.54) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 5 | 168.401 | 20000 | 108.29 | 500 | 13 | 108.29 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ == Status == Current time: 2023-03-18 08:34:42 (running for 00:03:33.55) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 5 | 168.401 | 20000 | 108.29 | 500 | 13 | 108.29 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ == Status == Current time: 2023-03-18 08:34:47 (running for 00:03:38.55) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. 
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 5 | 168.401 | 20000 | 108.29 | 500 | 13 | 108.29 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ Result for PPO_CartPole-v1_3bb16_00000: agent_timesteps_total: 24000 counters: num_agent_steps_sampled: 24000 num_agent_steps_trained: 24000 num_env_steps_sampled: 24000 num_env_steps_trained: 24000 custom_metrics: {} date: 2023-03-18_08-34-48 done: false episode_len_mean: 142.47 episode_media: {} episode_reward_max: 500.0 episode_reward_mean: 142.47 episode_reward_min: 13.0 episodes_this_iter: 12 episodes_total: 429 experiment_id: db1de6e2783647b49113496fff88c803 hostname: 0738217da70e info: learner: default_policy: custom_metrics: {} diff_num_grad_updates_vs_sampler_policy: 95.5 learner_stats: cur_kl_coeff: 0.05000000074505806 cur_lr: 0.0003000000142492354 entropy: 0.5886597633361816 entropy_coeff: 0.0 kl: 0.0035236419644206762 policy_loss: -0.002459704177454114 total_loss: 0.09718986600637436 vf_explained_var: 0.0019282136345282197 vf_loss: 9.947339057922363 num_agent_steps_trained: 125.0 num_grad_updates_lifetime: 1056.5 num_agent_steps_sampled: 24000 num_agent_steps_trained: 24000 num_env_steps_sampled: 24000 num_env_steps_trained: 24000 iterations_since_restore: 6 node_ip: 172.28.0.12 num_agent_steps_sampled: 24000 num_agent_steps_trained: 24000 num_env_steps_sampled: 24000 num_env_steps_sampled_this_iter: 4000 num_env_steps_trained: 24000 num_env_steps_trained_this_iter: 4000 num_faulty_episodes: 0 num_healthy_workers: 1 num_in_flight_async_reqs: 0 num_remote_worker_restarts: 0 num_steps_trained_this_iter: 4000 perf: cpu_util_percent: 73.82000000000001 ram_util_percent: 23.102222222222224 pid: 7624 policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.15060454694754802 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.1447757987134971 mean_inference_ms: 4.981463969697947 mean_raw_obs_processing_ms: 0.9265207062667011 sampler_results: custom_metrics: {} episode_len_mean: 142.47 episode_media: {} episode_reward_max: 500.0 episode_reward_mean: 142.47 episode_reward_min: 13.0 episodes_this_iter: 12 hist_stats: episode_lengths: [99, 13, 16, 58, 74, 14, 71, 48, 162, 37, 67, 13, 152, 24, 34, 61, 140, 13, 24, 25, 77, 87, 60, 39, 29, 21, 30, 125, 14, 18, 147, 71, 14, 123, 20, 169, 18, 57, 235, 23, 134, 92, 94, 127, 225, 139, 187, 174, 163, 101, 39, 97, 65, 140, 41, 35, 17, 65, 142, 55, 169, 275, 315, 33, 155, 139, 151, 73, 52, 183, 65, 305, 274, 500, 83, 231, 191, 144, 248, 267, 363, 37, 162, 159, 500, 92, 138, 282, 286, 212, 56, 219, 452, 329, 500, 232, 398, 332, 491, 500] episode_reward: [99.0, 13.0, 16.0, 58.0, 74.0, 14.0, 71.0, 48.0, 162.0, 37.0, 67.0, 13.0, 152.0, 24.0, 34.0, 61.0, 140.0, 13.0, 24.0, 25.0, 77.0, 87.0, 60.0, 
39.0, 29.0, 21.0, 30.0, 125.0, 14.0, 18.0, 147.0, 71.0, 14.0, 123.0, 20.0, 169.0, 18.0, 57.0, 235.0, 23.0, 134.0, 92.0, 94.0, 127.0, 225.0, 139.0, 187.0, 174.0, 163.0, 101.0, 39.0, 97.0, 65.0, 140.0, 41.0, 35.0, 17.0, 65.0, 142.0, 55.0, 169.0, 275.0, 315.0, 33.0, 155.0, 139.0, 151.0, 73.0, 52.0, 183.0, 65.0, 305.0, 274.0, 500.0, 83.0, 231.0, 191.0, 144.0, 248.0, 267.0, 363.0, 37.0, 162.0, 159.0, 500.0, 92.0, 138.0, 282.0, 286.0, 212.0, 56.0, 219.0, 452.0, 329.0, 500.0, 232.0, 398.0, 332.0, 491.0, 500.0] num_faulty_episodes: 0 policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.15060454694754802 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.1447757987134971 mean_inference_ms: 4.981463969697947 mean_raw_obs_processing_ms: 0.9265207062667011 time_since_restore: 199.58355569839478 time_this_iter_s: 31.182859659194946 time_total_s: 199.58355569839478 timers: learn_throughput: 456.437 learn_time_ms: 8763.54 synch_weights_time_ms: 5.189 training_iteration_time_ms: 33240.144 timestamp: 1679128488 timesteps_since_restore: 0 timesteps_total: 24000 training_iteration: 6 trial_id: 3bb16_00000 warmup_time: 11.70189356803894 (PPO pid=7624) 2023-03-18 08:34:48,893 INFO filter_manager.py:34 -- Synchronizing filters ... (PPO pid=7624) 2023-03-18 08:34:48,899 INFO filter_manager.py:55 -- Updating remote filters ... == Status == Current time: 2023-03-18 08:34:54 (running for 00:03:44.79) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 6 | 199.584 | 24000 | 142.47 | 500 | 13 | 142.47 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ == Status == Current time: 2023-03-18 08:34:59 (running for 00:03:49.80) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. 
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 6 | 199.584 | 24000 | 142.47 | 500 | 13 | 142.47 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ == Status == Current time: 2023-03-18 08:35:04 (running for 00:03:54.80) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 6 | 199.584 | 24000 | 142.47 | 500 | 13 | 142.47 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ == Status == Current time: 2023-03-18 08:35:09 (running for 00:03:59.81) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 6 | 199.584 | 24000 | 142.47 | 500 | 13 | 142.47 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ == Status == Current time: 2023-03-18 08:35:14 (running for 00:04:04.82) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. 
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 6 | 199.584 | 24000 | 142.47 | 500 | 13 | 142.47 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ == Status == Current time: 2023-03-18 08:35:19 (running for 00:04:09.82) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 RUNNING) +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | RUNNING | 172.28.0.12:7624 | 6 | 199.584 | 24000 | 142.47 | 500 | 13 | 142.47 | +-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ Result for PPO_CartPole-v1_3bb16_00000: agent_timesteps_total: 28000 counters: num_agent_steps_sampled: 28000 num_agent_steps_trained: 28000 num_env_steps_sampled: 28000 num_env_steps_trained: 28000 custom_metrics: {} date: 2023-03-18_08-35-21 done: true episode_len_mean: 178.85 episode_media: {} episode_reward_max: 500.0 episode_reward_mean: 178.85 episode_reward_min: 13.0 episodes_this_iter: 9 episodes_total: 438 experiment_id: db1de6e2783647b49113496fff88c803 hostname: 0738217da70e info: learner: default_policy: custom_metrics: {} diff_num_grad_updates_vs_sampler_policy: 95.5 learner_stats: cur_kl_coeff: 0.02500000037252903 cur_lr: 0.0003000000142492354 entropy: 0.5833122730255127 entropy_coeff: 0.0 kl: 0.003216453595086932 policy_loss: -0.0005987154436297715 total_loss: 0.09903717041015625 vf_explained_var: -0.0003911629319190979 vf_loss: 9.955548286437988 num_agent_steps_trained: 125.0 num_grad_updates_lifetime: 1248.5 num_agent_steps_sampled: 28000 num_agent_steps_trained: 28000 num_env_steps_sampled: 28000 num_env_steps_trained: 28000 iterations_since_restore: 7 node_ip: 172.28.0.12 num_agent_steps_sampled: 28000 num_agent_steps_trained: 28000 num_env_steps_sampled: 28000 num_env_steps_sampled_this_iter: 4000 num_env_steps_trained: 28000 num_env_steps_trained_this_iter: 4000 num_faulty_episodes: 0 num_healthy_workers: 1 num_in_flight_async_reqs: 0 num_remote_worker_restarts: 0 num_steps_trained_this_iter: 4000 perf: cpu_util_percent: 78.09347826086956 ram_util_percent: 23.099999999999998 pid: 7624 
policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.15010584224508902 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.14444693410492337 mean_inference_ms: 4.976027958863329 mean_raw_obs_processing_ms: 0.9240927234928833 sampler_results: custom_metrics: {} episode_len_mean: 178.85 episode_media: {} episode_reward_max: 500.0 episode_reward_mean: 178.85 episode_reward_min: 13.0 episodes_this_iter: 9 hist_stats: episode_lengths: [37, 67, 13, 152, 24, 34, 61, 140, 13, 24, 25, 77, 87, 60, 39, 29, 21, 30, 125, 14, 18, 147, 71, 14, 123, 20, 169, 18, 57, 235, 23, 134, 92, 94, 127, 225, 139, 187, 174, 163, 101, 39, 97, 65, 140, 41, 35, 17, 65, 142, 55, 169, 275, 315, 33, 155, 139, 151, 73, 52, 183, 65, 305, 274, 500, 83, 231, 191, 144, 248, 267, 363, 37, 162, 159, 500, 92, 138, 282, 286, 212, 56, 219, 452, 329, 500, 232, 398, 332, 491, 500, 424, 500, 500, 500, 500, 500, 500, 500, 269] episode_reward: [37.0, 67.0, 13.0, 152.0, 24.0, 34.0, 61.0, 140.0, 13.0, 24.0, 25.0, 77.0, 87.0, 60.0, 39.0, 29.0, 21.0, 30.0, 125.0, 14.0, 18.0, 147.0, 71.0, 14.0, 123.0, 20.0, 169.0, 18.0, 57.0, 235.0, 23.0, 134.0, 92.0, 94.0, 127.0, 225.0, 139.0, 187.0, 174.0, 163.0, 101.0, 39.0, 97.0, 65.0, 140.0, 41.0, 35.0, 17.0, 65.0, 142.0, 55.0, 169.0, 275.0, 315.0, 33.0, 155.0, 139.0, 151.0, 73.0, 52.0, 183.0, 65.0, 305.0, 274.0, 500.0, 83.0, 231.0, 191.0, 144.0, 248.0, 267.0, 363.0, 37.0, 162.0, 159.0, 500.0, 92.0, 138.0, 282.0, 286.0, 212.0, 56.0, 219.0, 452.0, 329.0, 500.0, 232.0, 398.0, 332.0, 491.0, 500.0, 424.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 269.0] num_faulty_episodes: 0 policy_reward_max: {} policy_reward_mean: {} policy_reward_min: {} sampler_perf: mean_action_processing_ms: 0.15010584224508902 mean_env_render_ms: 0.0 mean_env_wait_ms: 0.14444693410492337 mean_inference_ms: 4.976027958863329 mean_raw_obs_processing_ms: 0.9240927234928833 time_since_restore: 232.44457721710205 time_this_iter_s: 32.861021518707275 time_total_s: 232.44457721710205 timers: learn_throughput: 470.055 learn_time_ms: 8509.634 synch_weights_time_ms: 5.025 training_iteration_time_ms: 33183.327 timestamp: 1679128521 timesteps_since_restore: 0 timesteps_total: 28000 training_iteration: 7 trial_id: 3bb16_00000 warmup_time: 11.70189356803894 (PPO pid=7624) 2023-03-18 08:35:21,839 INFO filter_manager.py:34 -- Synchronizing filters ... (PPO pid=7624) 2023-03-18 08:35:21,848 INFO filter_manager.py:55 -- Updating remote filters ... == Status == Current time: 2023-03-18 08:35:21 (running for 00:04:12.69) Memory usage on this node: 2.9/12.7 GiB Using FIFO scheduling algorithm. 
Resources requested: 0/2 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects Result logdir: /root/ray_results/cartpole-ppo Number of trials: 1/1 (1 TERMINATED) +-----------------------------+------------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ | Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean | |-----------------------------+------------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------| | PPO_CartPole-v1_3bb16_00000 | TERMINATED | 172.28.0.12:7624 | 7 | 232.445 | 28000 | 178.85 | 500 | 13 | 178.85 | +-----------------------------+------------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+ 2023-03-18 08:35:22,135 INFO tune.py:762 -- Total run time: 252.92 seconds (252.67 seconds for the tuning loop). Your training finished. Best available checkpoint for each trial: /root/ray_results/cartpole-ppo/PPO_CartPole-v1_3bb16_00000_0_2023-03-18_08-31- 09/checkpoint_000007 You can now evaluate your trained algorithm from any checkpoint, e.g. by running: ╭──────────────────────────────────────────────────────────────────────────────╮ │ rllib evaluate │ │ /root/ray_results/cartpole-ppo/PPO_CartPole-v1_3bb16_00000_0_2023-03-18_08-3 │ │ 1-09/checkpoint_000007 --algo PPO │ ╰──────────────────────────────────────────────────────────────────────────────╯
Your local Ray checkpoint folder is ~/ray_results by default. For the training configuration we used, your results and checkpoints end up under ~/ray_results/cartpole-ppo, as shown in the output above.
To evaluate the performance of your trained RL algorithm, you can now run it from a checkpoint by copying the rllib evaluate command that the training run printed.
Running this command will print evaluation results, namely the rewards achieved by your trained RL algorithm on the CartPole-v1 environment.
! rllib evaluate /root/ray_results/cartpole-ppo/PPO_CartPole-v1_3bb16_00000_0_2023-03-18_08-31-09/checkpoint_000007 --algo PPO
2023-03-18 08:40:10,744 INFO algorithm.py:1005 -- Ran round 1 of parallel evaluation (1/1 episodes done) Episode #23: reward: 500.0
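The rllib command line is a thin layer on top of RLlib’s Python API. If you prefer to stay in Python, a comparable training run could look like the following. This is a minimal sketch assuming Ray 2.2’s PPOConfig builder; it only mirrors a subset of the tuned example’s settings and is not the exact configuration the CLI used.

from ray.rllib.algorithms.ppo import PPOConfig

# Build a PPO algorithm for CartPole-v1, loosely mirroring the tuned example.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("tf")
    .rollouts(num_rollout_workers=1)
    .training(gamma=0.99, lr=0.0003)
)
algo = config.build()

# Train until the mean episode reward reaches 150, echoing the tuned example's stop criterion.
for _ in range(10):
    result = algo.train()
    if result["episode_reward_mean"] >= 150:
        break
algo.stop()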
Ray RLlib is dedicated to reinforcement learning, but what do you do if you need to train models for other types of machine learning, like supervised learning? For distributed training in that case you can use another Ray library: Ray Train. Closely related to training is hyperparameter tuning, which Ray covers with Ray Tune. The following example uses Tune to search for the best hyperparameters of a (simulated) training function:
from ray import tune
import math
import time
# Simulate an expensive training function that depends on two hyperparameters,
# x and y, read from a config.
def training_function(config):
    x, y = config["x"], config["y"]
    time.sleep(10)
    score = objective(x, y)
    # After sleeping for 10 seconds to simulate training and computing the
    # objective, the score is reported to Tune.
    tune.report(score=score)

# The objective computes the mean of the squares of x and y and returns the
# square root of this term. This type of objective is fairly common in ML.
def objective(x, y):
    return math.sqrt((x**2 + y**2) / 2)

# Use tune.run to initialize hyperparameter optimization on our training_function.
result = tune.run(
    training_function,
    config={
        # A key part is to provide a parameter space for x and y for Tune to search over.
        "x": tune.grid_search([-1, -.5, 0, .5, 1]),
        "y": tune.grid_search([-1, -.5, 0, .5, 1])
    })
print(result.get_best_config(metric="score", mode="min"))
Current time:  2023-03-18 08:52:36
Running for:   00:02:15.24
Memory:        1.5/12.7 GiB
Trial name | status | loc | x | y | iter | total time (s) | score |
---|---|---|---|---|---|---|---|
training_function_e994b_00000 | TERMINATED | 172.28.0.12:13138 | -1 | -1 | 1 | 10.193 | 1 |
training_function_e994b_00001 | TERMINATED | 172.28.0.12:13186 | -0.5 | -1 | 1 | 10.05 | 0.790569 |
training_function_e994b_00002 | TERMINATED | 172.28.0.12:13138 | 0 | -1 | 1 | 10.0499 | 0.707107 |
training_function_e994b_00003 | TERMINATED | 172.28.0.12:13186 | 0.5 | -1 | 1 | 10.0483 | 0.790569 |
training_function_e994b_00004 | TERMINATED | 172.28.0.12:13138 | 1 | -1 | 1 | 10.0472 | 1 |
training_function_e994b_00005 | TERMINATED | 172.28.0.12:13186 | -1 | -0.5 | 1 | 10.0501 | 0.790569 |
training_function_e994b_00006 | TERMINATED | 172.28.0.12:13138 | -0.5 | -0.5 | 1 | 10.0503 | 0.5 |
training_function_e994b_00007 | TERMINATED | 172.28.0.12:13186 | 0 | -0.5 | 1 | 10.0493 | 0.353553 |
training_function_e994b_00008 | TERMINATED | 172.28.0.12:13138 | 0.5 | -0.5 | 1 | 10.0502 | 0.5 |
training_function_e994b_00009 | TERMINATED | 172.28.0.12:13186 | 1 | -0.5 | 1 | 10.0474 | 0.790569 |
training_function_e994b_00010 | TERMINATED | 172.28.0.12:13138 | -1 | 0 | 1 | 10.0501 | 0.707107 |
training_function_e994b_00011 | TERMINATED | 172.28.0.12:13186 | -0.5 | 0 | 1 | 10.0506 | 0.353553 |
training_function_e994b_00012 | TERMINATED | 172.28.0.12:13138 | 0 | 0 | 1 | 10.0502 | 0 |
training_function_e994b_00013 | TERMINATED | 172.28.0.12:13186 | 0.5 | 0 | 1 | 10.0485 | 0.353553 |
training_function_e994b_00014 | TERMINATED | 172.28.0.12:13138 | 1 | 0 | 1 | 10.0495 | 0.707107 |
training_function_e994b_00015 | TERMINATED | 172.28.0.12:13186 | -1 | 0.5 | 1 | 10.0494 | 0.790569 |
training_function_e994b_00016 | TERMINATED | 172.28.0.12:13138 | -0.5 | 0.5 | 1 | 10.0458 | 0.5 |
training_function_e994b_00017 | TERMINATED | 172.28.0.12:13186 | 0 | 0.5 | 1 | 10.0489 | 0.353553 |
training_function_e994b_00018 | TERMINATED | 172.28.0.12:13138 | 0.5 | 0.5 | 1 | 10.0503 | 0.5 |
training_function_e994b_00019 | TERMINATED | 172.28.0.12:13186 | 1 | 0.5 | 1 | 10.0503 | 0.790569 |
training_function_e994b_00020 | TERMINATED | 172.28.0.12:13138 | -1 | 1 | 1 | 10.0499 | 1 |
training_function_e994b_00021 | TERMINATED | 172.28.0.12:13186 | -0.5 | 1 | 1 | 10.0504 | 0.790569 |
training_function_e994b_00022 | TERMINATED | 172.28.0.12:13138 | 0 | 1 | 1 | 10.0494 | 0.707107 |
training_function_e994b_00023 | TERMINATED | 172.28.0.12:13186 | 0.5 | 1 | 1 | 10.0468 | 0.790569 |
training_function_e994b_00024 | TERMINATED | 172.28.0.12:13138 | 1 | 1 | 1 | 10.05 | 1 |
Trial name | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | node_ip | pid | score | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
training_function_e994b_00000 | 2023-03-18_08-50-35 | True | a2865e213e9242f5a4c2741709618e0a | 0_x=-1,y=-1 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 1 | 10.193 | 10.193 | 10.193 | 1679129435 | 0 | 1 | e994b_00000 | 0.0200734 | ||
training_function_e994b_00001 | 2023-03-18_08-50-38 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 1_x=-0.5000,y=-1 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.05 | 10.05 | 10.05 | 1679129438 | 0 | 1 | e994b_00001 | 0.00650549 | ||
training_function_e994b_00002 | 2023-03-18_08-50-45 | True | a2865e213e9242f5a4c2741709618e0a | 2_x=0,y=-1 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.707107 | 10.0499 | 10.0499 | 10.0499 | 1679129445 | 0 | 1 | e994b_00002 | 0.0200734 | ||
training_function_e994b_00003 | 2023-03-18_08-50-48 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 3_x=0.5000,y=-1 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.0483 | 10.0483 | 10.0483 | 1679129448 | 0 | 1 | e994b_00003 | 0.00650549 | ||
training_function_e994b_00004 | 2023-03-18_08-50-55 | True | a2865e213e9242f5a4c2741709618e0a | 4_x=1,y=-1 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 1 | 10.0472 | 10.0472 | 10.0472 | 1679129455 | 0 | 1 | e994b_00004 | 0.0200734 | ||
training_function_e994b_00005 | 2023-03-18_08-50-59 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 5_x=-1,y=-0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.0501 | 10.0501 | 10.0501 | 1679129459 | 0 | 1 | e994b_00005 | 0.00650549 | ||
training_function_e994b_00006 | 2023-03-18_08-51-05 | True | a2865e213e9242f5a4c2741709618e0a | 6_x=-0.5000,y=-0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.5 | 10.0503 | 10.0503 | 10.0503 | 1679129465 | 0 | 1 | e994b_00006 | 0.0200734 | ||
training_function_e994b_00007 | 2023-03-18_08-51-09 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 7_x=0,y=-0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.353553 | 10.0493 | 10.0493 | 10.0493 | 1679129469 | 0 | 1 | e994b_00007 | 0.00650549 | ||
training_function_e994b_00008 | 2023-03-18_08-51-15 | True | a2865e213e9242f5a4c2741709618e0a | 8_x=0.5000,y=-0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.5 | 10.0502 | 10.0502 | 10.0502 | 1679129475 | 0 | 1 | e994b_00008 | 0.0200734 | ||
training_function_e994b_00009 | 2023-03-18_08-51-19 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 9_x=1,y=-0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.0474 | 10.0474 | 10.0474 | 1679129479 | 0 | 1 | e994b_00009 | 0.00650549 | ||
training_function_e994b_00010 | 2023-03-18_08-51-25 | True | a2865e213e9242f5a4c2741709618e0a | 10_x=-1,y=0 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.707107 | 10.0501 | 10.0501 | 10.0501 | 1679129485 | 0 | 1 | e994b_00010 | 0.0200734 | ||
training_function_e994b_00011 | 2023-03-18_08-51-29 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 11_x=-0.5000,y=0 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.353553 | 10.0506 | 10.0506 | 10.0506 | 1679129489 | 0 | 1 | e994b_00011 | 0.00650549 | ||
training_function_e994b_00012 | 2023-03-18_08-51-35 | True | a2865e213e9242f5a4c2741709618e0a | 12_x=0,y=0 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0 | 10.0502 | 10.0502 | 10.0502 | 1679129495 | 0 | 1 | e994b_00012 | 0.0200734 | ||
training_function_e994b_00013 | 2023-03-18_08-51-39 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 13_x=0.5000,y=0 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.353553 | 10.0485 | 10.0485 | 10.0485 | 1679129499 | 0 | 1 | e994b_00013 | 0.00650549 | ||
training_function_e994b_00014 | 2023-03-18_08-51-45 | True | a2865e213e9242f5a4c2741709618e0a | 14_x=1,y=0 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.707107 | 10.0495 | 10.0495 | 10.0495 | 1679129505 | 0 | 1 | e994b_00014 | 0.0200734 | ||
training_function_e994b_00015 | 2023-03-18_08-51-49 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 15_x=-1,y=0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.0494 | 10.0494 | 10.0494 | 1679129509 | 0 | 1 | e994b_00015 | 0.00650549 | ||
training_function_e994b_00016 | 2023-03-18_08-51-56 | True | a2865e213e9242f5a4c2741709618e0a | 16_x=-0.5000,y=0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.5 | 10.0458 | 10.0458 | 10.0458 | 1679129516 | 0 | 1 | e994b_00016 | 0.0200734 | ||
training_function_e994b_00017 | 2023-03-18_08-51-59 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 17_x=0,y=0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.353553 | 10.0489 | 10.0489 | 10.0489 | 1679129519 | 0 | 1 | e994b_00017 | 0.00650549 | ||
training_function_e994b_00018 | 2023-03-18_08-52-06 | True | a2865e213e9242f5a4c2741709618e0a | 18_x=0.5000,y=0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.5 | 10.0503 | 10.0503 | 10.0503 | 1679129526 | 0 | 1 | e994b_00018 | 0.0200734 | ||
training_function_e994b_00019 | 2023-03-18_08-52-09 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 19_x=1,y=0.5000 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.0503 | 10.0503 | 10.0503 | 1679129529 | 0 | 1 | e994b_00019 | 0.00650549 | ||
training_function_e994b_00020 | 2023-03-18_08-52-16 | True | a2865e213e9242f5a4c2741709618e0a | 20_x=-1,y=1 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 1 | 10.0499 | 10.0499 | 10.0499 | 1679129536 | 0 | 1 | e994b_00020 | 0.0200734 | ||
training_function_e994b_00021 | 2023-03-18_08-52-19 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 21_x=-0.5000,y=1 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.0504 | 10.0504 | 10.0504 | 1679129539 | 0 | 1 | e994b_00021 | 0.00650549 | ||
training_function_e994b_00022 | 2023-03-18_08-52-26 | True | a2865e213e9242f5a4c2741709618e0a | 22_x=0,y=1 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 0.707107 | 10.0494 | 10.0494 | 10.0494 | 1679129546 | 0 | 1 | e994b_00022 | 0.0200734 | ||
training_function_e994b_00023 | 2023-03-18_08-52-30 | True | 96496a0a28f24fc7904fb5b942aa64f1 | 23_x=0.5000,y=1 | 0738217da70e | 1 | 172.28.0.12 | 13186 | 0.790569 | 10.0468 | 10.0468 | 10.0468 | 1679129550 | 0 | 1 | e994b_00023 | 0.00650549 | ||
training_function_e994b_00024 | 2023-03-18_08-52-36 | True | a2865e213e9242f5a4c2741709618e0a | 24_x=1,y=1 | 0738217da70e | 1 | 172.28.0.12 | 13138 | 1 | 10.05 | 10.05 | 10.05 | 1679129556 | 0 | 1 | e994b_00024 | 0.0200734 |
2023-03-18 08:52:36,761 INFO tune.py:762 -- Total run time: 136.82 seconds (135.23 seconds for the tuning loop).
{'x': 0, 'y': 0}
Notice how the output of this run is structurally similar to what you saw in the RLlib example. That’s no coincidence, as RLlib (like many other Ray libraries) uses Ray Tune under the hood. If you watch the run closely, you will see PENDING trials that wait for execution, as well as RUNNING and TERMINATED ones. Tune takes care of selecting, scheduling, and executing your training runs automatically.
Specifically, this Tune example finds the best possible choices of the parameters x and y for a training_function with a given objective we want to minimize. Even though the objective function might look a little intimidating at first, it is built from the squares of x and y, so all of its values are non-negative. That means the smallest value is obtained at x=0 and y=0, where the objective function evaluates to 0.
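As a quick sanity check (these calls simply reuse the objective function defined above), you can compare a few values against the score column in the results table:

print(objective(0, 0))    # 0.0, the minimum
print(objective(0.5, 0))  # ~0.3536, matching the score of the (0.5, 0) trial
print(objective(1, 1))    # 1.0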
We do a so-called grid search over all possible parameter combinations. As we explicitly pass in 5 possible values each for x and y, that’s a total of 25 combinations that get fed into the training function. Since we instruct training_function to sleep for 10 seconds, testing all combinations of hyperparameters sequentially would take more than 4 minutes in total. Because Ray is smart about parallelizing this workload, the whole experiment took us a little over two minutes on two CPUs, but it might take more or less time depending on where you run it.
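Grid search is only one of the search strategies Tune offers. As a hedged sketch (the ranges and the number of samples below are illustrative choices, not taken from the run above), the same training_function could be tuned with random sampling instead:

# Sample 20 random (x, y) pairs from continuous ranges instead of a fixed grid.
analysis = tune.run(
    training_function,
    num_samples=20,
    config={
        "x": tune.uniform(-1, 1),
        "y": tune.uniform(-1, 1),
    },
)
print(analysis.get_best_config(metric="score", mode="min"))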
The last of Ray’s high-level libraries we’ll discuss specializes in model serving and is simply called Ray Serve. To see an example of it in action, you need a trained ML model to serve. Luckily, nowadays you can find many interesting models on the internet that have already been trained for you. For instance, Hugging Face has a variety of models available for you to download directly in Python. The model we’ll use is a language model called GPT-2 that takes text as input and produces text to continue or complete the input. For example, you can prompt it with a question, and GPT-2 will try to complete it.
Serving such a model is a good way to make it accessible. You may not know how to load and run a TensorFlow model on your computer, but you do know how to ask a question in plain English. Model serving hides the implementation details of a solution and lets users focus on providing inputs and understanding outputs of a model.
To proceed, make sure to run pip install transformers to install the Hugging Face library that has the model we want to use. With that, we can now import and start an instance of Ray’s serve library, load and deploy a GPT-2 model, and ask it for the meaning of life, like so:
from ray import serve
from transformers import pipeline
import requests
# Start Serve locally.
serve.start()

# The @serve.deployment decorator turns a function with a request parameter
# into a Serve deployment.
@serve.deployment
def model(request):
    # Loading language_model inside the model function for every request is
    # inefficient, but it’s the quickest way to show you a deployment.
    language_model = pipeline("text-generation", model="gpt2")
    query = request.query_params["query"]
    # Ask the model to give us at most 100 tokens to continue our query.
    return language_model(query, max_length=100)

# Formally deploy the model so that it can start receiving requests over HTTP.
model.deploy()

query = "What's the meaning of life?"
# Use the indispensable requests library to get a response for any question you might have.
response = requests.get(f"http://localhost:8000/model?query={query}")
print(response.text)
[{"generated_text": "What's the meaning of life?\n\nThe meaning of life is the idea that \"being alive\" isn't just a \"real life experience\". There's a lot of life around you, to be human. Life can seem strange at first and confusing at first, but it's the same when you know it is happening. When you have your life, you can be alive. And you can stay in it.\n\nHow are you feeling now?\n\nIt feels like I'm at"}]