Grid2Op integration with ray / rllib framework

Objectives: this notebook briefly explains how to use grid2op with the ray (rllib) RL framework. Make sure to read the previous notebook 11_IntegrationWithExistingRLFrameworks.ipynb for a deeper dive into what happens. We only show the working solution here.

This explains the ideas and shows a "self contained" somewhat minimal example of use of ray / rllib framework with grid2op. It is not meant to be fully generic, code might need to be adjusted.

This notebook is more an "example of what works" rather than a deep dive tutorial.

See https://docs.ray.io/en/latest/rllib/rllib-env.html#configuring-environments for more detailed information.

This notebook is tested with grid2op 1.10.2 and ray 2.24.0 (python3.10) on an ubuntu 20.04 machine.

We found ray to be highly "unstable": its documentation does not really keep up with its development rhythm. Basically, this notebook works with the exact python and ray versions given above. If you change them, you might need to modify the calls to ray.

It is organised as follows:

0 Some tips to get started : a reminder on what you can do to make things work. Indeed, this notebook explains "how to use grid2op with ray / rllib" but not "how to create a working agent able to operate a real powergrid in real time with ray / rllib". We wish we could explain the latter...
1 Create the "Grid2opEnvWrapper" class : explains how to create the main grid2op env class that you can use as a "gymnasium" environment.
2 Create an environment, and train a first policy: shows how to create an environment from the class above (it is pretty easy)
3 Evaluate the trained agent : shows how to evaluate the trained "agent"
4 Some customizations: explains how to perform some customization of your agent / environment / policy

0 Some tips to get started

It is unlikely that "simply" using a RL algorithm on a grid2op environment will lead to good results for the vast majority of environments.

To make RL algorithms work with more or less success you might want to:

adjust the observation space: in particular, select the right information for your agent. Too much information and the size of the observation space will blow up and your agent will not learn anything. Not enough information and your agent will not be able to capture anything.
customize the action space: dealing with both discrete and continuous values is often a challenge, so you may want to focus on only one type of action. In all cases, try to reduce the number of actions your agent can perform. Indeed, for "larger" grids (118 substations; as a reference, the French grid counts more than 6,000 such substations...) and even when limited to 2 busbars per substation (as a reference, some substations have more than 12 such "busbars"), your agent can choose between more than 60,000 different discrete actions at each step. This is way too large for current RL algorithms as far as we know (and the proposed environments are small in comparison to real ones).
customize the reward: the default reward might not work great for you. Ultimately, what TSOs or ISOs want is to operate the grid safely, as long as possible, with a cost as low as possible. It is of course really hard to capture everything in one single reward signal. Customizing the reward is also really important because the "do nothing" policy often leads to really good results (much better than random actions), which makes exploring different actions difficult. So you kind of want to incentivize your agent to perform some actions at some point.
use a fast simulator: even if you target an industrial application with industry grade simulators, we would still advise you to use a fast simulator for the vast majority of the training process (at the early stages of training at least) and then maybe to fine tune on a better one.
combine RL with some heuristics: it's super easy to implement things like "if there is no issue, then do nothing". This can be quite time consuming to learn though. Don't hesitate to check out the "l2rpn-baselines" repository for already "kind of working" heuristics.
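The "if there is no issue, then do nothing" heuristic from the last point can be sketched as a thin wrapper around any policy. This is only an illustration: `make_gated_policy`, `rho_threshold` and the stand-in policy are hypothetical names that are not part of grid2op or l2rpn-baselines. With grid2op, `max_rho` would typically be `obs.rho.max()` and the do-nothing action `env.action_space({})`.

```python
from typing import Any, Callable


def make_gated_policy(policy: Callable[[Any], Any],
                      do_nothing_action: Any,
                      rho_threshold: float = 0.95) -> Callable[[Any, float], Any]:
    """Return a policy that only queries `policy` when the grid looks stressed.

    `rho_threshold` is a hypothetical tuning knob (maximum line loading ratio).
    """
    def gated_policy(obs: Any, max_rho: float) -> Any:
        # grid considered safe: skip the learned policy entirely
        if max_rho < rho_threshold:
            return do_nothing_action
        # at least one line is close to its limit: let the learned policy decide
        return policy(obs)
    return gated_policy


# stand-ins, only there to make the sketch runnable on its own
def learned_policy(obs: Any) -> str:
    return "learned_action"


gated = make_gated_policy(learned_policy, do_nothing_action="do_nothing")
print(gated(None, 0.50))  # lightly loaded grid
print(gated(None, 1.02))  # one line above its limit
```

The threshold value is a design choice to tune per environment; a lower value calls the learned policy more often.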

And finally, don't hesitate to check the solutions proposed by winners of past l2rpn competitions in l2rpn-baselines.

You can also ask questions on our discord or on our github.

1 Create the "Grid2opEnvWrapper" class

In the next cell, we define a custom environment (that will internally use the GymEnv grid2op class).

Indeed, in order to work with ray / rllib you need to define a custom wrapper on top of the GymEnv wrapper. You then have:

self._g2op_env which is the default grid2op environment, receiving grid2op Action and producing grid2op Observation.
self._gym_env which is the gymnasium environment defined by grid2op, which cannot be directly used with ray / rllib.
Grid2opEnvWrapper which is the wrapper on top of self._gym_env that makes it usable with ray / rllib.

Ray / rllib expects the gymnasium environment to inherit from gymnasium.Env and to be initialized with a given configuration. This is why you need to create the Grid2opEnvWrapper wrapper on top of GymEnv.

In the initialization of Grid2opEnvWrapper, the env_config variable is a dictionary that accepts the following keys:

backend_cls : what is the class of the backend. If not provided, it will use LightSimBackend from the lightsim2grid package
backend_options: what options will be used to create the backend for your environment. Your backend will be created by calling backend_cls(**backend_options), for example if you want to build LightSimBackend(detailed_info_for_cascading_failure=False) you can pass {"backend_cls": LightSimBackend, "backend_options": {"detailed_info_for_cascading_failure": False}}
env_name : name of the grid2op environment you want to use, by default it uses "l2rpn_case14_sandbox"
env_is_test : whether to pass test=True when creating the grid2op environment. If env_is_test is True, the environment is created with grid2op.make(..., test=True); otherwise it uses test=False (the default).
obs_attr_to_keep : in this wrapper we only allow your agent to see a Box as an observation. This parameter allows you to control which attributes of the grid2op observation will be present in the agent observation space. By default it's ["rho", "p_or", "gen_p", "load_p"] which is "kind of random" and is probably not suited for every agent.
act_type : controls the type of actions your agent will be able to perform. Already coded in this notebook are:
- "discrete" to use a Discrete action space
- "box" to use a Box action space
- "multi_discrete" to use a MultiDiscrete action space
act_attr_to_keep : that allows you to customize the action space. If not provided, it defaults to:
- ["set_line_status_simple", "set_bus"] if act_type is "discrete"
- ["redispatch", "set_storage", "curtail"] if act_type is "box"
- ["one_line_set", "one_sub_set"] if act_type is "multi_discrete"

If you want to add more customization, for example the reward function, the parameters of the environment etc., feel free to get inspired by this code and extend it. Any PR in this regard is more than welcome.
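As an illustration of the keys listed above, here is a possible configuration together with a small sanity check. The wrapper itself silently ignores keys it does not know, so a typo in a key name would go unnoticed; `check_env_config` is a hypothetical helper (not part of grid2op or ray) that fails early instead. The attribute lists are arbitrary examples.

```python
# keys understood by the Grid2opEnvWrapper described above
ALLOWED_KEYS = {"backend_cls", "backend_options", "env_name", "env_is_test",
                "obs_attr_to_keep", "act_type", "act_attr_to_keep"}
ALLOWED_ACT_TYPES = {"discrete", "box", "multi_discrete"}


def check_env_config(env_config: dict) -> dict:
    """Hypothetical helper: fail early on typos in the configuration."""
    unknown = set(env_config) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown env_config keys: {sorted(unknown)}")
    # "discrete" is the default used by the wrapper when act_type is absent
    act_type = env_config.get("act_type", "discrete")
    if act_type not in ALLOWED_ACT_TYPES:
        raise ValueError(f"unsupported act_type: {act_type!r}")
    return env_config


example_env_config = check_env_config({
    "env_name": "l2rpn_case14_sandbox",
    "env_is_test": True,
    "obs_attr_to_keep": ["rho", "gen_p", "load_p"],
    "act_type": "discrete",
    "act_attr_to_keep": ["set_line_status_simple"],
})
print(example_env_config)
```

Such a dictionary can then be passed as `env_config` to the wrapper.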

In [ ]:

from gymnasium import Env
from gymnasium.spaces import Discrete, MultiDiscrete, Box
import json

import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms import ppo

from typing import Dict, Literal, Any
import copy

import grid2op
from grid2op.gym_compat import GymEnv, BoxGymObsSpace, DiscreteActSpace, BoxGymActSpace, MultiDiscreteActSpace
from lightsim2grid import LightSimBackend


class Grid2opEnvWrapper(Env):
    def __init__(self,
                 env_config: Dict[Literal["backend_cls",
                                          "backend_options",
                                          "env_name",
                                          "env_is_test",
                                          "obs_attr_to_keep",
                                          "act_type",
                                          "act_attr_to_keep"],
                                  Any]= None):
        super().__init__()
        if env_config is None:
            env_config = {}

        # handle the backend
        backend_cls = LightSimBackend
        if "backend_cls" in env_config:
            backend_cls = env_config["backend_cls"]
        backend_options = {}
        if "backend_options" in env_config:
            backend_options = env_config["backend_options"]
        backend = backend_cls(**backend_options)

        # create the grid2op environment
        env_name = "l2rpn_case14_sandbox"
        if "env_name" in env_config:
            env_name = env_config["env_name"]
        if "env_is_test" in env_config:
            is_test = bool(env_config["env_is_test"])
        else:
            is_test = False
        self._g2op_env = grid2op.make(env_name, backend=backend, test=is_test)
        # NB by default this might be really slow (when the environment is reset)
        # see https://grid2op.readthedocs.io/en/latest/data_pipeline.html for maybe 10x speed ups !
        # TODO customize reward or action_class for example !

        # create the gym env (from grid2op)
        self._gym_env = GymEnv(self._g2op_env)

        # customize observation space
        obs_attr_to_keep = ["rho", "p_or", "gen_p", "load_p"]
        if "obs_attr_to_keep" in env_config:
            obs_attr_to_keep = copy.deepcopy(env_config["obs_attr_to_keep"])
        self._gym_env.observation_space.close()
        self._gym_env.observation_space = BoxGymObsSpace(self._g2op_env.observation_space,
                                                         attr_to_keep=obs_attr_to_keep
                                                         )
        # export observation space for the Grid2opEnv
        self.observation_space = Box(shape=self._gym_env.observation_space.shape,
                                     low=self._gym_env.observation_space.low,
                                     high=self._gym_env.observation_space.high)

        # customize the action space
        act_type = "discrete"
        if "act_type" in env_config:
            act_type = env_config["act_type"]

        self._gym_env.action_space.close()
        if act_type == "discrete":
            # user wants a discrete action space
            act_attr_to_keep =  ["set_line_status_simple", "set_bus"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = DiscreteActSpace(self._g2op_env.action_space,
                                                          attr_to_keep=act_attr_to_keep)
            self.action_space = Discrete(self._gym_env.action_space.n)
        elif act_type == "box":
            # user wants continuous action space
            act_attr_to_keep =  ["redispatch", "set_storage", "curtail"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = BoxGymActSpace(self._g2op_env.action_space,
                                                        attr_to_keep=act_attr_to_keep)
            self.action_space = Box(shape=self._gym_env.action_space.shape,
                                    low=self._gym_env.action_space.low,
                                    high=self._gym_env.action_space.high)
        elif act_type == "multi_discrete":
            # user wants a multi-discrete action space
            act_attr_to_keep = ["one_line_set", "one_sub_set"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = MultiDiscreteActSpace(self._g2op_env.action_space,
                                                               attr_to_keep=act_attr_to_keep)
            self.action_space = MultiDiscrete(self._gym_env.action_space.nvec)
        else:
            raise NotImplementedError(f"action type '{act_type}' is not currently supported.")
            
    def reset(self, seed=None, options=None):
        # use default _gym_env (from grid2op.gym_compat module)
        # NB: here you can also specify "default options" when you reset, for example:
        # - limiting the duration of the episode "max step"
        # - starting at different steps  "init ts"
        # - study difficult scenario   "time serie id"
        # - specify an initial state of your grid "init state"
        return self._gym_env.reset(seed=seed, options=options)
        
    def step(self, action):
        # use default _gym_env (from grid2op.gym_compat module)
        return self._gym_env.step(action)
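
The comments in reset mention that "default options" can be specified. One possible way to do it, sketched below, is to merge defaults with whatever the caller passes. `merge_reset_options` is a hypothetical helper and the value 288 is only illustrative; the option names ("max step", "time serie id") are the ones listed in the comments above.

```python
from typing import Any, Dict, Optional


def merge_reset_options(default_options: Dict[str, Any],
                        options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: values passed to reset() win over the defaults."""
    merged = dict(default_options)  # copy, so the defaults are never mutated
    if options:
        merged.update(options)
    return merged


defaults = {"max step": 288}  # illustrative value
merged = merge_reset_options(defaults, {"time serie id": 0})
print(merged)
```

Inside the wrapper, reset could then forward `merge_reset_options(defaults, options)` to `self._gym_env.reset(...)` instead of `options`.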
        

2 Create an environment, and train a first policy

Now we init ray, because we need to.

In [ ]:

ray.init()

In [ ]:

# example of the documentation, directly
# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Construct a generic config object, specifying values within different
# sub-categories, e.g. "training".
env_config = {}
config = (PPOConfig().training(gamma=0.9, lr=0.01)
          .environment(env=Grid2opEnvWrapper, env_config=env_config)
          .resources(num_gpus=0)
          .env_runners(num_env_runners=0)
          .framework("tf2")
         )

# A config object can be used to construct the respective Algorithm.
rllib_algo = config.build()

Now we train it for one training iteration (might call env.reset() and env.step() multiple times, see ray's documentation for a better understanding of what happens here and don't hesitate to open an issue or a PR to explain it and we'll add it here, thanks)

In [ ]:

print(rllib_algo.train())

3 Evaluate the trained agent

This notebook is a simple quick introduction to ray / rllib only. So we don't really recall everything that has been said previously.

Please consult the section 0) Recommended initial steps of the notebook 11_IntegrationWithExistingRLFrameworks for more information.

TL;DR: grid2op offers the possibility to test your agent on scenarios / episodes different from the ones it has been trained on. We greatly encourage you to use this functionality.

There are two main ways to evaluate your agent:

you stay in the "gymnasium" world (see section 3.1) and you evaluate your policy directly, just like you would with any other gymnasium compatible environment. Simple and easy, but without support for some grid2op features.
you "get back" to the "grid2op" world (detailed in section 3.2) by "converting" your NN policy into something that is able to output grid2op-like actions. This introduces yet again a "wrapper", but you can benefit from all grid2op features, such as the Runner to save and inspect what your policy has done.

We show here just a simple example to "get easily started". For much better working agents, you can have a look at the l2rpn-baselines code. There you have classes that map the environment, the agents etc. to grid2op directly (you don't have to copy paste any wrapper).

3.1 staying in the gymnasium ecosystem

You can do pretty much what you want, but you have to do it yourself, or use any of the "Wrappers" available in gymnasium https://gymnasium.farama.org/main/api/wrappers/ (eg https://gymnasium.farama.org/main/api/wrappers/misc_wrappers/#gymnasium.wrappers.RecordEpisodeStatistics) or in your RL framework.

For the sake of simplicity, we show how to do things "manually" even though we do not recommend to do it like that.

In [ ]:

nb_episode_test = 2
seeds_test_env = (0, 1)    # same size as nb_episode_test
seeds_test_agent = (3, 4)  # same size as nb_episode_test
ts_ep_test =  (0, 1)       # same size as nb_episode_test
gym_env = Grid2opEnvWrapper(env_config)

In [ ]:

ep_infos = {}  # information that will be saved


for ep_test_num in range(nb_episode_test):
    init_obs, init_infos = gym_env.reset(seed=seeds_test_env[ep_test_num],
                                         options={"time serie id": ts_ep_test[ep_test_num]})
    # TODO seed the agent, I did not find in the ray doc how to do it
    done = False
    cum_reward = 0
    step_survived = 0
    obs = init_obs
    while not done:
        act = rllib_algo.compute_single_action(obs, explore=False)
        obs, reward, terminated, truncated, info = gym_env.step(act)
        step_survived += 1
        cum_reward += float(reward)
        done = terminated or truncated
    ep_infos[ep_test_num] = {"time serie id": ts_ep_test[ep_test_num],
                             "time serie folder": gym_env._gym_env.init_env.chronics_handler.get_id(),
                             "env seed": seeds_test_env[ep_test_num],
                             "agent seed": seeds_test_agent[ep_test_num],
                             "steps survived": step_survived,
                             "total steps": int(gym_env._gym_env.init_env.max_episode_duration()),
                             "cum reward": cum_reward}

In [ ]:

# pretty-print the dictionary above

print(json.dumps(ep_infos, indent=4))

As you might have seen, it's not easy this way to retrieve useful information about the grid2op environment if this information is not passed to the policy.

For example, we need to call gym_env._gym_env.init_env to access the underlying grid2op environment. You have to convert some values from int32 or float32 to int or float, otherwise json complains, you have to control the seeds yourself to have reproducible results, etc.

It's a quick way to have something working, but it could be improved.
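Regarding the json conversion mentioned above, one possible alternative to casting every value by hand is to give `json.dumps` a fallback function. The sketch below relies on the fact that numpy scalars (float32, int32...) expose an `.item()` method returning the equivalent python value. `to_builtin` is a hypothetical helper, and `FakeFloat32` is a stand-in used only so that the sketch runs without numpy.

```python
import json


def to_builtin(value):
    """Fallback for json.dumps: unwrap numpy-like scalars via `.item()`."""
    if hasattr(value, "item"):
        return value.item()
    raise TypeError(f"{type(value).__name__} is not JSON serializable")


class FakeFloat32:
    """Stand-in for a numpy scalar so the sketch runs without numpy."""
    def __init__(self, v):
        self._v = v

    def item(self):
        return self._v


# `default` is only called for objects json cannot serialize by itself
print(json.dumps({"cum reward": FakeFloat32(12.5)}, default=to_builtin))
```

With such a fallback, the explicit `float(...)` and `int(...)` conversions in the evaluation loop become optional.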

3.2 using the grid2op ecosystem

This second method is closer to the grid2op ecosystem: you will be able to use it with the grid2op Runner, save the results and read them back with other tools such as grid2viz, and do the evaluation in parallel without too much trouble (and with high reproducibility).

With this method, you build a grid2op agent and this agent can then be used like every other grid2op agent. For example you can compare it with heuristic agents, agents based on optimization etc.

This way of doing things also allows you to customize when the neural network policy is used. For example, you might choose to use it only when the grid is "unsafe" (and use "expert" rules when the grid is safe).

This is more flexible than the previous one.

In [ ]:

from grid2op.Agent import BaseAgent
from grid2op.Runner import Runner

class Grid2opAgentWrapper(BaseAgent):
    def __init__(self,
                 gym_env: Grid2opEnvWrapper,
                 trained_agent):
        self.gym_env = gym_env
        BaseAgent.__init__(self, gym_env._gym_env.init_env.action_space)
        self.trained_agent = trained_agent
        
    def act(self, obs, reward, done):
        # you can customize it here to call the NN policy `trained_agent`
        # only in some cases, depending on the observation for example
        gym_obs = self.gym_env._gym_env.observation_space.to_gym(obs)
        gym_act = self.trained_agent.compute_single_action(gym_obs, explore=False)
        grid2op_act = self.gym_env._gym_env.action_space.from_gym(gym_act)
        return grid2op_act
    
    def seed(self, seed):
        # implement the seed function
        # TODO
        return

In [ ]:

my_agent = Grid2opAgentWrapper(gym_env, rllib_algo)
runner = Runner(**gym_env._g2op_env.get_params_for_runner(),
                agentClass=None,
                agentInstance=my_agent)

In [ ]:

res = runner.run(nb_episode=nb_episode_test,
                 env_seeds=seeds_test_env,
                 agent_seeds=seeds_test_agent,
                 episode_id=ts_ep_test,
                 add_detailed_output=True
                 )

In [ ]:

res
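
The list returned by the runner can be condensed into a short summary. The sketch below assumes that each entry starts with (scenario path, scenario name, cumulated reward, number of steps survived, maximum number of steps), which is the layout given in the grid2op Runner documentation as far as we know, and that any extra element (such as the detailed output requested above) comes after. `summarize_runner_results` is a hypothetical helper and `fake_res` contains made-up values.

```python
from typing import Iterable, List, Tuple


def summarize_runner_results(results: Iterable[tuple]) -> List[Tuple[str, float, int, int]]:
    """Keep (scenario name, cumulated reward, steps survived, max steps)."""
    summary = []
    # `*_details` absorbs any trailing element (e.g. the detailed episode data)
    for _path, name, cum_reward, nb_step, max_step, *_details in results:
        summary.append((name, float(cum_reward), int(nb_step), int(max_step)))
    return summary


# made-up results with the assumed layout, only to make the sketch runnable
fake_res = [("/path/to/0000", "0000", 1234.5, 288, 8064, object())]
for name, cum_reward, nb_step, max_step in summarize_runner_results(fake_res):
    print(f"{name}: reward={cum_reward:.1f}, survived {nb_step}/{max_step} steps")
```

In the notebook, `summarize_runner_results(res)` would be called on the `res` variable computed above.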

4 Some customizations

4.1 Train a PPO agent using 2 "runners" to make the rollouts

In this second example, we explain briefly how to train the model using 2 "processes". That is, the agent will interact with 2 environments at the same time during the "rollout" phases.

But everything related to the training of the agent is still done on the main process (and in this case not using a GPU but only a CPU).

In [ ]:

# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# use multiple runners
config2 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config={})
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
          )

# A config object can be used to construct the respective Algorithm.
rllib_algo2 = config2.build()

Now we train it for one training iteration (might call env.reset() and env.step() multiple times)

In [ ]:

print(rllib_algo2.train())

4.2 Use non default parameters to make the grid2op environment

In this third example, we will train a policy using the "box" action space, and on another environment (l2rpn_idf_2023 instead of l2rpn_case14_sandbox)

In [ ]:

# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Use a "Box" action space (mainly to use redispatching, curtailment and storage units)
env_config3 = {"env_name": "l2rpn_idf_2023",
               "env_is_test": True,
               "act_type": "box",
              }
config3 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config=env_config3)
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
          )

# A config object can be used to construct the respective Algorithm.
rllib_algo3 = config3.build()

Now we train it for one training iteration (might call env.reset() and env.step() multiple times)

In [ ]:

print(rllib_algo3.train())

And now a policy using the "multi discrete" action space:

In [ ]:

# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Use a "MultiDiscrete" action space (default attributes: "one_line_set" and "one_sub_set")
env_config4 = {"env_name": "l2rpn_idf_2023",
               "env_is_test": True,
               "act_type": "multi_discrete",
               }
config4 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config=env_config4)
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
          )

# A config object can be used to construct the respective Algorithm.
rllib_algo4 = config4.build()

Now we train it for one training iteration (might call env.reset() and env.step() multiple times)

In [ ]:

print(rllib_algo4.train())

4.3 Customize the policy (number of layers, size of layers etc.)

This notebook does not aim at covering all possibilities offered by ray / rllib. For that you need to refer to the ray / rllib documentation.

We will simply show how to change the size of the neural network used as a policy.

In [ ]:

# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Use the default environment configuration, but change the size of the policy network
config5 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config={})
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
           .rl_module(
             model_config_dict={"fcnet_hiddens": [32, 32, 32]},  # 3 layers (fully connected) of 32 units each
           )
          )

# A config object can be used to construct the respective Algorithm.
rllib_algo5 = config5.build()

Now we train it for one training iteration (might call env.reset() and env.step() multiple times)

In [ ]:

print(rllib_algo5.train())
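
To get a feel for how "fcnet_hiddens" changes the size of the policy, one can count the weights of a plain fully connected stack. This is a rough sketch under simplifying assumptions: it ignores anything RLlib adds on top (value head, etc.), `count_mlp_params` is a hypothetical helper, and `obs_dim` / `n_actions` are illustrative values that are not read from the environment.

```python
from typing import Sequence


def count_mlp_params(input_dim: int, hiddens: Sequence[int], output_dim: int) -> int:
    """Weights + biases of a plain fully connected stack (no extra heads)."""
    total = 0
    previous = input_dim
    for width in list(hiddens) + [output_dim]:
        total += previous * width + width  # weight matrix + bias vector
        previous = width
    return total


# illustrative values, not read from the environment
obs_dim, n_actions = 57, 10
for hiddens in ([32, 32, 32], [256, 256]):
    print(hiddens, count_mlp_params(obs_dim, hiddens, n_actions))
```

Smaller networks train faster per iteration but may lack capacity for larger observation or action spaces.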