Objectives: this notebook briefly explains how to use grid2op with the ray (rllib) RL framework. Make sure to read the previous notebook 11_IntegrationWithExistingRLFrameworks.ipynb for a deeper dive into what happens. We only show the working solution here.
This notebook explains the ideas and shows a "self-contained", somewhat minimal example of using the ray / rllib framework with grid2op. It is not meant to be fully generic, and the code might need to be adjusted.
This notebook is more an "example of what works" rather than a deep dive tutorial.
See https://docs.ray.io/en/latest/rllib/rllib-env.html#configuring-environments for more detailed information.
See also https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html for other details.
This notebook is tested with grid2op 1.10.2 and ray 2.24.0 (python3.10) on an ubuntu 20.04 machine.
We found that ray is highly "unstable": its documentation does not really keep up with its development rhythm. Basically, this notebook works with the exact python and ray versions given above. If you change them, you might need to modify the calls to ray.
It is unlikely that "simply" using a RL algorithm on a grid2op environment will lead to good results for the vast majority of environments.
To make RL algorithms work with more or less success, you might want to:
- adjust the observation space: in particular, select the right information for your agent. With too much information, the size of the observation space blows up and your agent will not learn anything. With not enough information, your agent will not be able to capture anything.
- customize the action space: dealing with both discrete and continuous values is often a challenge, so you may want to focus on a single type of action. In all cases, try to reduce the number of actions your agent can perform. Indeed, for "larger" grids (118 substations; as a reference, the French grid counts more than 6,000 such substations), and even when limited to 2 busbars per substation (as a reference, some substations have more than 12 such "busbars"), your agent can choose between more than 60,000 different discrete actions at each step. As far as we know, this is far too large for current RL algorithms (and the proposed environments are small compared to real ones).
- customize the reward: the default reward might not work great for you. Ultimately, what TSOs or ISOs want is to operate the grid safely, for as long as possible, at a cost as low as possible. It is of course really hard to capture all of this in one single reward signal. Customizing the reward is also really important because the "do nothing" policy often leads to really good results (much better than random actions), which makes exploring different actions difficult. So you want to incentivize your agent to perform some actions at some point.
- use a fast simulator: even if you target an industrial application with industry-grade simulators, we would still advise you to use a fast simulator for the vast majority of the training process (at least in the early stages of training), and then maybe fine-tune on a better one.
- combine RL with some heuristics: it is very easy to implement things like "if there is no issue, then do nothing", whereas this can be quite time consuming to learn. Don't hesitate to check out the "l2rpn-baselines" repository for heuristics that are already "kind of working".
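The "if there is no issue, then do nothing" heuristic can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `act_with_do_nothing_heuristic`, `dummy_policy` and the action labels are hypothetical names, `rho` is assumed to hold the loading ratio of each powerline (as in the "rho" observation attribute), and the 0.9 threshold is an arbitrary value chosen for the example.

```python
from typing import Any, Callable, Sequence


def act_with_do_nothing_heuristic(rho: Sequence[float],
                                  policy: Callable[[Sequence[float]], Any],
                                  do_nothing_action: Any,
                                  rho_threshold: float = 0.9) -> Any:
    """Return `do_nothing_action` while every line loading is below
    `rho_threshold`; otherwise defer to the (learned) `policy`."""
    if max(rho) < rho_threshold:
        # the grid is considered "safe": no need to query the policy
        return do_nothing_action
    return policy(rho)


def dummy_policy(rho: Sequence[float]) -> str:
    # hypothetical stand-in for a trained policy
    return "reconfigure"


safe_choice = act_with_do_nothing_heuristic([0.3, 0.5, 0.7], dummy_policy, "do_nothing")
unsafe_choice = act_with_do_nothing_heuristic([0.3, 0.95, 0.7], dummy_policy, "do_nothing")
print(safe_choice, unsafe_choice)
```

With such a gate, the policy is only queried (and only needs to be trained) on the time steps where at least one line is close to its limit.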
And finally, don't hesitate to check the solutions proposed by winners of past l2rpn competitions in l2rpn-baselines.
You can also ask questions on our discord or on our github.
In the next cell, we define a custom environment (that will internally use the GymEnv grid2op class). It is not strictly needed.
Indeed, in order to work with ray / rllib you need to define a custom wrapper on top of the GymEnv wrapper. You then have:
- self._g2op_env, which is the grid2op environment
- self._gym_env, which is the grid2op-defined gymnasium Environment that cannot be directly used with ray / rllib
- Grid2opEnvWrapper, which is the wrapper on top of self._gym_env to make it usable with ray / rllib.
Ray / rllib expects the gymnasium environment to inherit from gymnasium.Env and to be initialized with a given configuration. This is why you need to create the Grid2opEnvWrapper wrapper on top of GymEnv.
In the initialization of Grid2opEnvWrapper, the env_config variable is a dictionary that accepts the following keys:
- backend_cls: the class of the backend. If not provided, it will use LightSimBackend from the lightsim2grid package.
- backend_options: the options used to create the backend for your environment. Your backend will be created by calling backend_cls(**backend_options). For example, if you want to build LightSimBackend(detailed_info_for_cascading_failure=False), you can pass {"backend_cls": LightSimBackend, "backend_options": {"detailed_info_for_cascading_failure": False}}.
- env_name: name of the grid2op environment you want to use. By default it uses "l2rpn_case14_sandbox".
- env_is_test: whether to add test=True when creating the grid2op environment (if env_is_test is True, it calls grid2op.make(..., test=True)); otherwise it uses test=False.
- obs_attr_to_keep: in this wrapper we only allow your agent to see a Box as an observation. This parameter allows you to control which attributes of the grid2op observation will be present in the agent observation space. By default it is ["rho", "p_or", "gen_p", "load_p"], which is "kind of random" and is probably not suited for every agent.
- act_type: controls the type of actions your agent will be able to perform. Already coded in this notebook are:
  - "discrete" to use a Discrete action space
  - "box" to use a Box action space
  - "multi_discrete" to use a MultiDiscrete action space
- act_attr_to_keep: allows you to customize the action space. If not provided, it defaults to:
  - ["set_line_status_simple", "set_bus"] if act_type is "discrete"
  - ["redispatch", "set_storage", "curtail"] if act_type is "box"
  - ["one_line_set", "one_sub_set"] if act_type is "multi_discrete"
If you want to add more customization (for example the reward function, the parameters of the environment, etc.), feel free to get inspired by this code and extend it. Any PR in this regard is more than welcome.
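The defaults listed above can be summarized in a small standalone helper. This is only an illustrative sketch: `resolve_env_config` and `DEFAULT_ACT_ATTR` are hypothetical names that exist neither in grid2op nor in ray, and the `backend_cls` key is deliberately left out so that the snippet runs without lightsim2grid installed.

```python
import copy
from typing import Any, Dict, Optional

# default action attributes for each supported "act_type" (values taken from the text above)
DEFAULT_ACT_ATTR = {
    "discrete": ["set_line_status_simple", "set_bus"],
    "box": ["redispatch", "set_storage", "curtail"],
    "multi_discrete": ["one_line_set", "one_sub_set"],
}


def resolve_env_config(env_config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the effective settings once the defaults are filled in
    (hypothetical helper, `backend_cls` is not handled here)."""
    cfg = dict(env_config or {})
    act_type = cfg.get("act_type", "discrete")
    if act_type not in DEFAULT_ACT_ATTR:
        raise NotImplementedError(f"action type '{act_type}' is not currently supported.")
    return {
        "backend_options": cfg.get("backend_options", {}),
        "env_name": cfg.get("env_name", "l2rpn_case14_sandbox"),
        "env_is_test": bool(cfg.get("env_is_test", False)),
        "obs_attr_to_keep": copy.deepcopy(
            cfg.get("obs_attr_to_keep", ["rho", "p_or", "gen_p", "load_p"])),
        "act_type": act_type,
        "act_attr_to_keep": copy.deepcopy(
            cfg.get("act_attr_to_keep", DEFAULT_ACT_ATTR[act_type])),
    }


resolved = resolve_env_config({"act_type": "box"})
print(resolved["env_name"], resolved["act_attr_to_keep"])
```

Printing the resolved dictionary before building an environment is a cheap way to check which observation and action attributes will actually be used.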
from gymnasium import Env
from gymnasium.spaces import Discrete, MultiDiscrete, Box
import json
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms import ppo
from typing import Dict, Literal, Any
import copy
import grid2op
from grid2op.gym_compat import GymEnv, BoxGymObsSpace, DiscreteActSpace, BoxGymActSpace, MultiDiscreteActSpace
from lightsim2grid import LightSimBackend
class Grid2opEnvWrapper(Env):
    def __init__(self,
                 env_config: Dict[Literal["backend_cls",
                                          "backend_options",
                                          "env_name",
                                          "env_is_test",
                                          "obs_attr_to_keep",
                                          "act_type",
                                          "act_attr_to_keep"],
                                  Any] = None):
        super().__init__()
        if env_config is None:
            env_config = {}

        # handle the backend
        backend_cls = LightSimBackend
        if "backend_cls" in env_config:
            backend_cls = env_config["backend_cls"]
        backend_options = {}
        if "backend_options" in env_config:
            backend_options = env_config["backend_options"]
        backend = backend_cls(**backend_options)

        # create the grid2op environment
        env_name = "l2rpn_case14_sandbox"
        if "env_name" in env_config:
            env_name = env_config["env_name"]
        if "env_is_test" in env_config:
            is_test = bool(env_config["env_is_test"])
        else:
            is_test = False
        self._g2op_env = grid2op.make(env_name, backend=backend, test=is_test)
        # NB by default this might be really slow (when the environment is reset)
        # see https://grid2op.readthedocs.io/en/latest/data_pipeline.html for maybe 10x speed ups !
        # TODO customize reward or action_class for example !

        # create the gym env (from grid2op)
        self._gym_env = GymEnv(self._g2op_env)

        # customize observation space
        obs_attr_to_keep = ["rho", "p_or", "gen_p", "load_p"]
        if "obs_attr_to_keep" in env_config:
            obs_attr_to_keep = copy.deepcopy(env_config["obs_attr_to_keep"])
        self._gym_env.observation_space.close()
        self._gym_env.observation_space = BoxGymObsSpace(self._g2op_env.observation_space,
                                                         attr_to_keep=obs_attr_to_keep
                                                         )
        # export observation space for the Grid2opEnv
        self.observation_space = Box(shape=self._gym_env.observation_space.shape,
                                     low=self._gym_env.observation_space.low,
                                     high=self._gym_env.observation_space.high)

        # customize the action space
        act_type = "discrete"
        if "act_type" in env_config:
            act_type = env_config["act_type"]

        self._gym_env.action_space.close()
        if act_type == "discrete":
            # user wants a discrete action space
            act_attr_to_keep = ["set_line_status_simple", "set_bus"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = DiscreteActSpace(self._g2op_env.action_space,
                                                          attr_to_keep=act_attr_to_keep)
            self.action_space = Discrete(self._gym_env.action_space.n)
        elif act_type == "box":
            # user wants continuous action space
            act_attr_to_keep = ["redispatch", "set_storage", "curtail"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = BoxGymActSpace(self._g2op_env.action_space,
                                                        attr_to_keep=act_attr_to_keep)
            self.action_space = Box(shape=self._gym_env.action_space.shape,
                                    low=self._gym_env.action_space.low,
                                    high=self._gym_env.action_space.high)
        elif act_type == "multi_discrete":
            # user wants a multi-discrete action space
            act_attr_to_keep = ["one_line_set", "one_sub_set"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = MultiDiscreteActSpace(self._g2op_env.action_space,
                                                               attr_to_keep=act_attr_to_keep)
            self.action_space = MultiDiscrete(self._gym_env.action_space.nvec)
        else:
            raise NotImplementedError(f"action type '{act_type}' is not currently supported.")

    def reset(self, seed=None, options=None):
        # use default _gym_env (from grid2op.gym_compat module)
        # NB: here you can also specify "default options" when you reset, for example:
        # - limiting the duration of the episode "max step"
        # - starting at different steps "init ts"
        # - study difficult scenario "time serie id"
        # - specify an initial state of your grid "init state"
        return self._gym_env.reset(seed=seed, options=options)

    def step(self, action):
        # use default _gym_env (from grid2op.gym_compat module)
        return self._gym_env.step(action)
Now we init ray, because we need to.
ray.init()
# example of the documentation, directly
# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html
# Construct a generic config object, specifying values within different
# sub-categories, e.g. "training".
env_config = {}
config = (PPOConfig().training(gamma=0.9, lr=0.01)
.environment(env=Grid2opEnvWrapper, env_config=env_config)
.resources(num_gpus=0)
.env_runners(num_env_runners=0)
.framework("tf2")
)
# A config object can be used to construct the respective Algorithm.
rllib_algo = config.build()
Now we train it for one training iteration (this might call env.reset() and env.step() multiple times; see ray's documentation for a better understanding of what happens here, and don't hesitate to open an issue or a PR to explain it and we'll add it here, thanks).
print(rllib_algo.train())
This notebook is a simple, quick introduction to ray / rllib only, so we do not recall everything that has been said previously.
Please consult the section "0) Recommended initial steps" of the notebook 11_IntegrationWithExistingRLFrameworks for more information.
TL;DR: grid2op offers the possibility to test your agent on scenarios / episodes different from the ones it has been trained on. We greatly encourage you to use this functionality.
There are two main ways to evaluate your agent:
- stay in the gymnasium world and evaluate your policy manually, as you would with any other gymnasium environment;
- convert your trained policy into a grid2op agent, which lets you use grid2op tools such as the Runner to save and inspect what your policy has done.
We show here just simple examples to "get started easily". For much better working agents, you can have a look at the l2rpn-baselines code. There you have classes that map the environment, the agents etc. to grid2op directly (you don't have to copy-paste any wrapper).
You can do pretty much what you want, but you have to do it yourself, or use any of the "Wrappers" available in gymnasium https://gymnasium.farama.org/main/api/wrappers/ (eg https://gymnasium.farama.org/main/api/wrappers/misc_wrappers/#gymnasium.wrappers.RecordEpisodeStatistics) or in your RL framework.
For the sake of simplicity, we show how to do things "manually" even though we do not recommend to do it like that.
nb_episode_test = 2
seeds_test_env = (0, 1) # same size as nb_episode_test
seeds_test_agent = (3, 4) # same size as nb_episode_test
ts_ep_test = (0, 1) # same size as nb_episode_test
gym_env = Grid2opEnvWrapper(env_config)
ep_infos = {} # information that will be saved
for ep_test_num in range(nb_episode_test):
    init_obs, init_infos = gym_env.reset(seed=seeds_test_env[ep_test_num],
                                         options={"time serie id": ts_ep_test[ep_test_num]})
    # TODO seed the agent, I did not find in the ray doc how to do it
    done = False
    cum_reward = 0
    step_survived = 0
    obs = init_obs
    while not done:
        act = rllib_algo.compute_single_action(obs, explore=False)
        obs, reward, terminated, truncated, info = gym_env.step(act)
        step_survived += 1
        cum_reward += float(reward)
        done = terminated or truncated
    ep_infos[ep_test_num] = {"time serie id": ts_ep_test[ep_test_num],
                             "time serie folder": gym_env._gym_env.init_env.chronics_handler.get_id(),
                             "env seed": seeds_test_env[ep_test_num],
                             "agent seed": seeds_test_agent[ep_test_num],
                             "steps survived": step_survived,
                             "total steps": int(gym_env._gym_env.init_env.max_episode_duration()),
                             "cum reward": cum_reward}

# "prettyprint" the dictionary above
print(json.dumps(ep_infos, indent=4))
As you might have seen, it is not easy this way to retrieve useful information about the grid2op environment if this information is not passed to the policy.
For example, we need to call gym_env._gym_env.init_env to access the underlying grid2op environment. You also have to convert some values from int32 or float32 to int or float, otherwise json complains, and you have to control the seeds yourself to get reproducible results, etc.
It is a quick way to have something working, but it could be improved.
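One possible way to avoid converting each value by hand is to give json.dumps a fallback function through its `default` argument. The sketch below is an assumption-laden illustration rather than the method used in this notebook: `to_builtin` is a hypothetical helper that relies on the `tolist()` method exposed by NumPy scalars and arrays, and a standard-library `array` of 32-bit floats stands in for NumPy values so that the snippet runs on its own.

```python
import json
from array import array


def to_builtin(value):
    """Fallback for json.dumps: convert objects exposing `tolist()`
    (NumPy scalars and arrays, stdlib arrays) to plain Python values."""
    if hasattr(value, "tolist"):
        return value.tolist()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# stand-in for episode information containing float32 values
record = {"cum reward": array("f", [1.5, 2.0]), "steps survived": 3}
text = json.dumps(record, default=to_builtin)
print(text)
```

json.dumps only calls `default` for objects it cannot serialize itself, so regular Python int, float and str values are unaffected.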
This second method brings it closer to the grid2op ecosystem: you will be able to use it with the grid2op Runner, save the results, read them back with other tools such as grid2viz, and do the evaluation in parallel without too much trouble (and with high reproducibility).
With this method, you build a grid2op agent and this agent can then be used like every other grid2op agent. For example you can compare it with heuristic agents, agent based on optimization etc.
This way of doing things also allows you to customize when the neural network policy is used. For example, you might choose to use it only when the grid is "unsafe" (and use "expert" rules when the grid is safe).
This is more flexible than the previous one.
from grid2op.Agent import BaseAgent
from grid2op.Runner import Runner
class Grid2opAgentWrapper(BaseAgent):
    def __init__(self,
                 gym_env: Grid2opEnvWrapper,
                 trained_agent):
        self.gym_env = gym_env
        BaseAgent.__init__(self, gym_env._gym_env.init_env.action_space)
        self.trained_agent = trained_agent

    def act(self, obs, reward, done):
        # you can customize it here to call the NN policy `trained_agent`
        # only in some cases, depending on the observation for example
        gym_obs = self.gym_env._gym_env.observation_space.to_gym(obs)
        gym_act = self.trained_agent.compute_single_action(gym_obs, explore=False)
        grid2op_act = self.gym_env._gym_env.action_space.from_gym(gym_act)
        return grid2op_act

    def seed(self, seed):
        # implement the seed function
        # TODO
        return
my_agent = Grid2opAgentWrapper(gym_env, rllib_algo)
runner = Runner(**gym_env._g2op_env.get_params_for_runner(),
agentClass=None,
agentInstance=my_agent)
res = runner.run(nb_episode=nb_episode_test,
env_seeds=seeds_test_env,
agent_seeds=seeds_test_agent,
episode_id=ts_ep_test,
add_detailed_output=True
)
res
In this second example, we explain briefly how to train the model using 2 "processes". That is, the agent will interact with 2 environments at the same time during the "rollout" phases.
But everything related to the training of the agent is still done on the main process (and in this case not using a GPU but only a CPU).
# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html
# use multiple runners
config2 = (PPOConfig().training(gamma=0.9, lr=0.01)
.environment(env=Grid2opEnvWrapper, env_config={})
.resources(num_gpus=0)
.env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
.framework("tf2")
)
# A config object can be used to construct the respective Algorithm.
rllib_algo2 = config2.build()
Now we train it for one training iteration (this might call env.reset() and env.step() multiple times).
print(rllib_algo2.train())
In this third example, we will train a policy using the "box" action space, and on another environment (l2rpn_idf_2023 instead of l2rpn_case14_sandbox).
# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html
# Use a "Box" action space (mainly to use redispatching, curtailment and storage units)
env_config3 = {"env_name": "l2rpn_idf_2023",
"env_is_test": True,
"act_type": "box",
}
config3 = (PPOConfig().training(gamma=0.9, lr=0.01)
.environment(env=Grid2opEnvWrapper, env_config=env_config3)
.resources(num_gpus=0)
.env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
.framework("tf2")
)
# A config object can be used to construct the respective Algorithm.
rllib_algo3 = config3.build()
Now we train it for one training iteration (this might call env.reset() and env.step() multiple times).
print(rllib_algo3.train())
And now a policy using the "multi discrete" action space:
# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html
# Use a "MultiDiscrete" action space (see the "multi_discrete" act_type above)
env_config4 = {"env_name": "l2rpn_idf_2023",
"env_is_test": True,
"act_type": "multi_discrete",
}
config4 = (PPOConfig().training(gamma=0.9, lr=0.01)
.environment(env=Grid2opEnvWrapper, env_config=env_config4)
.resources(num_gpus=0)
.env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
.framework("tf2")
)
# A config object can be used to construct the respective Algorithm.
rllib_algo4 = config4.build()
Now we train it for one training iteration (this might call env.reset() and env.step() multiple times).
print(rllib_algo4.train())
This notebook does not aim at covering all possibilities offered by ray / rllib. For that you need to refer to the ray / rllib documentation.
We will simply show how to change the size of the neural network used as a policy.
# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html
# Use the default (discrete) action space and change the size of the policy network
config5 = (PPOConfig().training(gamma=0.9, lr=0.01)
.environment(env=Grid2opEnvWrapper, env_config={})
.resources(num_gpus=0)
.env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
.framework("tf2")
.rl_module(
model_config_dict={"fcnet_hiddens": [32, 32, 32]}, # 3 layers (fully connected) of 32 units each
)
)
# A config object can be used to construct the respective Algorithm.
rllib_algo5 = config5.build()
Now we train it for one training iteration (this might call env.reset() and env.step() multiple times).
print(rllib_algo5.train())