Objectives
This notebook briefly explains how to use grid2op with the stable baselines 3 RL framework. Make sure to read the previous notebook 11_IntegrationWithExistingRLFrameworks for a deeper dive into what happens. We only show the working solution here.
It explains the ideas and shows a "self contained", somewhat minimal example of using the stable baselines 3 framework with grid2op. It is not meant to be fully generic, and the code might need to be adjusted.
This notebook is more an "example of what works" than a deep dive tutorial.
See stable-baselines3.readthedocs.io/ for more detailed information.
This notebook is tested with grid2op 1.10 and stable baselines 3 version 2.3.2 on an Ubuntu 20.04 machine.
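It is organised as follows: it first defines the environment wrapper, then trains a first policy, evaluates the trained agent, trains with several environments at the same time, uses other environments and action spaces, and finally customizes the policy network.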
It is unlikely that "simply" using a RL algorithm on a grid2op environment will lead to good results for the vast majority of environments.
To make RL algorithms work with more or less success you might want to:
- adjust the observation space: in particular, select the right information for your agent. With too much information, the size of the observation space blows up and your agent will not learn anything. With not enough information, your agent will not be able to capture anything.
- customize the action space: dealing with both discrete and continuous values is often a challenge, so you may want to focus on only one type of action. In all cases, try to reduce the number of actions your agent can perform. Indeed, for "larger" grids (118 substations; as a reference, the French grid counts more than 6,000 such substations) and even when limiting to 2 busbars per substation (as a reference, some substations have more than 12 such "busbars"), your agent can choose between more than 60,000 different discrete actions at each step. This is way too large for current RL algorithms as far as we know (and the proposed environments are small in comparison to real ones).
- customize the reward: the default reward might not work great for you. Ultimately, what TSOs or ISOs want is to operate the grid safely, for as long as possible, at a cost as low as possible. It is of course really hard to capture all of this in a single reward signal. Customizing the reward is also really important because the "do nothing" policy often leads to really good results (much better than random actions), which makes exploring different actions difficult. So you want to incentivize your agent to perform some actions at some point.
- use a fast simulator: even if you target an industrial application with industry grade simulators, we would still advise you to use a fast simulator for the vast majority of the training process (at least in the early stages of training) and then maybe fine tune on a better one.
- combine RL with some heuristics: it's super easy to implement things like "if there is no issue, then do nothing", whereas this can be quite time consuming to learn. Don't hesitate to check out the "l2rpn-baselines" repository for already "kind of working" heuristics.
And finally, don't hesitate to check the solutions proposed by winners of past l2rpn competitions in l2rpn-baselines.
You can also ask questions on our discord or on our github.
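As an illustration of the "if there is no issue, then do nothing" heuristic mentioned above, here is a minimal sketch in plain Python. It does not use grid2op: `rho` stands for the per-line loading ratio exposed by grid2op observations, while the function name `act_with_heuristic`, the 0.95 threshold and the toy policy are assumptions made for this example only.

```python
from typing import Callable, Sequence, TypeVar

Action = TypeVar("Action")


def act_with_heuristic(
    rho: Sequence[float],
    policy: Callable[[Sequence[float]], Action],
    do_nothing: Action,
    rho_threshold: float = 0.95,  # assumed value, to be tuned
) -> Action:
    """Return `do_nothing` while the grid looks safe, otherwise ask the policy."""
    # rho is the loading ratio of each powerline (flow / thermal limit)
    if max(rho) < rho_threshold:
        return do_nothing
    return policy(rho)


# toy usage with stand-in values (no grid2op needed)
DO_NOTHING = 0
# stand-in "policy": returns 1 + index of the most loaded line
toy_policy = lambda rho: 1 + max(range(len(rho)), key=lambda i: rho[i])

print(act_with_heuristic([0.30, 0.50, 0.70], toy_policy, DO_NOTHING))  # safe grid
print(act_with_heuristic([0.30, 0.99, 0.70], toy_policy, DO_NOTHING))  # one line close to its limit
```

With such a rule, the learned policy is only queried in the (usually rare) steps where at least one line is close to its limit.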
In the next cell, we define a custom environment (that internally uses the grid2op GymEnv class) that is needed for stable baselines 3.
Indeed, in order to work with stable baselines 3 you need to define a custom wrapper on top of the GymEnv wrapper. You then have:
- self._g2op_env, which is the underlying grid2op environment
- self._gym_env, which is the gymnasium Environment defined by grid2op, and that cannot be directly used with stable baselines 3
- Grid2opEnvWrapper, which is the wrapper on top of self._gym_env that makes it usable with stable baselines 3.
Stable baselines 3 expects the gymnasium environment to inherit from gymnasium.Env, and the wrapper below is initialized with a configuration dictionary. This is why you need to create the Grid2opEnvWrapper wrapper on top of GymEnv.
In the initialization of Grid2opEnvWrapper, the env_config variable is a dictionary that accepts the following keys:
- backend_cls: the class of the backend. If not provided, LightSimBackend from the lightsim2grid package is used.
- backend_options: the options used to create the backend of your environment. The backend is created by calling backend_cls(**backend_options). For example, if you want to build LightSimBackend(detailed_info_for_cascading_failure=False) you can pass {"backend_cls": LightSimBackend, "backend_options": {"detailed_info_for_cascading_failure": False}}.
- env_name: the name of the grid2op environment you want to use. By default it uses "l2rpn_case14_sandbox".
- env_is_test: whether to add test=True when creating the grid2op environment. If env_is_test is True, the environment is created with grid2op.make(..., test=True); otherwise test=False is used.
- obs_attr_to_keep: in this wrapper we only allow your agent to see a Box as an observation. This parameter controls which attributes of the grid2op observation are present in the agent observation space. By default it is ["rho", "p_or", "gen_p", "load_p"], which is "kind of random" and is probably not suited for every agent.
- act_type: controls the type of actions your agent can perform. Already coded in this notebook are:
  - "discrete" to use a Discrete action space
  - "box" to use a Box action space
  - "multi_discrete" to use a MultiDiscrete action space
- act_attr_to_keep: allows you to customize the action space. If not provided, it defaults to:
  - ["set_line_status_simple", "set_bus"] if act_type is "discrete"
  - ["redispatch", "set_storage", "curtail"] if act_type is "box"
  - ["one_line_set", "one_sub_set"] if act_type is "multi_discrete"
If you want to add more customization, for example the reward function or the parameters of the environment, feel free to get inspired by this code and extend it. Any PR in this regard is more than welcome.
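To make the defaults described above concrete, the sketch below resolves a partial env_config into a complete one, using plain Python only. The helper name `resolve_env_config` is hypothetical (it is not part of grid2op or of this notebook), and the backend keys are deliberately left out so that the example runs without lightsim2grid.

```python
import copy
from typing import Any, Dict, Optional

# default action attributes for each supported act_type, as listed above
DEFAULT_ACT_ATTR = {
    "discrete": ["set_line_status_simple", "set_bus"],
    "box": ["redispatch", "set_storage", "curtail"],
    "multi_discrete": ["one_line_set", "one_sub_set"],
}


def resolve_env_config(env_config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: fill in the documented defaults for the missing keys."""
    cfg = copy.deepcopy(env_config) if env_config else {}
    cfg.setdefault("env_name", "l2rpn_case14_sandbox")
    cfg["env_is_test"] = bool(cfg.get("env_is_test", False))
    cfg.setdefault("obs_attr_to_keep", ["rho", "p_or", "gen_p", "load_p"])
    cfg.setdefault("act_type", "discrete")
    if cfg["act_type"] not in DEFAULT_ACT_ATTR:
        raise NotImplementedError(
            f"action type '{cfg['act_type']}' is not currently supported."
        )
    # the default action attributes depend on the chosen act_type
    cfg.setdefault("act_attr_to_keep", list(DEFAULT_ACT_ATTR[cfg["act_type"]]))
    return cfg


# only "act_type" is given, every other key falls back to its default
print(resolve_env_config({"act_type": "box"}))
```

Passing a key explicitly always takes precedence over the default, which mirrors the `if "..." in env_config` checks of the wrapper.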
import copy
from typing import Dict, Literal, Any
import json

from gymnasium import Env
from gymnasium.spaces import Discrete, MultiDiscrete, Box

import grid2op
from grid2op.gym_compat import GymEnv, BoxGymObsSpace, DiscreteActSpace, BoxGymActSpace, MultiDiscreteActSpace
from lightsim2grid import LightSimBackend


class Grid2opEnvWrapper(Env):
    def __init__(self,
                 env_config: Dict[Literal["backend_cls",
                                          "backend_options",
                                          "env_name",
                                          "env_is_test",
                                          "obs_attr_to_keep",
                                          "act_type",
                                          "act_attr_to_keep"],
                                  Any] = None):
        super().__init__()
        if env_config is None:
            env_config = {}

        # handle the backend
        backend_cls = LightSimBackend
        if "backend_cls" in env_config:
            backend_cls = env_config["backend_cls"]
        backend_options = {}
        if "backend_options" in env_config:
            backend_options = env_config["backend_options"]
        backend = backend_cls(**backend_options)

        # create the grid2op environment
        env_name = "l2rpn_case14_sandbox"
        if "env_name" in env_config:
            env_name = env_config["env_name"]
        if "env_is_test" in env_config:
            is_test = bool(env_config["env_is_test"])
        else:
            is_test = False
        self._g2op_env = grid2op.make(env_name, backend=backend, test=is_test)
        # NB by default this might be really slow (when the environment is reset)
        # see https://grid2op.readthedocs.io/en/latest/data_pipeline.html for maybe 10x speed ups !
        # TODO customize reward or action_class for example !

        # create the gym env (from grid2op)
        self._gym_env = GymEnv(self._g2op_env)

        # customize observation space
        obs_attr_to_keep = ["rho", "p_or", "gen_p", "load_p"]
        if "obs_attr_to_keep" in env_config:
            obs_attr_to_keep = copy.deepcopy(env_config["obs_attr_to_keep"])
        self._gym_env.observation_space.close()
        self._gym_env.observation_space = BoxGymObsSpace(self._g2op_env.observation_space,
                                                         attr_to_keep=obs_attr_to_keep
                                                         )
        # export observation space for the Grid2opEnvWrapper
        self.observation_space = Box(shape=self._gym_env.observation_space.shape,
                                     low=self._gym_env.observation_space.low,
                                     high=self._gym_env.observation_space.high)

        # customize the action space
        act_type = "discrete"
        if "act_type" in env_config:
            act_type = env_config["act_type"]

        self._gym_env.action_space.close()
        if act_type == "discrete":
            # user wants a discrete action space
            act_attr_to_keep = ["set_line_status_simple", "set_bus"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = DiscreteActSpace(self._g2op_env.action_space,
                                                          attr_to_keep=act_attr_to_keep)
            self.action_space = Discrete(self._gym_env.action_space.n)
        elif act_type == "box":
            # user wants continuous action space
            act_attr_to_keep = ["redispatch", "set_storage", "curtail"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = BoxGymActSpace(self._g2op_env.action_space,
                                                        attr_to_keep=act_attr_to_keep)
            self.action_space = Box(shape=self._gym_env.action_space.shape,
                                    low=self._gym_env.action_space.low,
                                    high=self._gym_env.action_space.high)
        elif act_type == "multi_discrete":
            # user wants a multi-discrete action space
            act_attr_to_keep = ["one_line_set", "one_sub_set"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = MultiDiscreteActSpace(self._g2op_env.action_space,
                                                               attr_to_keep=act_attr_to_keep)
            self.action_space = MultiDiscrete(self._gym_env.action_space.nvec)
        else:
            raise NotImplementedError(f"action type '{act_type}' is not currently supported.")

    def reset(self, seed=None, options=None):
        # use default _gym_env (from grid2op.gym_compat module)
        # NB: here you can also specify "default options" when you reset, for example:
        # - limiting the duration of the episode "max step"
        # - starting at different steps "init ts"
        # - study difficult scenario "time serie id"
        # - specify an initial state of your grid "init state"
        return self._gym_env.reset(seed=seed, options=options)

    def step(self, action):
        # use default _gym_env (from grid2op.gym_compat module)
        return self._gym_env.step(action)
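In this section we quickly show how to create the gymnasium environment, which is an instance of the Grid2opEnvWrapper defined above, and how to train a first PPO policy on it.
This part, for stable baselines 3, is really small.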
from stable_baselines3 import PPO
gym_env = Grid2opEnvWrapper()
sb3_algo1 = PPO("MlpPolicy", gym_env, verbose=0)
sb3_algo1.learn(total_timesteps=1024)
This notebook is a simple quick introduction for stable baselines only. So we don't really recall everything that has been said previously.
Please consult the section 0) Recommended initial steps of the notebook 11_IntegrationWithExistingRLFrameworks for more information.
TL;DR: grid2op offers the possibility to test your agent on scenarios / episodes different from the ones it has been trained on. We greatly encourage you to use this functionality.
There are two main ways to evaluate your agent:
- you stay in the gymnasium world and evaluate your policy "manually" (first method below)
- you convert your policy to a grid2op agent and use the grid2op Runner to save and inspect what your policy has done (second method below).
We show here just a simple example to "get easily started". For much better working agents, you can have a look at the l2rpn-baselines code. There you have classes that map the environment, the agents etc. to grid2op directly (you don't have to copy paste any wrapper).
You can do pretty much what you want, but you have to do it yourself, or use any of the "Wrappers" available in gymnasium https://gymnasium.farama.org/main/api/wrappers/ (e.g. https://gymnasium.farama.org/main/api/wrappers/misc_wrappers/#gymnasium.wrappers.RecordEpisodeStatistics) or in your RL framework.
For the sake of simplicity, we show how to do things "manually" even though we do not recommend doing it like that.
nb_episode_test = 2
seeds_test_env = (0, 1)    # same size as nb_episode_test
seeds_test_agent = (3, 4)  # same size as nb_episode_test
ts_ep_test = (0, 1)        # same size as nb_episode_test
ep_infos = {}  # information that will be saved

for ep_test_num in range(nb_episode_test):
    init_obs, init_infos = gym_env.reset(seed=seeds_test_env[ep_test_num],
                                         options={"time serie id": ts_ep_test[ep_test_num]})
    sb3_algo1.set_random_seed(seeds_test_agent[ep_test_num])
    done = False
    cum_reward = 0
    step_survived = 0
    obs = init_obs
    while not done:
        act, _states = sb3_algo1.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = gym_env.step(act)
        step_survived += 1
        cum_reward += float(reward)
        done = terminated or truncated
    ep_infos[ep_test_num] = {"time serie id": ts_ep_test[ep_test_num],
                             "time serie folder": gym_env._gym_env.init_env.chronics_handler.get_id(),
                             "env seed": seeds_test_env[ep_test_num],
                             "agent seed": seeds_test_agent[ep_test_num],
                             "steps survived": step_survived,
                             "total steps": int(gym_env._gym_env.init_env.max_episode_duration()),
                             "cum reward": cum_reward}

# "prettyprint" the dictionary above
print(json.dumps(ep_infos, indent=4))
As you might have seen, it's not easy this way to retrieve useful information about the grid2op environment when this information is not passed to the policy.
For example, we need to call gym_env._gym_env.init_env to access the underlying grid2op environment. You also have to convert some values from int32 or float32 to int or float, otherwise json complains, and you have to control the seeds yourself to get reproducible results.
It's a quick way to have something working, but it could be improved.
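One possible way to avoid the manual int(...) and float(...) conversions mentioned above is to give json.dumps a fallback through its `default` argument. The sketch below is only an assumption about how this could be done: the helper name `json_default` is made up for this example, and it relies on the `tolist()` / `item()` methods that numpy arrays and numpy scalars provide.

```python
import json


def json_default(obj):
    """Fallback used by json.dumps for objects it cannot serialize natively."""
    # numpy arrays and numpy scalars (int32, float32, ...) expose `tolist()`,
    # which returns plain python lists / numbers
    if hasattr(obj, "tolist"):
        return obj.tolist()
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# builtin values are untouched: `default` is only called for unknown types
print(json.dumps({"steps survived": 12, "cum reward": 3.5}, indent=4, default=json_default))
```

In the evaluation loop above this would be used as json.dumps(ep_infos, indent=4, default=json_default).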
This second method is closer to the grid2op ecosystem: you will be able to use it with the grid2op Runner, save the results and read them back with other tools such as grid2viz, and do the evaluation in parallel without too much trouble (and with high reproducibility).
With this method, you build a grid2op agent and this agent can then be used like every other grid2op agent. For example you can compare it with heuristic agents, agents based on optimization etc.
This way of doing things also allows you to customize when the neural network policy is used. For example, you might choose to use it only when the grid is "unsafe" (and to use "expert" rules when the grid is safe).
This is more flexible than the previous method.
from grid2op.Agent import BaseAgent
from grid2op.Runner import Runner


class Grid2opAgentWrapper(BaseAgent):
    def __init__(self,
                 gym_env: Grid2opEnvWrapper,
                 trained_agent: PPO):
        self.gym_env = gym_env
        BaseAgent.__init__(self, gym_env._gym_env.init_env.action_space)
        self.trained_agent = trained_agent

    def act(self, obs, reward, done):
        # you can customize it here to call the NN policy `trained_agent`
        # only in some cases, depending on the observation for example
        gym_obs = self.gym_env._gym_env.observation_space.to_gym(obs)
        gym_act, _states = self.trained_agent.predict(gym_obs, deterministic=True)
        grid2op_act = self.gym_env._gym_env.action_space.from_gym(gym_act)
        return grid2op_act

    def seed(self, seed):
        # implement the seed function
        if seed is None:
            return
        seed_int = int(seed)
        if seed_int != seed:
            raise RuntimeError("Seed must be convertible to an integer")
        self.trained_agent.set_random_seed(seed_int)


my_agent = Grid2opAgentWrapper(gym_env, sb3_algo1)
runner = Runner(**gym_env._g2op_env.get_params_for_runner(),
                agentClass=None,
                agentInstance=my_agent)
res = runner.run(nb_episode=nb_episode_test,
                 env_seeds=seeds_test_env,
                 agent_seeds=seeds_test_agent,
                 episode_id=ts_ep_test,
                 add_detailed_output=True
                 )
res
See the documentation or the notebook 05 StudyYourAgent on how to use grid2op tools to study your agent, its decisions etc.
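The list returned by the runner can be turned into a small summary similar to the ep_infos dictionary of the manual evaluation. The sketch below assumes that each element of `res` starts with (episode path or id, episode name, cumulative reward, number of steps survived, maximum number of steps), possibly followed by detailed episode data. Please check this layout against the Runner documentation of your grid2op version. The helper name `summarize_runner_results` and the data used here are stand-ins, not real results.

```python
from typing import Any, Dict, List, Sequence


def summarize_runner_results(res: Sequence[Sequence[Any]]) -> List[Dict[str, Any]]:
    """Keep the first five fields of each episode (assumed layout, see above)."""
    summary = []
    for _path, name, cum_reward, nb_step, max_step, *_rest in res:
        summary.append({
            "episode": str(name),
            "cum reward": float(cum_reward),
            "steps survived": int(nb_step),
            "total steps": int(max_step),
            # an episode is "completed" if the agent survived until the end
            "completed": int(nb_step) == int(max_step),
        })
    return summary


# stand-in data following the assumed layout (NOT real results)
fake_res = [("/path/to/0000", "0000", 123.4, 50, 100, None),
            ("/path/to/0001", "0001", 456.7, 100, 100, None)]
for line in summarize_runner_results(fake_res):
    print(line)
```

With the real output, this would be called as summarize_runner_results(res).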
This, for now, only works on linux based computers. Hopefully this will work on windows and macos as soon as possible.
This allows you to use some "parallelism" during training: your agent interacts "at the same time" with 4 environments, which lets it gather experience faster. In this case, however, the training itself is always done in the "main" process.
from stable_baselines3.common.env_util import make_vec_env
vec_env = make_vec_env(lambda : Grid2opEnvWrapper(), n_envs=4)
sb3_algo2 = PPO("MlpPolicy", vec_env, verbose=0)
sb3_algo2.learn(total_timesteps=1024)
In this third example, we will train a policy using the "box" action space, and on another environment (l2rpn_idf_2023 instead of l2rpn_case14_sandbox).
# see https://stable-baselines3.readthedocs.io/ for the options of the algorithm
# Use a "Box" action space (mainly to use redispatching, curtailment and storage units)
env_config3 = {"env_name": "l2rpn_idf_2023",
               "env_is_test": True,
               "act_type": "box",
               }
gym_env3 = Grid2opEnvWrapper(env_config3)
sb3_algo3 = PPO("MlpPolicy", gym_env3, verbose=0)
sb3_algo3.learn(total_timesteps=1024)
And now a policy using the "multi discrete" action space:
# see https://stable-baselines3.readthedocs.io/ for the options of the algorithm
# Use a "MultiDiscrete" action space (by default with "one_line_set" and "one_sub_set")
env_config4 = {"env_name": "l2rpn_idf_2023",
               "env_is_test": True,
               "act_type": "multi_discrete",
               }
gym_env4 = Grid2opEnvWrapper(env_config4)
sb3_algo4 = PPO("MlpPolicy", gym_env4, verbose=0)
sb3_algo4.learn(total_timesteps=1024)
This notebook does not aim at covering all the possibilities offered by stable baselines 3. For that you need to refer to the stable baselines 3 documentation.
We will simply show how to change the size of the neural network used as a policy.
# see https://stable-baselines3.readthedocs.io/ for the options of the policy
gym_env5 = Grid2opEnvWrapper()
sb3_algo5 = PPO("MlpPolicy",
                gym_env5,
                verbose=0,
                policy_kwargs={"net_arch": [32, 32, 32]}
                )
sb3_algo5.learn(total_timesteps=1024)
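As a variation on the cell above, recent versions of stable baselines 3 also accept a dictionary for net_arch, with a "pi" entry for the policy network and a "vf" entry for the value network. The sketch below only builds the policy_kwargs dictionaries. The layer sizes and the name sb3_algo6 are arbitrary choices made for this example, and the PPO call is left commented out so that the snippet runs without any environment.

```python
# same layer sizes used for both the policy and the value networks
# (this is what the cell above does)
policy_kwargs_same = {"net_arch": [32, 32, 32]}

# different layer sizes for the policy ("pi") and the value ("vf") networks
policy_kwargs_separate = {"net_arch": {"pi": [32, 32], "vf": [64, 64]}}

# assumed usage, not executed here:
# sb3_algo6 = PPO("MlpPolicy", gym_env5, verbose=0, policy_kwargs=policy_kwargs_separate)
# sb3_algo6.learn(total_timesteps=1024)
print(policy_kwargs_separate)
```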