#!/usr/bin/env python
# coding: utf-8

# # Grid2Op integration with ray / rllib framework
#
# Try me out interactively with: [![Binder](./img/badge_logo.svg)](https://mybinder.org/v2/gh/Grid2Op/grid2op/master)
#
#
# **Objectives** This notebook briefly explains how to use grid2op with the ray (rllib) RL framework. Make sure to read the previous notebook 11_IntegrationWithExistingRLFrameworks.ipynb for a deeper dive into what happens. We only show the working solution here.
#
# It explains the ideas and shows a "self contained", somewhat minimal, example of use of the ray / rllib framework with grid2op. It is not meant to be fully generic: the code might need to be adjusted.
#
# This notebook is more an "example of what works" than a deep dive tutorial.
#
# See https://docs.ray.io/en/latest/rllib/rllib-env.html#configuring-environments for more detailed information.
#
# See also https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html for other details.
#
# This notebook is tested with grid2op 1.10.2 and ray 2.24.0 (python 3.10) on an ubuntu 20.04 machine.
#
# We found that ray is highly "unstable": the documentation is not really on par with their development rhythm. Basically, this notebook works with the exact python and ray versions given above. If you change them, you might need to modify the calls to ray.
#
# It is organised as follows:
#
# - [0 Some tips to get started](#0-some-tips-to-get-started): a reminder of what you can do to make things work. Indeed, this notebook explains "how to use grid2op with ray / rllib" but not "how to create a working agent able to operate a real powergrid in real time with ray / rllib". We wish we could explain the latter...
# - [1 Create the "Grid2opEnvWrapper" class](#1-create-the-grid2openvwrapper-class): explains how to create the main grid2op env class that you can use as a "gymnasium" environment.
# - [2 Create an environment, and train a first policy](#2-create-an-environment-and-train-a-first-policy): shows how to create an environment from the class above (it is pretty easy).
# - [3 Evaluate the trained agent](#3-evaluate-the-trained-agent): shows how to evaluate the trained "agent".
# - [4 Some customizations](#4-some-customizations): explains how to perform some customization of your agent / environment / policy.

# ## 0 Some tips to get started
#
# It is unlikely that "simply" using a RL algorithm on a grid2op environment will lead to good results for the vast majority of environments.
#
# To make RL algorithms work with more or less success you might want to:
#
# 1) adjust the observation space: in particular by selecting the right information for your agent. Too much information
#    and the size of the observation space will blow up and your agent will not learn anything. Not enough
#    information and your agent will not be able to capture anything.
#
# 2) customize the action space: dealing with both discrete and continuous values is often a challenge. So maybe you want
#    to focus on only one type of action. And in all cases, try to reduce the number of actions your agent
#    can perform. Indeed, for "larger" grids (118 substations; as a reference the french grid counts more than 6,000
#    such substations...) and even when limiting each substation to 2 busbars (as a reference, for some substations you have
#    more than 12 such "busbars"), your agent will have the opportunity to choose between more than 60,000 different discrete
#    actions at each step.
#    This is way too large for current RL algorithms as far as we know (and the proposed environments are
#    small in comparison to real ones).
#
# 3) customize the reward: the default reward might not work great for you. Ultimately, what TSOs or ISOs want is
#    to operate the grid safely, as long as possible and with a cost as low as possible. It is of course really hard to
#    capture all of this in one single reward signal. Customizing the reward is also really important because the "do
#    nothing" policy often leads to really good results (much better than random actions), which makes the exploration
#    of different actions difficult. So you kind of want to incentivize your agent to perform some actions at some point.
#
# 4) use a fast simulator: even if you target an industrial application with industry grade simulators, we would still
#    advise you to use a fast simulator (at least at an early stage of training) for the vast majority of the training
#    process and then maybe to fine tune on a better one.
#
# 5) combine RL with some heuristics: it's super easy to implement things like "if there is no issue, then do
#    nothing". This can be quite time consuming to learn though. Don't hesitate to check out the "l2rpn-baselines"
#    repository for already "kind of working" heuristics.
#
# And finally, don't hesitate to check the solutions proposed by winners of past l2rpn competitions in l2rpn-baselines.
#
# You can also ask questions on our discord or on our github.
#
# ## 1 Create the "Grid2opEnvWrapper" class
#
# In the next cell, we define a custom environment (that will internally use the `GymEnv` grid2op class). It is not strictly needed.
#
# Indeed, in order to work with ray / rllib you need to define a custom wrapper on top of the GymEnv wrapper. You then have:
#
# - `self._g2op_env` which is the default grid2op environment, receiving grid2op Actions and producing grid2op Observations.
# - `self._gym_env` which is the grid2op-defined `gymnasium Environment` that cannot be directly used with ray / rllib
# - `Grid2opEnvWrapper` which is the wrapper on top of `self._gym_env` to make it usable with ray / rllib.
#
# Ray / rllib expects the gymnasium environment to inherit from `gymnasium.Env` and to be initialized with a given configuration. This is why you need to create the `Grid2opEnvWrapper` wrapper on top of `GymEnv`.
#
# In the initialization of `Grid2opEnvWrapper`, the `env_config` variable is a dictionary that can take as key-word arguments:
#
# - `backend_cls`: the class of the backend. If not provided, it will use `LightSimBackend` from the `lightsim2grid` package
# - `backend_options`: the options that will be used to create the backend for your environment. Your backend will be created by calling
#   `backend_cls(**backend_options)`, for example if you want to build `LightSimBackend(detailed_info_for_cascading_failure=False)` you can pass `{"backend_cls": LightSimBackend, "backend_options": {"detailed_info_for_cascading_failure": False}}`
# - `env_name`: name of the grid2op environment you want to use, by default it uses `"l2rpn_case14_sandbox"`
# - `env_is_test`: whether to add `test=True` when creating the grid2op environment (if `env_is_test` is True it will call `grid2op.make(..., test=True)`, otherwise it uses `test=False`)
# - `obs_attr_to_keep`: in this wrapper we only allow your agent to see a Box as an observation. This parameter allows you to control which attributes of the grid2op observation will be present in the agent observation space.
#   By default it's `["rho", "p_or", "gen_p", "load_p"]`, which is "kind of random" and is probably not suited for every agent.
# - `act_type`: controls the type of actions your agent will be able to perform. Already coded in this notebook are:
#   - `"discrete"` to use a `Discrete` action space
#   - `"box"` to use a `Box` action space
#   - `"multi_discrete"` to use a `MultiDiscrete` action space
# - `act_attr_to_keep`: allows you to customize the action space. If not provided, it defaults to:
#   - `["set_line_status_simple", "set_bus"]` if `act_type` is `"discrete"`
#   - `["redispatch", "set_storage", "curtail"]` if `act_type` is `"box"`
#   - `["one_line_set", "one_sub_set"]` if `act_type` is `"multi_discrete"`
#
# If you want to add more customization, for example of the reward function, the parameters of the environment, etc., feel free to get inspired by this code and extend it. Any PR in this regard is more than welcome.

# In[ ]:


from gymnasium import Env
from gymnasium.spaces import Discrete, MultiDiscrete, Box

import json

import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms import ppo

from typing import Dict, Literal, Any
import copy

import grid2op
from grid2op.gym_compat import GymEnv, BoxGymObsSpace, DiscreteActSpace, BoxGymActSpace, MultiDiscreteActSpace
from lightsim2grid import LightSimBackend


class Grid2opEnvWrapper(Env):
    def __init__(self,
                 env_config: Dict[Literal["backend_cls",
                                          "backend_options",
                                          "env_name",
                                          "env_is_test",
                                          "obs_attr_to_keep",
                                          "act_type",
                                          "act_attr_to_keep"],
                                  Any] = None):
        super().__init__()
        if env_config is None:
            env_config = {}

        # handle the backend
        backend_cls = LightSimBackend
        if "backend_cls" in env_config:
            backend_cls = env_config["backend_cls"]
        backend_options = {}
        if "backend_options" in env_config:
            backend_options = env_config["backend_options"]
        backend = backend_cls(**backend_options)

        # create the grid2op environment
        env_name = "l2rpn_case14_sandbox"
        if "env_name" in env_config:
            env_name = env_config["env_name"]
        if "env_is_test" in env_config:
            is_test = bool(env_config["env_is_test"])
        else:
            is_test = False
        self._g2op_env = grid2op.make(env_name, backend=backend, test=is_test)
        # NB by default this might be really slow (when the environment is reset)
        # see https://grid2op.readthedocs.io/en/latest/data_pipeline.html for maybe 10x speed ups !
        # TODO customize reward or action_class for example !
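        # For example (just a sketch, left commented out on purpose): the reward used during training
        # could be customized when building the grid2op environment, e.g. with the ``reward_class``
        # keyword of ``grid2op.make`` (``LinesCapacityReward`` is one of the rewards shipped with grid2op):
        # from grid2op.Reward import LinesCapacityReward
        # self._g2op_env = grid2op.make(env_name, backend=backend, test=is_test,
        #                               reward_class=LinesCapacityReward)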

        # create the gym env (from grid2op)
        self._gym_env = GymEnv(self._g2op_env)

        # customize observation space
        obs_attr_to_keep = ["rho", "p_or", "gen_p", "load_p"]
        if "obs_attr_to_keep" in env_config:
            obs_attr_to_keep = copy.deepcopy(env_config["obs_attr_to_keep"])
        self._gym_env.observation_space.close()
        self._gym_env.observation_space = BoxGymObsSpace(self._g2op_env.observation_space,
                                                         attr_to_keep=obs_attr_to_keep)
        # export observation space for the Grid2opEnvWrapper
        self.observation_space = Box(shape=self._gym_env.observation_space.shape,
                                     low=self._gym_env.observation_space.low,
                                     high=self._gym_env.observation_space.high)

        # customize the action space
        act_type = "discrete"
        if "act_type" in env_config:
            act_type = env_config["act_type"]
        self._gym_env.action_space.close()
        if act_type == "discrete":
            # user wants a discrete action space
            act_attr_to_keep = ["set_line_status_simple", "set_bus"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = DiscreteActSpace(self._g2op_env.action_space,
                                                          attr_to_keep=act_attr_to_keep)
            self.action_space = Discrete(self._gym_env.action_space.n)
        elif act_type == "box":
            # user wants a continuous action space
            act_attr_to_keep = ["redispatch", "set_storage", "curtail"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = BoxGymActSpace(self._g2op_env.action_space,
                                                        attr_to_keep=act_attr_to_keep)
            self.action_space = Box(shape=self._gym_env.action_space.shape,
                                    low=self._gym_env.action_space.low,
                                    high=self._gym_env.action_space.high)
        elif act_type == "multi_discrete":
            # user wants a multi-discrete action space
            act_attr_to_keep = ["one_line_set", "one_sub_set"]
            if "act_attr_to_keep" in env_config:
                act_attr_to_keep = copy.deepcopy(env_config["act_attr_to_keep"])
            self._gym_env.action_space = MultiDiscreteActSpace(self._g2op_env.action_space,
                                                               attr_to_keep=act_attr_to_keep)
            self.action_space = MultiDiscrete(self._gym_env.action_space.nvec)
        else:
            raise NotImplementedError(f"action type '{act_type}' is not currently supported.")

    def reset(self, seed=None, options=None):
        # use the default _gym_env (from the grid2op.gym_compat module)
        # NB: here you can also specify "default options" when you reset, for example:
        # - limiting the duration of the episode ("max step")
        # - starting at different steps ("init ts")
        # - studying a difficult scenario ("time serie id")
        # - specifying an initial state of your grid ("init state")
        return self._gym_env.reset(seed=seed, options=options)

    def step(self, action):
        # use the default _gym_env (from the grid2op.gym_compat module)
        return self._gym_env.step(action)


# ## 2 Create an environment, and train a first policy

# Now we init ray, because we need to.

# In[ ]:


ray.init()


# In[ ]:


# example from the documentation, directly
# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Construct a generic config object, specifying values within different
# sub-categories, e.g. "training".
env_config = {}
config = (PPOConfig().training(gamma=0.9, lr=0.01)
          .environment(env=Grid2opEnvWrapper, env_config=env_config)
          .resources(num_gpus=0)
          .env_runners(num_env_runners=0)
          .framework("tf2")
          )

# A config object can be used to construct the respective Algorithm.
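# NB (a side note, just a sketch): instead of passing the class directly, you could also register
# the environment by name with ray and refer to it with a string, which can be convenient when
# using other parts of ray (e.g. tune):
# from ray.tune.registry import register_env
# register_env("grid2op_env_wrapper", lambda cfg: Grid2opEnvWrapper(cfg))
# config = config.environment(env="grid2op_env_wrapper", env_config=env_config)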
rllib_algo = config.build()


# Now we train it for one training iteration (this might call `env.reset()` and `env.step()` multiple times; see ray's documentation for a better understanding of what happens here, and don't hesitate to open an issue or a PR to explain it and we'll add it here, thanks).

# In[ ]:


print(rllib_algo.train())


# ## 3 Evaluate the trained agent
#
# This notebook is a simple quick introduction for ray / rllib only. So we don't really recall everything that has been said previously.
#
# Please consult the section `0) Recommended initial steps` of the notebook [11_IntegrationWithExistingRLFrameworks](./11_IntegrationWithExistingRLFrameworks.ipynb) for more information.
#
# **TL;DR** grid2op offers the possibility to test your agent on scenarios / episodes different from the ones it has been trained on. We greatly encourage you to use this functionality.
#
# There are two main ways to evaluate your agent:
#
# - you stay in the "gymnasium" world (see [here](#31-staying-in-the-gymnasium-ecosystem)) and you evaluate your policy directly just like you would with any other gymnasium compatible environment. Simple, easy, but without support for some grid2op features.
# - you "get back" to the "grid2op" world (detailed [here](#32-using-the-grid2op-ecosystem)) by "converting" your NN policy into something that is able to output grid2op-like actions. This introduces yet another "wrapper" but you can benefit from all grid2op features, such as the `Runner`, to save and inspect what your policy has done.
#
# We show here just simple examples to get started easily. For much better working agents, you can have a look at the l2rpn-baselines code. There you have classes that map the environment, the agents etc. to grid2op directly (you don't have to copy paste any wrapper).
#
# ### 3.1 staying in the gymnasium ecosystem
#
# You can do pretty much what you want, but you have to do it yourself, or use any of the "Wrappers" available in gymnasium https://gymnasium.farama.org/main/api/wrappers/ (*eg* https://gymnasium.farama.org/main/api/wrappers/misc_wrappers/#gymnasium.wrappers.RecordEpisodeStatistics) or in your RL framework.
#
# For the sake of simplicity, we show how to do things "manually" even though we do not recommend doing it like that.
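# As a point of comparison, here is a minimal sketch of the "wrapper" route (assuming the gymnasium `RecordEpisodeStatistics` API; double check the exact keys returned in `info["episode"]` against your gymnasium version). It is optional and not used by the manual loop that follows.

# In[ ]:


# optional sketch: let a gymnasium wrapper accumulate the episode statistics for us
# (cumulative reward "r", episode length "l" and elapsed time "t" in info["episode"])
from gymnasium.wrappers import RecordEpisodeStatistics

wrapped_env = RecordEpisodeStatistics(Grid2opEnvWrapper(env_config))
obs, info = wrapped_env.reset(seed=0)
done = False
while not done:
    act = rllib_algo.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = wrapped_env.step(act)
    done = terminated or truncated
print(info["episode"])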
# In[ ]:


nb_episode_test = 2
seeds_test_env = (0, 1)    # same size as nb_episode_test
seeds_test_agent = (3, 4)  # same size as nb_episode_test
ts_ep_test = (0, 1)        # same size as nb_episode_test

gym_env = Grid2opEnvWrapper(env_config)


# In[ ]:


ep_infos = {}  # information that will be saved
for ep_test_num in range(nb_episode_test):
    init_obs, init_infos = gym_env.reset(seed=seeds_test_env[ep_test_num],
                                         options={"time serie id": ts_ep_test[ep_test_num]})
    # TODO seed the agent, I did not find how to do it in the ray documentation
    done = False
    cum_reward = 0
    step_survived = 0
    obs = init_obs
    while not done:
        act = rllib_algo.compute_single_action(obs, explore=False)
        obs, reward, terminated, truncated, info = gym_env.step(act)
        step_survived += 1
        cum_reward += float(reward)
        done = terminated or truncated
    ep_infos[ep_test_num] = {"time serie id": ts_ep_test[ep_test_num],
                             "time serie folder": gym_env._gym_env.init_env.chronics_handler.get_id(),
                             "env seed": seeds_test_env[ep_test_num],
                             "agent seed": seeds_test_agent[ep_test_num],
                             "steps survived": step_survived,
                             "total steps": int(gym_env._gym_env.init_env.max_episode_duration()),
                             "cum reward": cum_reward}


# In[ ]:


# "pretty print" the dictionary above
print(json.dumps(ep_infos, indent=4))


# As you might have seen, it's not easy this way to retrieve some useful information about the grid2op environment if this information is not passed to the policy.
#
# For example, we need to call `gym_env._gym_env.init_env` to access the underlying grid2op environment... You have to convert some things from int32 or float32 to int or float otherwise json complains, you have to control the seeds yourself to have reproducible results, etc.
#
# It's a quick way to have something working, but it could be improved.

# ### 3.2 using the grid2op ecosystem
#
# This second method brings you closer to the grid2op ecosystem: you will be able to use your agent with the grid2op `Runner`, save the results and read them back with other tools such as grid2viz, and do the evaluation in parallel without too much trouble (and with high reproducibility).
#
# With this method, you build a grid2op agent and this agent can then be used like every other grid2op agent. For example you can compare it with heuristic agents, agents based on optimization, etc.
#
# This way of doing things also allows you to customize when the neural network policy is used. For example, you might choose to use it only when the grid is "unsafe" (and if the grid is safe you use "expert" rules). A short sketch of this idea is given at the end of this notebook.
#
# This is more flexible than the previous method.
# In[ ]:


from grid2op.Agent import BaseAgent
from grid2op.Runner import Runner


class Grid2opAgentWrapper(BaseAgent):
    def __init__(self,
                 gym_env: Grid2opEnvWrapper,
                 trained_agent):
        self.gym_env = gym_env
        BaseAgent.__init__(self, gym_env._gym_env.init_env.action_space)
        self.trained_agent = trained_agent

    def act(self, obs, reward, done):
        # you can customize it here to call the NN policy `trained_agent`
        # only in some cases, depending on the observation for example
        gym_obs = self.gym_env._gym_env.observation_space.to_gym(obs)
        gym_act = self.trained_agent.compute_single_action(gym_obs, explore=False)
        grid2op_act = self.gym_env._gym_env.action_space.from_gym(gym_act)
        return grid2op_act

    def seed(self, seed):
        # implement the seed function
        # TODO
        return


# In[ ]:


my_agent = Grid2opAgentWrapper(gym_env, rllib_algo)
runner = Runner(**gym_env._g2op_env.get_params_for_runner(),
                agentClass=None,
                agentInstance=my_agent)


# In[ ]:


res = runner.run(nb_episode=nb_episode_test,
                 env_seeds=seeds_test_env,
                 agent_seeds=seeds_test_agent,
                 episode_id=ts_ep_test,
                 add_detailed_output=True
                 )


# In[ ]:


res


# ## 4 Some customizations
#
# ### 4.1 Train a PPO agent using 2 "runners" to make the rollouts
#
# In this second example, we briefly explain how to train the model using 2 "processes". That is, the agent will interact with 2 environments at the same time during the "rollout" phases.
#
# But everything related to the training of the agent is still done on the main process (and in this case not using a GPU but only a CPU).

# In[ ]:


# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# use multiple runners
config2 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config={})
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
           )

# A config object can be used to construct the respective Algorithm.
rllib_algo2 = config2.build()


# Now we train it for one training iteration (this might call `env.reset()` and `env.step()` multiple times)

# In[ ]:


print(rllib_algo2.train())


# ### 4.2 Use non default parameters to make the grid2op environment
#
# In this third example, we will train a policy using the "box" action space, on another environment (`l2rpn_idf_2023` instead of `l2rpn_case14_sandbox`).

# In[ ]:


# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Use a "Box" action space (mainly to use redispatching, curtailment and storage units)
env_config3 = {"env_name": "l2rpn_idf_2023",
               "env_is_test": True,
               "act_type": "box",
               }
config3 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config=env_config3)
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
           )

# A config object can be used to construct the respective Algorithm.
rllib_algo3 = config3.build()


# Now we train it for one training iteration (this might call `env.reset()` and `env.step()` multiple times)

# In[ ]:


print(rllib_algo3.train())


# And now a policy using the "multi discrete" action space:

# In[ ]:


# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Use a "MultiDiscrete" action space
env_config4 = {"env_name": "l2rpn_idf_2023",
               "env_is_test": True,
               "act_type": "multi_discrete",
               }
config4 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config=env_config4)
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
           )

# A config object can be used to construct the respective Algorithm.
rllib_algo4 = config4.build()


# Now we train it for one training iteration (this might call `env.reset()` and `env.step()` multiple times)

# In[ ]:


print(rllib_algo4.train())


# ### 4.3 Customize the policy (number of layers, size of layers etc.)
#
# This notebook does not aim at covering all the possibilities offered by ray / rllib. For that you need to refer to the ray / rllib documentation.
#
# We will simply show how to change the size of the neural network used as a policy.

# In[ ]:


# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html

# Customize the size of the neural network used as a policy
config5 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config={})
           .resources(num_gpus=0)
           .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)
           .framework("tf2")
           .rl_module(
               model_config_dict={"fcnet_hiddens": [32, 32, 32]},  # 3 fully connected layers of 32 units each
           )
           )

# A config object can be used to construct the respective Algorithm.
rllib_algo5 = config5.build()


# Now we train it for one training iteration (this might call `env.reset()` and `env.step()` multiple times)

# In[ ]:


print(rllib_algo5.train())
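# ### 4.4 Going further: combine the trained policy with a simple heuristic
#
# As mentioned in the tips of section 0 and in section 3.2, you can combine the trained policy with hand-made rules. Below is a minimal sketch (assuming a hand-picked threshold on `rho`, the line loading; both the threshold value and the rule itself are purely illustrative and would need tuning): the agent does nothing while the grid is "safe" and only queries the neural network policy otherwise. It reuses the `Grid2opAgentWrapper`, `gym_env` and `rllib_algo` objects defined above, and can be evaluated with the grid2op `Runner` exactly like `my_agent`.

# In[ ]:


class HeuristicGrid2opAgent(Grid2opAgentWrapper):
    """Sketch of a "heuristic + RL" agent: do nothing when the grid is safe."""
    def __init__(self, gym_env, trained_agent, rho_threshold=0.95):
        super().__init__(gym_env, trained_agent)
        self.rho_threshold = rho_threshold  # illustrative value, to be tuned

    def act(self, obs, reward, done):
        if obs.rho.max() < self.rho_threshold:
            # no line is close to its thermal limit: the grid is considered "safe", do nothing
            return self.action_space({})
        # otherwise, delegate to the trained policy
        return super().act(obs, reward, done)


heuristic_agent = HeuristicGrid2opAgent(gym_env, rllib_algo)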