Becoming Human: Artificial Intelligence Magazine – Medium

Safe Reinforcement Learning — Part I

Many practitioners using reinforcement learning (RL) are concerned about the safety of state-of-the-art deep RL techniques. With that in mind, many researchers have proposed various formulations and methods (see Sections 1 and 2 of this work). The constrained Markov decision process (cMDP) formulation is one of the most promising directions. This series follows our recent work on safe RL, introducing MDPs, cMDPs, and state augmentations to determine policies with almost-sure safety guarantees. In this first post, we detail constrained Markov decision processes and present a simple yet effective way to find feasible policies.

Markov Decision Processes:

As many already know, basic RL methods adopt Markov decision processes (MDPs), defined as a five-tuple (X, A, P, C, γ). Here, X and A denote the state and action spaces, and P is the transition density that dictates the transition to a new state from the current state when applying an action (according to a policy). The fourth component, C, is the cost function (negative reward) — a function of states and actions — that encodes the agent’s goal. The last component, γ, is a discount factor (typically set to be less than one) used to trade off immediate versus long-term costs (rewards).

The goal of an RL agent is to find a policy (an action-selection rule) that minimises the expected total discounted cost. Here, the expectation is taken with respect to the distributions that encode the problem’s stochasticity, e.g., the transition dynamics, the policy, etc. Of course, the community has focused on solving the minimisation problem above.
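To make the objective concrete, here is a minimal Monte Carlo sketch of the expected total discounted cost for a toy MDP. Everything in it — the two-state dynamics, the cost numbers, and the discount factor — is made up for illustration and does not come from the article:

```python
import random

# Illustrative only: a tiny two-state MDP with made-up dynamics and costs.
# States: 0, 1; actions: 0, 1.
GAMMA = 0.99  # discount factor, assumed < 1 as in the text

def step(state, action):
    """Toy transition density P and cost C (hypothetical numbers)."""
    weights = [0.7, 0.3] if action == 0 else [0.2, 0.8]
    next_state = random.choices([0, 1], weights=weights)[0]
    cost = 1.0 if state == 1 else 0.1  # cost = negative reward, as in the text
    return next_state, cost

def discounted_cost(policy, horizon=200, episodes=1000):
    """Monte Carlo estimate of the expected total discounted cost of a policy."""
    total = 0.0
    for _ in range(episodes):
        state, ret = 0, 0.0
        for t in range(horizon):
            action = policy(state)
            state, cost = step(state, action)
            ret += (GAMMA ** t) * cost
        total += ret
    return total / episodes

# Evaluate a fixed policy that always picks action 0.
print(discounted_cost(lambda s: 0))
```

An RL algorithm would then search over policies to drive this estimate down; the sketch only evaluates one fixed policy.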
It proposed various algorithms such as trust-region policy optimisation (TRPO), proximal policy optimisation (PPO), soft actor-critic (SAC), Rainbow, PETS, natural actor-critic (NAC), and many others. Those developments, in turn, led to many successes that are beyond the scope of this post to list. Despite those successes, many problems remain, including but not limited to sample efficiency, simulation availability, robustness, interpretability, stability, and, of course, safety. In what follows, we focus on the latter and introduce a plug-and-play method capable of ensuring almost-sure (with probability one) safety.

Constrained Markov Decision Processes (cMDPs):

When considering safety, we should ask how to encode such design specifications. Constrained MDPs offer a reliable and rigorous formulation that allows us to introduce constraint (or safety) functions. Simply put, cMDPs generalise standard MDPs by incorporating an additional function that represents safety considerations, together with a safety-specific discount factor. The first five components exactly match those of the standard MDP we previously introduced. The additional components are a constraint/safety function Z and a discount factor that we use to signify that policies must be feasible with respect to the design choices formalised through Z.

Given a cMDP, our goal is to find a policy that minimises the total cost as before (the standard MDP case) while satisfying the safety constraints. The minimisation objective remains intact compared to the MDP case; the constraint, on the other hand, is new. The summation term inside the constraint discounts all safety costs over the time horizon. The term d, moreover, is a safety budget that we wish to stay within, i.e., the difference between d and the discounted safety cost should be greater than or equal to zero. With this in mind, please note that not every policy minimising the task’s cost is acceptable.
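As a sketch, the feasibility check in the constraint above can be computed directly from a trajectory of per-step safety costs. The budget d, the safety discount, and the trajectory values below are hypothetical numbers chosen for illustration:

```python
# Illustrative sketch of the cMDP feasibility check: a policy is feasible
# iff the budget d minus the discounted safety cost is nonnegative.
GAMMA_Z = 0.99   # safety-specific discount factor (made-up value)
d = 5.0          # safety budget (made-up value)

def discounted_safety_cost(safety_costs, gamma_z=GAMMA_Z):
    """Sum of gamma_z**t * z_t over a trajectory of safety costs z_t."""
    return sum((gamma_z ** t) * z for t, z in enumerate(safety_costs))

def is_feasible(safety_costs, budget=d):
    """Feasible iff budget minus discounted safety cost is >= 0."""
    return budget - discounted_safety_cost(safety_costs) >= 0

trajectory = [0.1] * 30            # hypothetical per-step safety costs
print(is_feasible(trajectory))     # discounted cost ~ 2.6, within d = 5
```

In practice the constraint is on the expectation of this quantity over trajectories, so one would average the discounted safety cost over many rollouts before comparing it to d.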
We need those that are feasible with respect to the constraint.

One way to solve the problem above is to follow the standard constrained-optimisation literature: introduce a Lagrange multiplier λ ≥ 0 and rewrite the problem as a min-max objective; see Chapter 5 of this fantastic book for reference. Writing J(π) for the task objective and J_Z(π) for the discounted safety cost, we arrive at the min-max problem: minimise over policies π the maximum over λ ≥ 0 of J(π) + λ(J_Z(π) − d). We can easily see that, for every fixed policy, the optimal value of the Lagrangian takes a simple form: if the constraint slack d − J_Z(π) is nonnegative, the maximising multiplier is 0 — i.e., we focus on the task cost at hand. If, on the other hand, the slack is negative, maximising over the multiplier yields infinity — i.e., the safety cost dominates. For a more detailed analysis, intuition, and formalism, see Bertsekas’ work from 1997.

Note that the expected safety cost is not the only form of constraint we can consider. Many works have extended the above derivation to, for example, conditional value at risk, and proposed actor-critic or policy-based optimisation techniques. If you are interested in those, make sure to follow us, in addition to the works of Shie Mannor, Mohammad Ghavamzadeh, Yinlam Chow, Andreas Krause (safe RL and safe BO), Shimon Whiteson (risk-averse RL), Marc Rigter, Marc Deisenroth, Aviv Tamar, Joshua Achiam, Pieter Abbeel, Warren B. Powell, and Alexander Shapiro. Of course, this list is not exhaustive. If someone you would like to see here is missing, please feel free to comment and let me know; I will then add them.

In the next post, we will detail the Saute RL case and present a new form of MDPs. Stay tuned!
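As a closing sketch, the Lagrangian min-max problem above can be tackled with a simple primal-dual (dual-ascent) scheme: gradient descent on the policy parameters, gradient ascent on the multiplier. The scalar policy parametrisation and the quadratic cost and safety functions below are hypothetical stand-ins, chosen so that the constrained optimum is known in closed form (θ* = 1, λ* = 1):

```python
# Illustrative dual-ascent sketch of min_theta max_{lam >= 0} J + lam*(J_Z - d).
# The functions and numbers are made up; nothing here is from the article.

def task_cost(theta):
    return (theta - 2.0) ** 2      # J: unconstrained minimum at theta = 2

def safety_cost(theta):
    return theta ** 2              # J_Z: the (discounted) safety cost

d = 1.0                            # safety budget, so the constraint is theta**2 <= 1
theta, lam = 0.0, 0.0
lr_theta, lr_lam = 0.05, 0.05

for _ in range(2000):
    # Primal step: gradient descent on the Lagrangian in theta
    grad_theta = 2 * (theta - 2.0) + lam * 2 * theta
    theta -= lr_theta * grad_theta
    # Dual step: gradient ascent on lambda, projected back onto lam >= 0
    lam = max(0.0, lam + lr_lam * (safety_cost(theta) - d))

print(theta, lam)
```

The iterates settle near the constrained optimum rather than the unconstrained one: the multiplier grows whenever the safety cost exceeds the budget, pushing the primal step back into the feasible region, exactly the mechanism the Lagrangian argument above describes.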