Research and Advances
Artificial Intelligence and Machine Learning

Specification-Guided Reinforcement Learning

A tutorial-style introduction to recent research on using logical specifications to encode RL tasks illustrates theoretical limitations and practical solutions.

In reinforcement learning (RL), an agent learns to achieve its goal by interacting with its environment and learning from feedback about its successes and failures. This feedback is typically encoded as numerical rewards—for example, positive rewards for beneficial outcomes and negative rewards for harmful outcomes. Traditional RL algorithms assume the reward is provided by the user, and designing an effective reward function can be a daunting task. For example, consider a robot organizing boxes in a warehouse. To ensure the robot successfully moves boxes to the right locations while avoiding collisions, we might provide positive rewards for moving boxes in the right directions and negative rewards for collisions. However, there is a wide variety of ways to implement such a reward function: Should rewards be given immediately after moving a box or after multiple successful moves? How should failures such as incorrect placements or collisions be handled? There is also the question of balancing the objective with safety constraints—if the warehouse robot receives rewards solely for obstacle avoidance, it might focus on safety and never move, whereas overweighting the objective may lead it to take dangerous shortcuts. Beyond the question of balancing the objective with safety constraints, the reward function may also need to introduce new variables to keep track of progress—for example, which rooms still have boxes that need to be cleared.

The art of designing an effective reward function is called reward engineering. The challenge is that reward engineering conflates the problem of task specification (i.e., what is the task to be performed?) with the problem of reward shaping (i.e., what rewards best guide the robot to solve the task?). Recent work has proposed to separate these concerns by using logical specifications to encode the desired behavior of an RL agent, which can automatically be compiled into shaped rewards to train the agent. Temporal logics such as linear temporal logic (LTL)20 are particularly suitable for specifying RL tasks, as they provide rigorous syntax and semantics for reasoning about sequences of events and states over time. For instance, instead of crafting numerical rewards, the desired behavior of the warehouse robot can be encoded as a sequence of logical steps: (1) Go to object while avoiding obstacles, (2) Pick up object, (3) Go to target location while avoiding obstacles, and (4) Place object. Each step can be encoded using logical predicates. If g denotes a target location and o denotes an obstacle location, then the requirement of safely visiting location g can be written in LTL as (Eventually g) ∧ (Always ¬o), where Eventually and Always are temporal operators.
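
To make the semantics of these temporal operators concrete, here is a small illustrative sketch (not from the article) that evaluates the requirement "eventually reach g while always avoiding o" on a finite trace; the grid positions and the trace are hypothetical.

```python
# Boolean semantics of Eventually/Always over a finite trace of states.
def eventually(pred, trace):
    """True if pred holds at some step of the trace."""
    return any(pred(s) for s in trace)

def always(pred, trace):
    """True if pred holds at every step of the trace."""
    return all(pred(s) for s in trace)

# Hypothetical grid locations: g is the goal cell, o the obstacle cell.
g, o = (3, 3), (1, 2)
trace = [(0, 0), (1, 0), (2, 1), (3, 2), (3, 3)]

# (Eventually g) and (Always not o): safely visiting g.
satisfied = eventually(lambda s: s == g, trace) and always(lambda s: s != o, trace)
print(satisfied)  # True
```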

The key question then is how to train an agent to satisfy a given logical specification. One strategy would be to compile the specification into a shaped reward function, which can then be used in conjunction with a traditional RL algorithm to train an agent. However, compared to a reward function, the logical specification transparently represents task structure in a way that can be exploited by the RL algorithm. In particular, specification-guided RL aims to design novel RL algorithms that train agents directly from logical specifications.

In this article, we provide an overview of some recent progress on specification-guided RL. This is an active area of research, with a growing body of literature.1,5,6,9–12,14,17–19,22,24 We first highlight surprising negative theoretical results that provide limits on specification-guided RL in the infinite horizon setting. Our discussion of these results provides an insight into some key differences between logical specifications and reward functions in the context of RL (curious readers can find formal proofs and further details in the cited references). In practice, it often suffices to require that the agent completes the assigned task within a specified finite number of steps. This finite horizon setting offers opportunities for positive results and practical algorithms. As a representative example, we describe DIRL, an algorithm that combines conventional reward-based RL with high-level planning to efficiently learn policies for complex tasks. Rather than crafting a single reward function for the entire task, DIRL exploits the structure of the logical specification to automatically infer a set of subtasks. Each subtask is associated with its own reward function and represented as an edge in a task graph, which serves as the foundation for high-level planning. DIRL alternates between planning over the task graph and learning subtask policies. This decomposition allows DIRL to tackle complex tasks with significantly fewer environment interactions compared to baseline methods. Through this integration of symbolic structure and learning, DIRL highlights several advantages of specification-guided RL.

Learning Optimal Policies

The objective of RL is to train the agent to act in an environment to accomplish a specific task—for example, learn what action should be performed in any given state so that the robot will eventually reach a specific room. A key feature of RL is that the environment is assumed to be unknown. Hence, the policy that enables the agent to act must be learned by exploring the environment.

Markov decision processes.  The environment is modeled as a Markov decision process (MDP). Figure 1a shows an MDP with states S = {s0, s1, s2, s3} and actions A = {a1, a2}. Actions cause state transitions according to transition probabilities T : S×A×S → [0,1]. For example, a1 from s0 leads to s1 or s3 with equal probability. The initial state distribution η describes the distribution of states the system initializes in. Here, η is the Dirac distribution at s0—that is, with η(s0) = 1. Mathematically, an MDP M is a tuple consisting of these four componentsᵃ—that is, M = (S, A, T, η). An infinite run of the MDP is an infinite sequence of state-action pairs describing the evolution of the system over time. A possible infinite run in Figure 1a is ρ = s0 a2 s2 a1 s1 a2 s1 a2 ⋯, in which the agent reaches s1 from s0 via s2. Similarly, a finite run of length H describes the evolution of the system for H time steps—for example, ρ = s0 a2 s2 a1 s1 a2 s1 a2 s1 is a finite run of length 4.
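
As a concrete (and partly assumed) rendering of this formalism, the following sketch encodes an MDP in the spirit of Figure 1a as plain Python dictionaries and samples a finite run. The transitions out of s0 follow the article's description; the rows for s1, s2, and s3, and the value of ε, are assumptions filled in from the surrounding discussion.

```python
import random

EPS = 0.25  # assumed probability of moving from s2 to s1

# T[(state, action)] -> list of (next_state, probability); the s0 rows follow
# the article, the rest are assumptions (s1 and s3 are treated as absorbing).
T = {
    ("s0", "a1"): [("s1", 0.5), ("s3", 0.5)],
    ("s0", "a2"): [("s2", 1.0)],
    ("s2", "a1"): [("s2", 1 - EPS), ("s1", EPS)],
    ("s2", "a2"): [("s2", 1.0)],
    ("s1", "a1"): [("s1", 1.0)], ("s1", "a2"): [("s1", 1.0)],
    ("s3", "a1"): [("s3", 1.0)], ("s3", "a2"): [("s3", 1.0)],
}
eta = {"s0": 1.0}  # Dirac initial distribution at s0

def sample_run(policy, horizon, rng):
    """Sample a finite run s0 a0 s1 a1 ... sH of length `horizon`."""
    state = rng.choices(list(eta), weights=list(eta.values()))[0]
    run = [state]
    for _ in range(horizon):
        action = policy(state)
        nexts, probs = zip(*T[(state, action)])
        state = rng.choices(nexts, weights=probs)[0]
        run += [action, state]
    return run

rng = random.Random(0)
print(sample_run(lambda s: "a2" if s == "s0" else "a1", 4, rng))
```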

Figure 1.  Example MDPs.

In RL, T is a priori unknown, and the agent must learn about the environment by taking actions and observing transitions. In Figure 1a, if the agent repeatedly takes action a1 in state s0, it observes that the system transitions to s1 approximately half the time. The goal is to learn a policy π : S → A mapping states to actions. This policy induces a distribution over runs, generated by sampling an initial state from η, and iteratively using π to select actions and sample next states from T. We denote the distribution over infinite runs as Dπ and over finite H-length runs as Dπ^H.

Reward functions.  Traditionally, tasks in RL are specified using a reward function r : S×A → ℝ that assigns rewards to state-action pairs. For example, to reach s1 in Figure 1a, we might set r(s,a) = 1(s=s1), which assigns a reward of 1 if s=s1 and 0 otherwise. For a run ρ, this naturally generates a reward sequence (e.g., 0,1,1,… for a run reaching s1 in one step). For an infinite horizon, RL algorithms typically maximize the expected value of the discounted-sum reward J(π) = E[∑t≥0 γ^t rt], where the random variable rt denotes the reward obtained at time t, and the discount factor 0<γ<1 indicates the rate of diminishing returns. In Figure 1a, under the reward function r(s,a) = 1(s=s1), the policy π1 that chooses action a1 in state s0 will achieve a discounted-sum reward of 0+γ+γ^2+⋯ = γ/(1−γ) with probability 0.5 (when it reaches s1) and 0 with probability 0.5 (when it reaches s3). Hence, J(π1) = γ/(2(1−γ)). However, the policy π2 that chooses action a2 in s0 leads to a discounted-sum reward of 0+⋯+0+γ^(k+2)+γ^(k+3)+⋯ = γ^(k+2)/(1−γ) if s1 is reached in exactly k+2 steps, which occurs with probabilityᵇ (1−ε)^k ε. Thus, the expected discounted-sum rewardᶜ is εγ^2/(1−γ) · ∑k≥0 (1−ε)^k γ^k = εγ^2/[(1−γ)(1+γε−γ)]. If, instead, we are given a finite horizon H, then RL algorithms maximize the expected reward JH(π) = E[∑0≤t<H rt]. Finally, the goal of RL is to learn an optimal policy π* that maximizes J(π) (or JH(π)); we let J* = supπ J(π) (or J* = supπ JH(π)) denote the best possible value.
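
The closed-form values above can be checked numerically. The sketch below estimates the discounted returns of π1 and π2 by Monte Carlo simulation and compares them to γ/(2(1−γ)) and εγ²/[(1−γ)(1+γε−γ)]; the value of ε and the absorbing behavior of s1 and s3 are assumptions consistent with the article's description of Figure 1a.

```python
import random

GAMMA, EPS, STEPS, N = 0.9, 0.25, 100, 20000  # EPS is an assumed value

def rollout(first_action, rng):
    """Discounted return of the policy taking `first_action` in s0."""
    state, ret = "s0", 0.0
    for t in range(STEPS):
        ret += (GAMMA ** t) * (1.0 if state == "s1" else 0.0)  # r = 1(s = s1)
        if state == "s0":
            state = rng.choice(["s1", "s3"]) if first_action == "a1" else "s2"
        elif state == "s2":  # action a1: move to s1 with probability EPS
            state = "s1" if rng.random() < EPS else "s2"
        # s1 and s3 are absorbing (an assumption consistent with the text)
    return ret

rng = random.Random(0)
j1 = sum(rollout("a1", rng) for _ in range(N)) / N
j2 = sum(rollout("a2", rng) for _ in range(N)) / N
print(j1, GAMMA / (2 * (1 - GAMMA)))                                   # both ~4.5
print(j2, EPS * GAMMA**2 / ((1 - GAMMA) * (1 + GAMMA * EPS - GAMMA)))  # both ~6.23
```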

Logical specifications.  A specification φ is a logical formula encoding whether a finite or infinite run of the MDP solves the desired task. For example, the task of reaching s1 can be encoded by the specification φ = Eventually s1. A run that starts in s0 and transitions immediately to s3 will never reach s1, and therefore will not satisfy φ. Logical specifications can be straightforwardly composed to express more complex tasks; for instance, the task of sequentially visiting two locations l1 and l2 can be simply expressed as Eventually l1; Eventually l2. The meaning of a specification is given by the semantics function (denoted ⟦φ⟧) that maps a run ρ to {0,1}, indicating whether or not the run satisfies φ. The objective of specification-guided RL is to learn a policy that maximizes the probability of achieving the task—that is, Jφ(π) = Eρ∼Dπ[⟦φ⟧(ρ)]. In Figure 1a, the policy π1 that chooses action a1 in s0 assigns a probability of 0.5 to each of the two infinite runs it produces (one that visits s1 and one that does not), so Jφ(π1) = 0.5. On the other hand, the policy π2 that chooses action a2 in s0 results in infinitely many trajectories, but all of these trajectories eventually visit s1 given ε>0. Thus, Jφ(π2) = 1. Clearly, π2 is optimal.
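
The semantics function and the objective Jφ can likewise be estimated by simulation. The sketch below evaluates the semantics of Eventually s1 on sampled runs of a Figure 1a-style MDP (with the same assumed dynamics and ε as before), approximating infinite runs by long finite prefixes.

```python
import random

EPS, HORIZON, N = 0.25, 500, 5000  # EPS and the cutoff are assumptions

def sample_states(first_action, rng):
    """States visited under the policy taking `first_action` in s0."""
    state, states = "s0", ["s0"]
    for _ in range(HORIZON):
        if state in ("s1", "s3"):  # absorbing; the prefix decides the outcome
            break
        if state == "s0":
            state = rng.choice(["s1", "s3"]) if first_action == "a1" else "s2"
        else:  # s2 under action a1
            state = "s1" if rng.random() < EPS else "s2"
        states.append(state)
    return states

def sat_eventually_s1(run):
    """[[phi]](rho) for phi = Eventually s1: 1 if the run visits s1, else 0."""
    return 1 if "s1" in run else 0

rng = random.Random(0)
j_phi_1 = sum(sat_eventually_s1(sample_states("a1", rng)) for _ in range(N)) / N
j_phi_2 = sum(sat_eventually_s1(sample_states("a2", rng)) for _ in range(N)) / N
print(j_phi_1, j_phi_2)  # roughly 0.5 and 1.0, matching the analysis above
```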

Infinite Horizon Specifications

This section discusses results on RL from infinite-horizon specifications. These specifications are important as they can express recurring tasks and persistent behaviors that finite-horizon specifications cannot capture. Linear temporal logic is a prominent example of such specifications. For instance, tasks such as “repeatedly monitor an area” are naturally infinite-horizon properties.

Reductions to discounted rewards.  A natural approach to solving an infinite-horizon specification φ is to first compile it into a reward function rφ and then apply standard RL algorithms for discounted rewards that come with guarantees and have state-of-the-art implementations readily available. However, recent work2 has proven that without knowledge of the environment model, there are specifications for which no discounted reward function r exists (for any discount factor γ) such that the policies maximizing r will also maximize the probability of satisfying φ. This impossibility result has profound implications: There are specifications that simply cannot be solved by the strategy of reduction to reward functions. This result arises due to a fundamental mismatch between how discounted rewards and logical specifications measure task completion—discounted rewards inherently care about when the task is completed since future rewards are time-discounted, whereas logical specifications such as Eventually s1 care only about the task eventually being completed.

This mismatch is illustrated by the specification Eventually s1 and the reward function r(s,a) = 1(s=s1) in Figure 1a. Recall that the policy π2, which takes action a2 in state s0, is optimal with regard to Eventually s1. However, this policy can be made suboptimal with regard to r(s,a) = 1(s=s1) for any discount factor γ by choosing an appropriate value of ε.

Recent work shows some promising directions under additional assumptions. For instance, Bozkurt et al.4 and Hahn et al.9 have shown that any LTL specification can be reduced to discounted rewards with an appropriate discount factor that depends on the MDP transitions. More recently, Alur et al.3 have studied the use of discounted LTL, which naturally prioritizes near-term events over distant ones, allowing a direct reduction to discounted rewards.

Theoretical guarantees.  Even if it is impossible to compile logical specifications to discounted-sum rewards, we may still hope that we can devise algorithms that directly learn policies to solve infinite-horizon specifications. Our next result shows that the answer remains negative. To formalize this result, we extend the probably approximately correct (PAC) framework21 to specification-guided RL. Given a confidence level δ>0 and an error tolerance ε>0, an RL algorithm is said to be PAC if, after sampling a sufficient number of transitions (as a function of n, m, δ, ε) in an MDP with n states and m actions, the probability that the learned policy π is ε-close to optimal (i.e., J* − J(π) ≤ ε) is at least 1−δ. Note that the required sample size does not depend on the transition probabilities, allowing agents to determine termination without knowing the true environment dynamics.

While algorithms with PAC guarantees exist for discounted-sum rewards, surprisingly, they are impossible for even simple reachability specifications, such as Eventually g (where g is a goal state). To understand why, consider the MDP shown in Figure 1b and the specification Eventually s1. When p1=1 and p2<1, the optimal policy chooses action a2 in state s0, since the probability of eventually reaching s1 from s2 is 1, whereas the probability of reaching s1 from s0 is 0 if the agent chooses action a1 in state s0. Similarly, when p1<1 and p2=1, the optimal action at state s0 is a1. Note that altering the transition probabilities slightly can significantly impact the optimal policy. For example, if p1=1 and p2=1−x for an infinitesimally small x, altering the values to p1′=p1−x=1−x and p2′=p2+x=1 causes the optimal action at s0 to change from a2 to a1. Furthermore, an optimal policy in the original MDP (with p1=1) that reaches s1 with probability 1 will attain a zero probability of reaching s1 in the new MDP. This example illustrates that the specification Eventually s1 is not robust—small changes to the MDP can drastically affect the corresponding optimal policy. In contrast, discounted-sum rewards are robust in the sense that altering the transition probabilities by a small amount only impacts the value J(π) of a policy π by a small amount (more precisely, J(π) is a continuous function of the MDP transitions and rewards).
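
The discontinuity can be seen in a few lines. This sketch assumes one plausible reading of Figure 1b (p1 and p2 as self-loop probabilities, with the complementary probability leading to s1): the probability of eventually escaping a self-loop is 1 unless the loop probability is exactly 1.

```python
# Sketch of the non-robustness argument; the loop structure is an assumption.
def reach_prob(loop_p):
    """Probability of eventually reaching s1 from inside a self-loop."""
    return 0.0 if loop_p == 1.0 else 1.0  # geometric escape unless stuck

def optimal_action(p1, p2):
    """Best first action in s0: a1 enters the p1-loop, a2 the p2-loop."""
    return "a1" if reach_prob(p1) > reach_prob(p2) else "a2"

x = 1e-9
print(optimal_action(1.0, 1.0 - x))  # 'a2'
print(optimal_action(1.0 - x, 1.0))  # 'a1': a tiny perturbation flips it
```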

This lack of robustness is the reason PAC algorithms do not exist for specifications. For the sake of contradiction, suppose there is an RL algorithm with a PAC guarantee for Eventually s1. We can run this algorithm for the above two scenarios with slightly different transition probabilities and the same values of δ and ε, with 1−δ > 0.9 and ε < 1. Then, for a sufficiently small x, it is highly likely that the RL algorithm will see the same samples from the two different MDPs. Thus, the RL algorithm will output the same policy π, meaning it must attain a zero probability of reaching s1 in one of the two MDPs. As a result, J* − J(π) = 1 > ε for that MDP, contradicting the PAC property.

Furthermore, Yang et al.23 show that PAC guarantees are possible only for finitary specifications (a specification φ is finitary if there exists a fixed H such that we can decide whether or not an infinite run ρ = (s0, s1, …) of the system satisfies φ by only looking at the first H steps). PAC algorithms also exist under additional assumptions—for example, if a lower bound on all non-zero transition probabilities of the MDP is given.8

Finite Horizon Specifications

While infinite horizon poses many theoretical challenges for specification-guided RL, in practice, it is often sufficient to consider finitary specifications. If we have a fixed finite horizon H and the satisfaction of a specification φ only depends on the first H steps of the MDP, then a straightforward approach is to define a reward function rφ that assigns a reward of 1 after H steps if φ is satisfied and 0 otherwise. It is easy to see that rφ faithfully represents φ, and a policy π that maximizes the expected sum of rewards JH(π) also maximizes the probability of satisfying φ. However, using such sparse reward functions that “rarely” assign non-zero rewards often requires a large number of samples from the environment. For example, consider the warehouse environment with interconnected rooms shown in Figure 3a. This is an environment with continuous state and action spaces. Here, a state is a pair of coordinates s = (x,y) ∈ ℝ² indicating the location of the robot, and actions are velocities a = (vx,vy) ∈ ℝ². The initial distribution over states is η = Uniform(A) (where A is the purple circle). Consider the specification φ requiring the robot to reach the region B starting from any position in A within H steps. If we use the above approach, then rφ assigns the agent a non-zero reward only after it reaches B. Thus, the agent receives no meaningful reward feedback until it has already completed the entire task. Using state-of-the-art RL algorithms, the agent might explore for a long time before receiving such a reward. This problem is exacerbated for more complex specifications. A common solution is to use dense reward functions that assign non-zero rewards more frequently to guide the agent towards solving the task. For example, we can define a reward function rφ(s,a) = −dist(s,B) measuring the (negated) distance between the agent’s current state s and the goal region B, which encourages the robot to move toward B.
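
To illustrate the difference, here is a sketch contrasting the sparse reward for "reach region B" with a dense, distance-shaped alternative; the region geometry and the sample states are invented for illustration.

```python
import math

B_CENTER, B_RADIUS = (8.0, 8.0), 1.0  # hypothetical goal region B

def sparse_reward(s):
    """1 only once the agent is already in B; silent everywhere else."""
    return 1.0 if math.dist(s, B_CENTER) <= B_RADIUS else 0.0

def dense_reward(s):
    """Negative distance to B: improves as the robot moves toward B."""
    return -math.dist(s, B_CENTER)

s_far, s_near = (0.0, 0.0), (6.0, 8.0)
print(sparse_reward(s_far), sparse_reward(s_near))  # 0.0 0.0 -- no guidance
print(dense_reward(s_far) < dense_reward(s_near))   # True -- gradient toward B
```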

Another challenge with encoding specifications using reward functions is that specifications can be non-Markovian or history dependent. However, most existing RL algorithms do not allow rewards to be history dependent—that is, the reward at the current step r(s,a) must be a function of the current state s and the action a, and should be independent of the states visited and the actions taken prior to the current step. For example, suppose φ requires the robot (starting from a state in A) to visit X first, go back to A, and then visit E. In this case, the reward provided to the agent should depend on whether the robot had previously visited X in the current run. If the robot has not yet visited X in the current run, then r(s,a) should be higher if s is in X than if it is in E, since the robot should not be encouraged to go directly to E without visiting X. Conversely, if the robot has already visited X, then r(s,a) should be higher if s is in E than if it is in X, since the robot should not be encouraged to visit X again or to stay in X indefinitely. While such history-dependent tasks are easy to encode using logical specifications, they cannot be directly represented as Markovian rewards.
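
One standard remedy, in the spirit of reward machines, is to augment the state with a small automaton mode recording task progress, which makes the reward Markovian over the augmented state. The sketch below is illustrative; the regions, rewards, and mode encoding are invented.

```python
ORDER = ["X", "A", "E"]                                  # visit X, then A, then E
REGIONS = {"X": {(0, 4)}, "A": {(2, 2)}, "E": {(5, 0)}}  # hypothetical cells

def next_mode(mode, s):
    """Advance the mode when the currently required region is reached."""
    if mode < len(ORDER) and s in REGIONS[ORDER[mode]]:
        return mode + 1
    return mode

def reward(mode, s):
    """Reward only the region required in the current mode."""
    return 1.0 if mode < len(ORDER) and s in REGIONS[ORDER[mode]] else 0.0

print(reward(0, (5, 0)))     # 0.0: visiting E first earns nothing
print(reward(0, (0, 4)))     # 1.0: X is what mode 0 requires
print(next_mode(0, (0, 4)))  # 1: progress is now part of the (augmented) state
```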

There has been a lot of recent work on designing practical specification-guided RL algorithms.1,5,6,10,11,18,22,24 Next, we describe one such algorithm, called DIRL, which is based on the SPECTRL language.

The SPECTRL language.  SPECTRL is a simple specification language that lets users encode tasks such as the ones we discussed in the context of robot navigation.15 Its syntax is given by the grammar

φ := Eventually p | φ Ensuring p | φ ; φ | φ or φ

where p ⊆ S is a predicate representing a subset of the states. In our warehouse example, the region X ⊆ S = ℝ² corresponds to such a predicate. Next, Eventually p denotes eventually (within H steps) reaching a state s ∈ p, and φ Ensuring p denotes performing φ while remaining in the “safe” set p at every step. Furthermore, sequencing φ1; φ2 denotes first performing φ1 and then φ2, and disjunction φ1 or φ2 gives the agent the choice of either performing φ1 or φ2.

In our warehouse example, the task of visiting X while avoiding the obstacle region O (the union of all regions marked in red) can be encoded in SPECTRL as φ1 = (Eventually X) Ensuring ¬O, where ¬O = S∖O denotes the set of all states that do not belong in O. The task of visiting X first and then returning to A before visiting E can be encoded as φ2 = (Eventually X; Eventually A; Eventually E). Finally, φ3 = ((Eventually X or Eventually E); Eventually C) Ensuring ¬O specifies that the robot must first visit either X or E and then visit C, all while avoiding O.
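
As a rough illustration (not the official SPECTRL implementation), the grammar above can be represented as a small AST with a Boolean semantics over finite runs; the sequencing semantics here is a simplified overlapping split of the run, and the region predicates are stand-ins.

```python
from dataclasses import dataclass
from typing import Any, Callable

Pred = Callable[[Any], bool]

@dataclass
class Eventually:
    p: Pred

@dataclass
class Ensuring:
    phi: Any
    p: Pred

@dataclass
class Seq:
    phi1: Any
    phi2: Any

@dataclass
class Or:
    phi1: Any
    phi2: Any

def sat(spec, run):
    """Does the finite run (a list of states) satisfy the specification?"""
    if isinstance(spec, Eventually):
        return any(spec.p(s) for s in run)
    if isinstance(spec, Ensuring):
        return sat(spec.phi, run) and all(spec.p(s) for s in run)
    if isinstance(spec, Seq):  # split the run, overlapping at the split point
        return any(sat(spec.phi1, run[:i + 1]) and sat(spec.phi2, run[i:])
                   for i in range(len(run)))
    if isinstance(spec, Or):
        return sat(spec.phi1, run) or sat(spec.phi2, run)
    raise TypeError(spec)

def region(name):
    return lambda s: s == name

# phi2 from the text: visit X, then A, then E.
phi2 = Seq(Seq(Eventually(region("X")), Eventually(region("A"))),
           Eventually(region("E")))
print(sat(phi2, ["A", "X", "A", "E"]))  # True: X, A, E occur in order
print(sat(phi2, ["A", "E", "X", "A"]))  # False: E comes before X
```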

A specification-guided RL algorithm.  Early work on specification-guided RL focused on compiling the specification to a reward function,15,18,22 including techniques for reward shaping and handling non-Markovian specifications. One effective strategy for reward shaping is to use quantitative semantics for LTL specifications, which assign numerical values to runs of the system that capture how “robustly” the specification is satisfied (while ensuring runs that satisfy the specification are assigned higher values compared to those that do not). To handle history-dependent tasks, one approach is to alter the state space to include relevant information about the history—for example, as a finite state machine called a reward machine.12 However, some tasks can be challenging to learn with these approaches since, even with reward shaping, RL algorithms tend to be myopic, focusing on immediate rewards over long-term ones. Consider our example specification φ3; a typical reward function implementing this task would provide similar rewards for visiting X and E, since these two cannot be distinguished without understanding the structure of the MDP. In the MDP, there is no direct way to reach C from E, so in fact X is preferable to E. Thus, if by chance an RL algorithm learns to solve the initial subtask Eventually X or Eventually E by visiting E instead of X, then it might struggle to reach C.

We describe a more recent specification-guided RL algorithm based on SPECTRL called Dijkstra guided RL (DIRL)16 that addresses this issue by leveraging the structure of the specification in conjunction with knowledge about the MDP gained during training. DIRL consists of two components: using Dijkstra’s algorithm to search for a high-level sequence of subtasks to perform (a.k.a. planning), and using traditional RL algorithms to learn to perform subtasks (a.k.a. learning). The planning component involves computing a shortest path in a directed graph in which edges represent subtasks. The learning component involves, for each subtask, constructing a reward function and using an off-the-shelf RL algorithm to learn a policy to maximize the expected sum of rewards. Instead of performing learning and then planning, DIRL interleaves the two, incorporating information obtained during learning into planning and vice versa.

Figure 2a provides an overview of the algorithm. First, DIRL automatically constructs an abstract graph Gφ = (V,E) representing the high-level structure in the given SPECTRL specification.ᵈ For our example specification φ3, the corresponding abstract graph constructed by DIRL is shown in Figure 2b. Each vertex represents a set of states in the MDP. In this example, the abstract graph has four vertices representing the four regions of our warehouse environment. Each directed edge represents a reachability subtask—namely, the subtask of reaching the set of states represented by the target vertex. For instance, the edge from A to X denotes the subtask of reaching X from A. Each edge is optionally labeled with a path constraint—here, all edges are labeled with the constraint of avoiding the red obstacle regions. Our example graph has two paths from A to C, representing the two ways of solving φ3: (i) first go to X and then go to C, or (ii) first go to E and then go to C. If the agent learns how to perform each subtask along a path in this graph, then the agent can perform the sequence of subtasks dictated by this path to achieve its objective. Thus, the overall problem reduces to two sub-problems corresponding to the two components of DIRL: (i) computing the “best” path in Gφ from the initial vertex v0 to a goal vertex vg (achieved using Dijkstra’s algorithmᵉ), and (ii) learning to perform the subtask corresponding to each edge in that path (achieved using RL).

Figure 2.  Components of DIRL.

To apply Dijkstra’s algorithm to compute the “best” path, we need to associate a cost with each edge in the graph; ideally, this assignment should ensure that the resulting shortest path corresponds to the high-level plan most likely to solve φ. For intuition, suppose that the probability of completing a subtask corresponding to an edge e = u→v ∈ E is independent of the starting state (that is, the specific state in the set of states represented by u), and that the agent has already learned how to complete the subtask corresponding to each edge in the graph. Then, letting pe be the probability of completing the subtask corresponding to edge e, we can define the cost of e to be ce = −log(pe). Given a path e1, e2, …, eℓ, the probability of successfully completing all corresponding subtasks in order is pe1 · pe2 ⋯ peℓ = exp(∑i log(pei)) = exp(−∑i cei). Thus, minimizing the sum of edge costs (that is, computing the shortest path) is equivalent to maximizing the probability of completing φ.
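
This cost construction is easy to sketch. The graph and subtask success probabilities below are illustrative stand-ins for the abstract graph of φ3 (in which C is hard to reach from E); Dijkstra's algorithm on costs ce = −log(pe) then returns the path with the highest product of success probabilities.

```python
import heapq
import math

# u -> list of (v, p_e): assumed success probabilities for each subtask edge.
EDGES = {
    "A": [("X", 0.9), ("E", 0.95)],
    "X": [("C", 0.8)],
    "E": [("C", 0.1)],  # reaching C from E is hard, as in the article's example
    "C": [],
}

def best_path(source, target):
    """Dijkstra with edge costs -log(p_e); returns (path, success probability)."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, p in EDGES[u]:
            nd = d - math.log(p)  # c_e = -log(p_e)
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[target])

path, prob = best_path("A", "C")
print(path, prob)  # ['A', 'X', 'C'], success probability ~0.72 (vs. 0.095 via E)
```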

This strategy could be implemented by first running learning for each subtask and then running Dijkstra’s algorithm, with no feedback between the two. In practice, this approach does not work because our assumption that pe is independent of the starting state does not hold. DIRL uses two ideas to address this issue. First, note that we do not need the costs of all edges a priori. Thus, DIRL can estimate edge costs lazily—that is, it estimates the cost ce of edge e only when it is needed by Dijkstra’s algorithm. Whenever an edge cost needs to be estimated, DIRL constructs a (non-sparse) reward function for the corresponding subtask (using quantitative semantics), uses an off-the-shelf RL algorithm to train a policy to perform that subtask, and then estimates pe using simulations.

Second, a key property of Dijkstra’s algorithm is that when the cost of an edge e = u→v is needed, it has already computed the shortest path from v0 to u; this also implies that DIRL has learned policies to solve the subtasks for the edges in this path. Thus, we can simulate the environment using these policies to obtain samples of states in u (assuming the agent successfully completes the path). This approach not only provides a way to sample initial states for training a subtask policy for e, but also ensures that, if the final path computed by DIRL contains e, this distribution of initial states matches the distribution of states encountered when executing the learned subtask policies.

DIRL has been shown to scale well to complex specifications, including realistic applications such as robotic Pick-and-Place environments, where it outperforms prior approaches while requiring fewer than 4×10⁵ samples. Figures 3b–3e show learning curves of DIRL for some complex specifications in the warehouse environment. Recent work has expanded SPECTRL-based approaches to multi-agent scenarios7,17 and learning verified policies.13,25

Figure 3.  Performance plots showing probability of specification satisfaction (y-axis) across specifications of increasing complexity against number of samples taken to learn (x-axis). X→Y is shorthand for specification to reach region Y from region X after visiting one of the corners in the square/rectangle formed by X and Y.

Conclusion

In this article, we presented a tutorial-style introduction to recent research on using logical specifications to encode RL tasks. In particular, we discussed both theoretical limitations and practical solutions. Using examples, we illustrated key ideas and concepts such as differences between reward functions and logical specifications, specification robustness, graph representations of logical specifications, and integrating high-level planning with learning low-level control actions. We also demonstrated that one can leverage the structure in the logical specification to improve learning. One main caveat is that the user needs to provide an appropriate specification for the task at hand, and the benefits of specification-guided RL are only fully realized when the task has a natural logical structure—for example, if the task is to simply reach a goal region that is challenging to navigate to (or requires complex maneuvers), specification-guided RL provides limited added benefits when compared to using reward functions since the high-level task is easily encoded using rewards. Nonetheless, specification-guided RL shows tremendous promise for many applications of RL with complex task structures. Several promising directions exist for future work, such as handling raw perceptual inputs including challenges regarding evaluating predicates under partial observability, leveraging LLMs to generate formal specifications from natural language descriptions, and zero-shot generalization to new tasks encoded using logical specifications.

    • 1. Aksaray, D. et al. Q-learning for robust satisfaction of signal temporal logic specifications. In Conf. on Decision and Control (CDC). IEEE, 2016, 6565–6570.
    • 2. Alur, R., Bansal, S., Bastani, O., and Jothimurugan, K. A Framework for Transforming Specifications in Reinforcement Learning. Principles of Systems Design: Essays Dedicated to Thomas A. Henzinger on the Occasion of His 60th Birthday (2021).
    • 3. Alur, R. et al. Policy Synthesis and Reinforcement Learning for Discounted LTL. In Computer Aided Verification. Springer Nature Switzerland, Cham, 2023, 415–435.
    • 4. Bozkurt, A.K., Wang, Y., Zavlanos, M.M., and Pajic, M. Control synthesis from linear temporal logic specifications using model-free reinforcement learning. In 2020 IEEE Intern. Conf. on Robotics and Automation (ICRA). IEEE, 2020, 10349–10355.
    • 5. Brafman, R., Giacomo, G.D., and Patrizi, F. LTLf/LDLf non-Markovian rewards. In Proceedings of the AAAI Conf. on Artificial Intelligence, 32, 2018.
    • 6. Giacomo, G.D., Iocchi, L., Favorito, M., and Patrizi, F. Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications. In Proceedings of the Intern. Conf. on Automated Planning and Scheduling, 29, 2019, 128–136.
    • 7. Eappen, J. and Jagannathan, S. DistSPECTRL: Distributing specifications in multi-agent reinforcement learning systems. In Joint European Conf. on Machine Learning and Knowledge Discovery in Databases. Springer, 2022, 233–250.
    • 8. Fu, J. and Topcu, U. Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints. In Robotics: Science and Systems. 2014.
    • 9. Hahn, E.M. et al. Omega-Regular Objectives in Model-Free Reinforcement Learning. In Tools and Algorithms for the Construction and Analysis of Systems. 2019, 395–412.
    • 10. Hasanbeig, M., Abate, A., and Kroening, D. Logically-constrained reinforcement learning. arXiv preprint, arXiv:1801.08099 (2018).
    • 11. Hasanbeig, M. et al. Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees. In Conf. on Decision and Control (CDC). 2019, 5338–5343.
    • 12. Icarte, R.T., Klassen, T., Valenzano, R., and McIlraith, S. Using reward machines for high-level task specification and decomposition in reinforcement learning. In Intern. Conf. on Machine Learning. PMLR, 2018, 2107–2116.
    • 13. Ivanov, R. et al. Compositional Learning and Verification of Neural Network Controllers. ACM Transactions on Embedded Computing Systems (2021).
    • 14. Jiang, Y. et al. Temporal-Logic-Based Reward Shaping for Continuing Learning Tasks. arXiv:2007.01498 [cs.AI] 2020.
    • 15. Jothimurugan, K., Alur, R., and Bastani, O. A Composable Specification Language for Reinforcement Learning Tasks. In Advances in Neural Information Processing Systems, 32. 2019, 13041–13051.
    • 16. Jothimurugan, K., Bansal, S., Bastani, O., and Alur, R. Compositional Reinforcement Learning from Logical Specifications. In Advances in Neural Information Processing Systems. 2021.
    • 17. Jothimurugan, K., Bansal, S., Bastani, O., and Alur, R. Specification-guided learning of Nash equilibria with high social welfare. In Proc. of the Intern. Conf. on Computer Aided Verification (2022), 343–363.
    • 18. Li, X., Vasile, C.I., and Belta, C. Reinforcement learning with temporal logic rewards. In IEEE/RSJ Intern. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2017, 3834–3839.
    • 19. Littman, M.L. et al. Environment-Independent Task Specifications via GLTL. arXiv:1704.04341 [cs.AI] 2017.
    • 20. Pnueli, A. The temporal logic of programs. In 18th Annual Symp. on Foundations of Computer Science. IEEE, 1977, 46–57.
    • 21. Strehl, A.L. et al. PAC model-free reinforcement learning. In Proceedings of the 23rd Intern. Conf. on Machine Learning. 2006, 881–888.
    • 22. Xu, Z. and Topcu, U. Transfer of Temporal Logic Formulas in Reinforcement Learning. In Intern. Joint Conf. on Artificial Intelligence. 2019, 4010–4018.
    • 23. Yang, C., Littman, M., and Carbin, M. Reinforcement Learning for General LTL Objectives Is Intractable. arXiv preprint, arXiv:2111.12679 (2021).
    • 24. Yuan, L.Z., Hasanbeig, M., Abate, A., and Kroening, D. Modular deep reinforcement learning with temporal logic specifications. arXiv preprint, arXiv:1909.11591 (2019).
    • 25. Žikelić, Ð. et al. Compositional policy learning in stochastic control systems with formal guarantees. Advances in Neural Information Processing Systems 36, (2024).
    • a. It is common to have a reward function as a component of the MDP, but we intentionally separate the two here.
    • b. Probability of staying in s2 for k steps times the probability of transitioning to s1 from s2.
    • c. ∑k≥0 (Probability of reaching s1 in k+2 steps) × (Discounted-sum reward for reaching s1 in k+2 steps).
    • d. We omit the procedure for constructing the abstract graph here; see Jothimurugan et al.16 for details.
    • e. The abstract graph is always finite, and DIRL is applicable in environments with continuous state and action spaces.
    • The views expressed herein are solely the views of the author(s) and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates. They are not intended to provide, and should not be relied upon for, investment advice.
