In reinforcement learning (RL), an agent learns to achieve its goal by interacting with its environment and learning from feedback about its successes and failures. This feedback is typically encoded as numerical rewards—for example, positive rewards for beneficial outcomes and negative rewards for harmful outcomes. Traditional RL algorithms assume the reward is provided by the user, and designing an effective reward function can be a daunting task. For example, consider a robot organizing boxes in a warehouse. To ensure the robot successfully moves boxes to the right locations while avoiding collisions, we might provide positive rewards for moving boxes in the right directions and negative rewards for collisions. However, there is a wide variety of ways to implement such a reward function: Should rewards be given immediately after moving a box or after multiple successful moves? How should failures such as incorrect placements or collisions be handled? There is also the question of balancing the objective with safety constraints—if the warehouse robot receives rewards solely for obstacle avoidance, it might focus on safety and never move, whereas overweighting the objective may lead it to take dangerous shortcuts. Beyond the question of balancing the objective with safety constraints, the reward function may also need to introduce new variables to keep track of progress—for example, which rooms still have boxes that need to be cleared.
The art of designing an effective reward function is called reward engineering. The challenge is that reward engineering conflates the problem of task specification (i.e., what is the task to be performed?) with the problem of reward shaping (i.e., what rewards best guide the robot to solve the task?). Recent work has proposed to separate these concerns by using logical specifications to encode the desired behavior of an RL agent, which can automatically be compiled into shaped rewards to train the agent. Temporal logics such as linear temporal logic (LTL)20 are particularly suitable for specifying RL tasks, as they provide rigorous syntax and semantics for reasoning about sequences of events and states over time. For instance, instead of crafting numerical rewards, the desired behavior of the warehouse robot can be encoded as a sequence of logical steps: (1) Go to object while avoiding obstacles, (2) Pick up object, (3) Go to target location while avoiding obstacles, and (4) Place object. Each step can be encoded using logical predicates. If g denotes a target location and o denotes an obstacle location, then the requirement of safely visiting g can be written in LTL as Eventually(g) ∧ Always(¬o), where Eventually and Always are temporal operators.
The key question then is how to train an agent to satisfy a given logical specification. One strategy would be to compile the specification into a shaped reward function, which can then be used in conjunction with a traditional RL algorithm to train an agent. However, compared to a reward function, the logical specification transparently represents task structure in a way that can be exploited by the RL algorithm. In particular, specification-guided RL aims to design novel RL algorithms that train agents directly from logical specifications.
In this article, we provide an overview of some recent progress on specification-guided RL. This is an active area of research, with a growing body of literature.1,5,6,9–12,14,17–19,22,24 We first highlight surprising negative theoretical results that provide limits on specification-guided RL in the infinite horizon setting. Our discussion of these results provides an insight into some key differences between logical specifications and reward functions in the context of RL (curious readers can find formal proofs and further details in the cited references). In practice, it often suffices to require that the agent completes the assigned task within a specified finite number of steps. This finite horizon setting offers opportunities for positive results and practical algorithms. As a representative example, we describe DIRL, an algorithm that combines conventional reward-based RL with high-level planning to efficiently learn policies for complex tasks. Rather than crafting a single reward function for the entire task, DIRL exploits the structure of the logical specification to automatically infer a set of subtasks. Each subtask is associated with its own reward function and represented as an edge in a task graph, which serves as the foundation for high-level planning. DIRL alternates between planning over the task graph and learning subtask policies. This decomposition allows DIRL to tackle complex tasks with significantly fewer environment interactions compared to baseline methods. Through this integration of symbolic structure and learning, DIRL highlights several advantages of specification-guided RL.
Learning Optimal Policies
The objective of RL is to train the agent to act in an environment to accomplish a specific task—for example, learn what action should be performed in any given state so that the robot will eventually reach a specific room. A key feature of RL is that the environment is assumed to be unknown. Hence, the policy that enables the agent to act must be learned by exploring the environment.
Markov decision processes. The environment is modeled as a Markov decision process (MDP). Figure 1a shows an MDP with states S = {s0, s1, s2} and actions A = {a, b}. Actions cause state transitions according to transition probabilities P(s′ | s, a). For example, taking action b from s0 leads to s1 or s2 with equal probability. The initial state distribution η describes the distribution of states the system initializes in. Here, η is the Dirac distribution at s0—that is, η(s0) = 1 and η(s) = 0 for all other states s. Mathematically, an MDP is a tuple consisting of these four componentsa—that is, M = (S, A, P, η). An infinite run of the MDP is an infinite sequence of state-action pairs describing the evolution of the system over time. A possible infinite run in Figure 1a is (s0, b)(s2, b)(s2, b)…, in which the agent reaches s2 from s0 via action b. Similarly, a finite run of length t describes the evolution of the system for t time steps—for example, (s0, a)(s0, a)(s1, a)(s1, a) is a finite run of length 4.
In RL, P is a priori unknown, and the agent must learn about the environment by taking actions and observing transitions. In Figure 1a, if the agent repeatedly takes action b in state s0, it observes that the system transitions to s1 approximately half the time. The goal is to learn a policy π mapping states to actions. This policy induces a distribution over runs, generated by sampling an initial state from η and iteratively using π to select actions and sample next states from P. We denote the distribution over infinite runs as D^π and over finite t-length runs as D^π_t.
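To make the setup concrete, here is a minimal Python sketch of an MDP and policy rollouts, under one plausible reading of the Figure 1a example (the exact transition structure is our assumption): states s0, s1, s2 with s1 and s2 absorbing, action a either staying in s0 or moving to s1, and action b moving to s1 or s2, each with probability 1/2.

```python
import random

# Hypothetical encoding of the Figure 1a example as we read it: states s0, s1, s2
# (s1 and s2 absorbing); in s0, action "a" stays in s0 or moves to s1 with
# probability 1/2 each, while action "b" moves to s1 or s2 with probability 1/2 each.
P = {
    ("s0", "a"): [("s0", 0.5), ("s1", 0.5)],
    ("s0", "b"): [("s1", 0.5), ("s2", 0.5)],
    ("s1", "a"): [("s1", 1.0)], ("s1", "b"): [("s1", 1.0)],
    ("s2", "a"): [("s2", 1.0)], ("s2", "b"): [("s2", 1.0)],
}

def step(state, action, rng):
    """Sample a successor from the transition distribution P(. | state, action)."""
    succs, probs = zip(*P[(state, action)])
    return rng.choices(succs, weights=probs)[0]

def rollout(policy, horizon, rng):
    """Generate a finite run (a sequence of state-action pairs) starting in s0."""
    run, state = [], "s0"
    for _ in range(horizon):
        action = policy(state)
        run.append((state, action))
        state = step(state, action, rng)
    return run

rng = random.Random(0)
pi_a = lambda s: "a"  # the policy that always chooses action a
runs = [rollout(pi_a, 20, rng) for _ in range(2000)]
reach_frac = sum(any(s == "s1" for s, _ in r) for r in runs) / len(runs)
```

Sampling many rollouts of the always-a policy shows it reaches s1 in essentially every run, which is exactly the kind of statistical knowledge an RL agent must gather, since P itself is hidden from it.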
Reward functions. Traditionally, tasks in RL are specified using a reward function R that assigns rewards to state-action pairs. For example, to reach s1 in Figure 1a, we might set R(s, a) = 1 if s = s1 and R(s, a) = 0 otherwise. For a run, this naturally generates a reward sequence (e.g., 0, 1, 1, 1, … for a run reaching s1 in one step). For an infinite horizon, RL algorithms typically maximize the expected value of the discounted-sum reward J(π) = E[Σ_{t≥0} γ^t R_t], where the random variable R_t denotes the reward obtained at time t, and the discount factor γ ∈ (0, 1) indicates the rate of diminishing returns. In Figure 1a, under the reward function R, the policy π_b that chooses action b in state s0 will achieve a discounted-sum reward of γ/(1−γ) with probability 0.5 (when it reaches s1) and 0 with probability 0.5 (when it reaches s2). Hence, J(π_b) = γ/(2(1−γ)). However, the policy π_a that chooses action a in s0 leads to a discounted-sum reward of γ^t/(1−γ) if s1 is reached in exactly t steps, which occurs with probabilityb 2^−t. Thus, the expected discounted-sum rewardc is J(π_a) = Σ_{t≥1} 2^−t γ^t/(1−γ) = γ/((2−γ)(1−γ)). If, instead, we are given a finite horizon H, then RL algorithms maximize the expected reward J_H(π) = E[Σ_{t<H} R_t]. Finally, the goal of RL is to learn an optimal policy π* that maximizes J(π) (or J_H(π)); we let J* (or J*_H) denote the best possible value.
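These discounted-return computations can be checked numerically. The sketch below assumes our reading of Figure 1a (action a stays in s0 with probability p = 1/2, otherwise moves to the absorbing goal s1, which yields reward 1 at every step; action b moves to s1 or s2 with equal probability); the closed-form expressions are our own derivation under those assumptions.

```python
import random

GAMMA = 0.9

def estimate_return(choose_a, p=0.5, horizon=200, episodes=4000, seed=0):
    """Monte Carlo estimate of the expected discounted-sum reward from s0,
    under our assumed Figure 1a dynamics (reward 1 whenever the agent is in s1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state = "s0"
        for t in range(horizon):
            if state == "s1":
                total += GAMMA ** t           # reward 1 in s1, discounted by gamma**t
            elif state == "s0":
                if choose_a:
                    state = "s0" if rng.random() < p else "s1"
                else:
                    state = "s1" if rng.random() < 0.5 else "s2"
            # state s2 is an absorbing sink: no reward, no transition needed
    return total / episodes

# Closed forms for p = 1/2: reaching s1 at step t earns gamma**t / (1 - gamma).
J_b = GAMMA / (2 * (1 - GAMMA))             # always play b
J_a = GAMMA / ((2 - GAMMA) * (1 - GAMMA))   # always play a

est_a = estimate_return(True)
est_b = estimate_return(False)
```

With γ = 0.9 both Monte Carlo estimates land close to the closed forms, and J(π_a) > J(π_b), matching the discussion above.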
Logical specifications. A specification φ is a logical formula encoding whether a finite or infinite run of the MDP solves the desired task. For example, the task of reaching s1 can be encoded by the specification φ = Eventually(s1). A run that starts in s0 and transitions immediately to s2 will never reach s1, and therefore will not satisfy φ. Logical specifications can be straightforwardly composed to express more complex tasks; for instance, the task of sequentially visiting two locations g1 and g2 can be simply expressed as (Eventually g1); (Eventually g2). The meaning of a specification is given by the semantics function (denoted ⟦φ⟧) that maps a run to true or false, indicating whether or not the run satisfies φ. The objective of specification-guided RL is to learn a policy that maximizes the probability of achieving the task—that is, J_φ(π) = Pr_{ρ∼D^π}[⟦φ⟧(ρ) = true]. In Figure 1a, the policy π_b that chooses action b in s0 assigns a probability of 0.5 to each of the two infinite runs it produces (one that visits s1 and the other that does not), so J_φ(π_b) = 0.5. On the other hand, the policy π_a that chooses action a in s0 results in infinitely many trajectories, but all of these trajectories eventually visit s1 with probability 1. Thus, J_φ(π_a) = 1. Clearly, π_a is optimal.
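A specification, in contrast to a reward function, is just a Boolean function of the run. A minimal sketch (the combinators and names are ours, not a full temporal logic):

```python
# A minimal sketch of Boolean specification semantics: a specification is a
# function from a (finite) run's state sequence to True/False. The combinators
# below are illustrative, not a full temporal logic.
def eventually(goal):
    """Satisfied iff the run ever visits the goal state."""
    return lambda states: any(s == goal for s in states)

def then(goal1, goal2):
    """Sequencing: reach goal1 first, then reach goal2 at or after that point."""
    def sem(states):
        for i, s in enumerate(states):
            if s == goal1:
                return any(t == goal2 for t in states[i:])
        return False
    return sem

phi = eventually("s1")
```

Estimating J_φ(π) then amounts to averaging this Boolean over sampled runs of the policy.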
Infinite Horizon Specifications
This section discusses results on RL from infinite-horizon specifications. These specifications are important as they can express recurring tasks and persistent behaviors that finite-horizon specifications cannot capture. Linear temporal logic is a prominent example of such specifications. For instance, tasks such as “repeatedly monitor an area” are naturally infinite-horizon properties.
Reductions to Discounted Rewards. A natural approach to solving an infinite-horizon specification is to first compile it into a reward function and then apply standard RL algorithms for discounted rewards that come with guarantees and have state-of-the-art implementations readily available. However, recent work2 has proven that without knowledge of the environment model, there are specifications φ for which no discounted reward function exists (for any discount factor γ) such that the policies maximizing the expected discounted-sum reward will also maximize the probability of satisfying φ. This impossibility result has profound implications: There are specifications that simply cannot be solved by the strategy of reduction to reward functions. This result arises due to a fundamental mismatch between how discounted rewards and logical specifications measure task completion—discounted rewards inherently care about when the task is completed since future rewards are time-discounted, whereas logical specifications such as Eventually(g) care only about the task eventually being completed.
This mismatch is illustrated by the specification φ = Eventually(s1) and the reward function R in Figure 1a. Recall that the policy π_a, which takes action a in state s0, is optimal with regard to φ. However, this policy can be made suboptimal with regard to R for any discount factor γ by choosing an appropriate value of the probability p with which action a keeps the agent in s0: as p approaches 1, reaching s1 is delayed so long that the discounted reward of π_a falls below that of π_b, even though π_a still satisfies φ with probability 1.
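Under our reading of Figure 1a (action a stays in s0 with probability p, else moves to the goal s1; action b reaches s1 immediately with probability 1/2), this crossover can be computed in closed form. The expressions below are our own derivation under those assumptions.

```python
# Spec-optimal vs. reward-optimal in the Figure 1a family of MDPs, with the
# self-loop probability of action a generalized from 1/2 to a parameter p.
# pi_a satisfies "eventually s1" with probability 1 for every p < 1, yet its
# discounted value can be driven below pi_b's by pushing p toward 1.
def J_a(p, gamma):
    # Reach s1 at step t with probability p**(t-1) * (1-p), earning
    # gamma**t / (1-gamma) from then on; summing the geometric series gives:
    return gamma * (1 - p) / ((1 - gamma) * (1 - p * gamma))

def J_b(gamma):
    # Action b reaches s1 immediately with probability 1/2.
    return gamma / (2 * (1 - gamma))
```

For γ = 0.9, π_a is reward-optimal at p = 1/2 but reward-suboptimal at p = 0.99, while its satisfaction probability is 1 in both cases.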
Recent work shows some promising directions under additional assumptions. For instance, Bozkurt et al.4 and Hahn et al.9 have shown that any LTL specification can be reduced to discounted rewards with an appropriate discount factor that depends on the MDP transitions. More recently, Alur et al.3 have studied the use of discounted LTL, which naturally prioritizes near-term events over distant ones, allowing a direct reduction to discounted rewards.
Theoretical guarantees. Even if it is impossible to compile logical specifications to discounted-sum rewards, we may still hope that we can devise algorithms that directly learn policies to solve infinite-horizon specifications. Our next result shows that the answer remains negative. To formalize this result, we extend the probably approximately correct (PAC) framework21 to specification-guided RL. Given a confidence level δ and an error tolerance ε, an RL algorithm is said to be PAC if, after sampling a sufficient number of transitions (as a function of δ, ε, |S|, and |A|) in an MDP with states S and actions A, the probability that the learned policy π is ε-close to optimal (i.e., J_φ(π) ≥ J*_φ − ε) is at least 1 − δ. Note that the required sample size does not depend on the transition probabilities, allowing agents to determine termination without knowing the true environment dynamics.
While algorithms with PAC guarantees exist for discounted-sum rewards, surprisingly, they are impossible for even simple reachability specifications, such as Eventually(g) (where g is a goal state). To understand why, consider the MDP shown in Figure 1b and the specification φ = Eventually(g): in state s0, action a moves the agent to g with probability p (and otherwise remains in s0), whereas action b moves the agent to g with probability q (and otherwise moves to a sink from which g is unreachable). When p > 0 and q = 0, the optimal policy chooses action a in state s0, since the probability of eventually reaching g from s0 is 1, whereas the probability of reaching g from s0 is 0 if the agent chooses action b in state s0. Similarly, when p = 0 and q > 0, the optimal action at state s0 is b. Note that altering the transition probabilities slightly can significantly impact the optimal policy. For example, if p = ε0 and q = 0 for an infinitesimally small ε0 > 0, altering the values to p = 0 and q = ε0 causes the optimal action at s0 to change from a to b. Furthermore, an optimal policy in the original MDP (with p = ε0) that reaches g with probability 1 will attain a zero probability of reaching g in the new MDP. This example illustrates that the specification Eventually(g) is not robust—small changes to the MDP can drastically affect the corresponding optimal policy. In contrast, discounted-sum rewards are robust in the sense that altering the transition probabilities by a small amount only impacts the value of a policy by a small amount (more precisely, J(π) is a continuous function of the MDP transitions and rewards).
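The two nearby scenarios can be tabulated directly. The sketch below encodes our reconstruction of the Figure 1b dynamics (action a retries from s0 forever, action b risks an absorbing sink), so the numbers are illustrative:

```python
# Our reconstruction of the Figure 1b thought experiment: in s0, action a moves
# to the goal g with probability p and otherwise stays in s0 (so it can retry);
# action b moves to g with probability q and otherwise falls into a sink.
def sat_prob(action, p, q):
    """Probability of eventually reaching g when playing the action forever."""
    if action == "a":
        return 1.0 if p > 0 else 0.0  # retry until success, unless p = 0
    return q                          # one shot: succeed now or sink forever

EPS = 1e-9
original = {a: sat_prob(a, EPS, 0.0) for a in "ab"}   # p = eps, q = 0
perturbed = {a: sat_prob(a, 0.0, EPS) for a in "ab"}  # p = 0,   q = eps
```

An infinitesimal perturbation flips which action is optimal, and the previously optimal action's satisfaction probability collapses from 1 to 0.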
This lack of robustness is the reason PAC algorithms do not exist for such specifications. For the sake of contradiction, suppose there is an RL algorithm with a PAC guarantee for φ = Eventually(g). We can run this algorithm on the above two scenarios with slightly different transition probabilities and the same values of δ and ε. Then, for a sufficiently small ε0, it is highly likely that the RL algorithm will see the same samples from the two different MDPs. Thus, the RL algorithm will output the same policy π in both cases, meaning π must attain a zero probability of reaching g in one of the two MDPs. As a result, J_φ(π) < J*_φ − ε for that MDP, which is a contradiction to the PAC property.
Furthermore, Yang et al.23 show that PAC guarantees are possible only for finitary specifications (a specification φ is finitary if there exists a fixed horizon H such that we can decide whether or not an infinite run of the system satisfies φ by only looking at the first H steps). PAC algorithms also exist under additional assumptions—for example, if a lower bound on all non-zero transition probabilities of the MDP is given.8
Finite Horizon Specifications
While the infinite horizon poses many theoretical challenges for specification-guided RL, in practice, it is often sufficient to consider finitary specifications. If we have a fixed finite horizon H and the satisfaction of a specification φ only depends on the first H steps of the MDP, then a straightforward approach is to define a reward function R_φ that assigns a reward of 1 after H steps if φ is satisfied and 0 otherwise. It is easy to see that R_φ faithfully represents φ, and a policy that maximizes the expected sum of rewards also maximizes the probability of satisfying φ. However, using such sparse reward functions that "rarely" assign non-zero rewards often requires a large number of samples from the environment. For example, consider the warehouse environment with interconnected rooms shown in Figure 3a. This is an environment with continuous state and action spaces. Here, a state is a pair of coordinates (x, y) indicating the location of the robot, and actions are velocities (vx, vy). The initial distribution over states is uniform over the region S (the purple circle). Consider the specification requiring the robot to reach the region A starting from any position in S within H steps. If we use the above approach, then R_φ assigns the agent a non-zero reward only after it reaches A. Thus, the agent receives no meaningful reward feedback until it has already completed the entire task. Using state-of-the-art RL algorithms, the agent might explore for a long time before receiving such a reward. This problem is exacerbated for more complex specifications. A common solution is to use dense reward functions that assign non-zero rewards more frequently to guide the agent towards solving the task. For example, we can define a reward function measuring the (negative) distance between the agent's current state and the goal region A, which encourages the robot to move toward A.
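The contrast between the two kinds of reward can be sketched as follows; the circular goal region and its coordinates are placeholders rather than the actual Figure 3a geometry:

```python
import math

# Illustrative sparse vs. dense rewards for a reachability subtask "reach
# region A"; the goal region's center, radius, and shape are made up.
GOAL_CENTER, GOAL_RADIUS = (8.0, 8.0), 1.0

def in_goal(state):
    """Is the (x, y) state inside the circular goal region?"""
    return math.dist(state, GOAL_CENTER) <= GOAL_RADIUS

def sparse_reward(run):
    # Reward granted once, at the end: 1 iff the run ever entered the goal region.
    return 1.0 if any(in_goal(s) for s in run) else 0.0

def dense_reward(state):
    # Shaped feedback at every step: negative distance to the goal region.
    return -max(0.0, math.dist(state, GOAL_CENTER) - GOAL_RADIUS)
```

The sparse variant gives the agent no gradient of feedback until the task is already solved, whereas the dense variant strictly increases as the robot approaches A.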
Another challenge with encoding specifications using reward functions is that specifications can be non-Markovian, or history dependent. However, most existing RL algorithms do not allow rewards to be history dependent—that is, the reward at the current step must be a function of the current state s_t and the action a_t, and should be independent of the states visited and the actions taken prior to the current step. For example, suppose the specification requires the robot (starting from a state in S) to visit A first, go back to S, and then visit B. In this case, the reward provided to the agent should depend on whether the robot has previously visited A in the current run. If the robot has not yet visited A in the current run, then the reward should be higher if s_t is in A than if it is in B, since the robot should not be encouraged to go directly to B without visiting A. Conversely, if the robot has already visited A, then the reward should be higher if s_t is in S than if it is in A, since the robot should not be encouraged to visit A again or to stay in A indefinitely. While such history-dependent tasks are easy to encode using logical specifications, they cannot be directly represented as rewards.
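A common workaround is to track the needed history in extra state. Here is a minimal sketch for the first two legs of such a task (visit A, then return to S); the region predicates are illustrative stand-ins, and the single memory bit plays the role of a tiny reward machine:

```python
# Sketch: make a history-dependent reward Markovian by tracking one extra bit
# of state ("have we visited A yet?"). Region membership tests are stand-ins.
def make_two_leg_reward(in_A, in_S):
    mode = {"visited_A": False}   # the machine's single bit of memory
    def reward(state):
        if not mode["visited_A"]:
            r = 1.0 if in_A(state) else 0.0   # first leg: encourage reaching A
            if in_A(state):
                mode["visited_A"] = True      # flip the mode on first visit to A
            return r
        return 1.0 if in_S(state) else 0.0    # second leg: encourage returning to S
    return reward
```

The same state earns different rewards depending on the history, which a memoryless reward function over raw states cannot express.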
There has been a lot of recent work on designing practical specification-guided RL algorithms.1,5,6,10,11,18,22,24 Next, we describe one such algorithm, called DIRL, which is based on the SPECTRL language.
The SPECTRL language. SPECTRL is a simple specification language that lets users encode tasks such as the ones we discussed in the context of robot navigation.15 Its syntax is given by the grammar
φ ::= achieve b | φ ensuring b | φ; φ | φ or φ
where b is a predicate representing a subset of the states. In our warehouse example, the region A corresponds to such a predicate. Next, achieve b denotes eventually (within H steps) reaching a state satisfying b, and φ ensuring b denotes performing φ while remaining in the "safe" set b at every step. Furthermore, sequencing φ1; φ2 denotes first performing φ1 and then φ2, and disjunction φ1 or φ2 gives the agent the choice of either performing φ1 or φ2.
In our warehouse example, the task of visiting A while avoiding the obstacle region O (the union of all regions marked in red) can be encoded in SPECTRL as achieve A ensuring ¬O, where ¬O denotes the set of all states that do not belong to O. The task of visiting A first and then returning to S before visiting B can be encoded as achieve A; achieve S; achieve B. Finally, ((achieve A or achieve C); achieve B) ensuring ¬O specifies that the robot must first visit either A or C and then visit B, all while avoiding O.
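Such specifications can be represented programmatically as small syntax trees. The following sketch defines a minimal AST for the SPECTRL fragment above (class and field names are our own):

```python
from dataclasses import dataclass
from typing import Any

# A minimal AST for the SPECTRL fragment discussed above; names are ours.
@dataclass
class Achieve:
    pred: str

@dataclass
class Ensuring:
    spec: Any
    pred: str

@dataclass
class Seq:
    first: Any
    second: Any

@dataclass
class Choice:
    left: Any
    right: Any

# "visit A while avoiding O":
visit_A = Ensuring(Achieve("A"), "not O")
# "visit A, return to S, then visit B":
round_trip = Seq(Seq(Achieve("A"), Achieve("S")), Achieve("B"))
# "visit A or C, then visit B, all while avoiding O":
either_then_B = Ensuring(Seq(Choice(Achieve("A"), Achieve("C")), Achieve("B")), "not O")
```

Representing specifications as trees is what later allows an algorithm to analyze their structure, for example to extract subtasks.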
A specification-guided RL algorithm. Early work on specification-guided RL focused on compiling the specification to a reward function,15,18,22 including techniques for reward shaping and handling non-Markovian specifications. One effective strategy for reward shaping is to use quantitative semantics for LTL specifications, which assign numerical values to runs of the system that capture how "robustly" the specification is satisfied (while ensuring that runs satisfying the specification are assigned higher values than those that do not). To handle history-dependent tasks, one approach is to augment the state space with relevant information about the history—for example, in the form of a finite state machine called a reward machine.12 However, some tasks can be challenging to learn with these approaches since, even with reward shaping, RL algorithms tend to be myopic, focusing on immediate rewards over long-term ones. Consider our example specification of visiting A or C and then visiting B; a typical reward function implementing this task would provide similar rewards for visiting A and C, since these two cannot be distinguished without understanding the structure of the MDP. In the warehouse environment, there is no direct way to reach B from C, so A is in fact preferable to C. Thus, if by chance an RL algorithm learns to solve the initial subtask by visiting C instead of A, then it might struggle to reach B.
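Quantitative semantics of the kind mentioned above can be sketched in a few lines; the operators and the one-dimensional predicates below are simplified illustrations of the idea, not SPECTRL's actual definitions:

```python
# A hedged sketch of quantitative semantics: a specification maps a run (here,
# a list of states) to a real-valued robustness score, positive iff the run
# satisfies it; predicates map individual states to reals.
def achieve(pred):
    """Best predicate value attained anywhere along the run."""
    return lambda run: max(pred(s) for s in run)

def ensuring(spec, pred):
    """Robustness of spec, capped by the worst safety-predicate value on the run."""
    return lambda run: min(spec(run), min(pred(s) for s in run))

# One-dimensional illustration: "get within 1 of position 10, while staying at
# least 1 away from an obstacle at position 5".
near_goal = lambda s: 1.0 - abs(s - 10)
clear_of_obstacle = lambda s: abs(s - 5) - 1.0
phi = ensuring(achieve(near_goal), clear_of_obstacle)
```

A run that skirts the obstacle and ends at the goal scores positively, while a run that passes through the obstacle scores negatively, giving the learner a graded signal instead of a bare true/false.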
We describe a more recent specification-guided RL algorithm based on SPECTRL called Dijkstra guided RL (DIRL)16 that addresses this issue by leveraging the structure of the specification in conjunction with knowledge about the MDP gained during training. DIRL consists of two components: using Dijkstra’s algorithm to search for a high-level sequence of subtasks to perform (a.k.a. planning), and using traditional RL algorithms to learn to perform subtasks (a.k.a. learning). The planning component involves computing a shortest path in a directed graph in which edges represent subtasks. The learning component involves, for each subtask, constructing a reward function and using an off-the-shelf RL algorithm to learn a policy to maximize the expected sum of rewards. Instead of performing learning and then planning, DIRL interleaves the two, incorporating information obtained during learning into planning and vice versa.
Figure 2a provides an overview of the algorithm. First, DIRL automatically constructs an abstract graph G representing the high-level structure in the given SPECTRL specification.d For our example specification ((achieve A or achieve C); achieve B) ensuring ¬O, the corresponding abstract graph constructed by DIRL is shown in Figure 2b. Each vertex represents a set of states in the MDP. In this example, the abstract graph has four vertices representing the four regions S, A, B, and C of our warehouse environment. Each directed edge represents a reachability subtask—namely, the subtask of reaching the set of states represented by the target vertex. For instance, the edge from S to A denotes the subtask of reaching A from S. Each edge is optionally labeled with a path constraint—in this example, all edges are labeled with the constraint of avoiding the red obstacle regions. Our example graph has two paths from S to B representing the two ways of solving the specification: (i) first go to A and then go to B, or (ii) first go to C and then go to B. If the agent learns how to perform each subtask along a path in this graph, then the agent can perform the sequence of subtasks dictated by this path to achieve its objective. Thus, the overall problem reduces to two sub-problems corresponding to the two components of DIRL: (i) computing the "best" path in G from the initial vertex to a goal vertex (achieved using Dijkstra's algorithme), and (ii) learning to perform the subtask corresponding to each edge in that path (achieved using RL).
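Our rendering of this abstract graph as a plain adjacency map, with a helper that enumerates the two candidate paths (region names follow the running example; the structure is our reconstruction of Figure 2b):

```python
# Vertices are regions; each edge is a reachability subtask (all edges carry
# the path constraint "avoid O", which we leave implicit here).
abstract_graph = {
    "S": ["A", "C"],  # from the initial region, first reach A or reach C
    "A": ["B"],
    "C": ["B"],
    "B": [],          # goal vertex
}

def paths(graph, src, dst):
    """Enumerate all paths from src to dst in a DAG, as lists of vertices."""
    if src == dst:
        return [[dst]]
    return [[src] + rest for v in graph[src] for rest in paths(graph, v, dst)]
```

Enumerating `paths(abstract_graph, "S", "B")` recovers exactly the two high-level plans described in the text.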
To apply Dijkstra’s algorithm to compute the “best” path, we need to associate a cost to each edge in the graph; ideally, this assignment should ensure that the resulting shortest path corresponds to the high-level plan most likely to solve . For intuition, suppose that the probability of completing a subtask corresponding to an edge is independent of the starting state (that is, the specific state in the set of states represented by ), and that the agent has already learned how to complete the subtask corresponding to each edge in the graph. Then, letting be the probability of completing the subtask corresponding to edge , we can define the cost of to be . Given a path , the probability of successfully completing all corresponding subtasks in order is . Thus, minimizing the sum of edge costs (that is, computing the shortest path) is equivalent to maximizing the probability of completing .
This strategy could be implemented by first running learning for each subtask and then running Dijkstra's algorithm, with no feedback between the two. In practice, this approach does not work because our assumption that p_e is independent of the starting state does not hold. DIRL uses two ideas to address this issue. First, note that we do not need the costs of all edges a priori. Thus, DIRL can estimate edge costs lazily—that is, it estimates the cost of an edge e only when it is needed by Dijkstra's algorithm. Whenever an edge cost needs to be estimated, DIRL constructs a (non-sparse) reward function for the corresponding subtask (using quantitative semantics), uses an off-the-shelf RL algorithm to train a policy to perform that subtask, and then estimates p_e using simulations.
Second, a key property of Dijkstra's algorithm is that when the cost of an edge e = u → v is needed, the algorithm has already computed a shortest path from the initial vertex to u; this also implies that DIRL has learned policies to solve the subtasks along this path. Thus, we can simulate the environment using these policies to obtain samples of states in u (assuming the agent successfully completes the path). This approach not only provides a way to sample initial states for training a subtask policy for e; this distribution of initial states also corresponds to the distribution of states encountered when executing the learned subtask policies, provided the final path computed by DIRL contains e.
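Putting the pieces together, the interleaving can be sketched as a lazy Dijkstra loop in which a stub stands in for subtask training; the extra region D is hypothetical and exists only to show that edges beyond the goal are never trained:

```python
import heapq, math

# A heavily simplified DIRL loop (our sketch): Dijkstra over the abstract
# graph, estimating an edge's success probability lazily, the first time the
# edge is relaxed. train_and_estimate is a stub standing in for "construct a
# shaped reward, train a subtask policy with RL, estimate its success rate".
def train_and_estimate(edge):
    return {("S", "A"): 0.9, ("S", "C"): 0.95, ("A", "B"): 0.8, ("C", "B"): 0.5}[edge]

def dirl(graph, src, dst):
    cache = {}  # edge -> estimated success probability (filled lazily)
    frontier, done = [(0.0, src, [src])], set()
    while frontier:
        cost, u, path = heapq.heappop(frontier)
        if u == dst:
            return path, math.exp(-cost), cache
        if u in done:
            continue
        done.add(u)
        for v in graph.get(u, []):
            if (u, v) not in cache:
                cache[(u, v)] = train_and_estimate((u, v))  # lazy "learning" step
            heapq.heappush(frontier, (cost - math.log(cache[(u, v)]), v, path + [v]))
    return None, 0.0, cache

# "D" is a hypothetical region past the goal; Dijkstra never needs its edge.
graph = {"S": ["A", "C"], "A": ["B"], "C": ["B"], "B": ["D"], "D": []}
best_path, success_prob, trained = dirl(graph, "S", "B")
```

The loop returns the S → A → B plan with estimated success probability 0.72, and the cache shows that the edge into D was never trained, illustrating the laziness.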
DIRL has been shown to scale well to complex specifications, including realistic applications such as robotic Pick-and-Place environments, where it outperforms prior approaches while requiring significantly fewer samples. Figures 3b–3e show learning curves of DIRL for some complex specifications in the warehouse environment. Recent work has expanded SPECTRL-based approaches to multi-agent scenarios7,17 and to learning verified policies.13,25
Conclusion
In this article, we presented a tutorial-style introduction to recent research on using logical specifications to encode RL tasks. In particular, we discussed both theoretical limitations and practical solutions. Using examples, we illustrated key ideas and concepts such as the differences between reward functions and logical specifications, specification robustness, graph representations of logical specifications, and the integration of high-level planning with learning low-level control actions. We also demonstrated that one can leverage the structure of a logical specification to improve learning. One main caveat is that the user needs to provide an appropriate specification for the task at hand, and the benefits of specification-guided RL are only fully realized when the task has a natural logical structure. For example, if the task is simply to reach a single goal region that is challenging to navigate to (or requires complex maneuvers), specification-guided RL provides limited added benefit compared to using reward functions, since the high-level task is easily encoded using rewards. Nonetheless, specification-guided RL shows tremendous promise for many applications of RL with complex task structures. Several promising directions exist for future work, such as handling raw perceptual inputs (including the challenge of evaluating predicates under partial observability), leveraging LLMs to generate formal specifications from natural-language descriptions, and zero-shot generalization to new tasks encoded using logical specifications.



