In reinforcement learning (RL), an agent learns to achieve its goal by interacting with its environment and learning from feedback about its successes and failures. This feedback is typically encoded as numerical rewards—for example, positive rewards for beneficial outcomes and negative rewards for harmful outcomes. Traditional RL algorithms assume the reward is provided by the user, and designing an effective reward function can be a daunting task. For example, consider a robot organizing boxes in a warehouse. To ensure the robot successfully moves boxes to the right locations while avoiding collisions, we might provide positive rewards for moving boxes in the right directions and negative rewards for collisions. However, there is a wide variety of ways to implement such a reward function: Should rewards be given immediately after moving a box or after multiple successful moves? How should failures such as incorrect placements or collisions be handled? There is also the question of balancing the objective with safety constraints—if the warehouse robot receives rewards solely for obstacle avoidance, it might focus on safety and never move, whereas overweighting the objective may lead it to take dangerous shortcuts. Beyond the question of balancing the objective with safety constraints, the reward function may also need to introduce new variables to keep track of progress—for example, which rooms still have boxes that need to be cleared.
The art of designing an effective reward function is called reward engineering. The challenge is that reward engineering conflates the problem of task specification (i.e., what is the task to be performed?) with the problem of reward shaping (i.e., what rewards best guide the robot to solve the task?). Recent work has proposed to separate these concerns by using logical specifications to encode the desired behavior of an RL agent, which can automatically be compiled into shaped rewards to train the agent. Temporal logics such as linear temporal logic (LTL)20 are particularly suitable for specifying RL tasks, as they provide rigorous syntax and semantics for reasoning about sequences of events and states over time. For instance, instead of crafting numerical rewards, the desired behavior of the warehouse robot can be encoded as a sequence of logical steps: (1) Go to object while avoiding obstacles, (2) Pick up object, (3) Go to target location while avoiding obstacles, and (4) Place object. Each step can be encoded using logical predicates. If g denotes a target location and o denotes an obstacle location, then the requirement of safely visiting g can be written in LTL as Eventually(g) ∧ Always(¬o), where Eventually and Always are temporal operators.
The key question then is how to train an agent to satisfy a given logical specification. One strategy would be to compile the specification into a shaped reward function, which can then be used in conjunction with a traditional RL algorithm to train an agent. However, compared to a reward function, the logical specification transparently represents task structure in a way that can be exploited by the RL algorithm. In particular, specification-guided RL aims to design novel RL algorithms that train agents directly from logical specifications.
In this article, we provide an overview of some recent progress on specification-guided RL. This is an active area of research, with a growing body of literature.1,5,6,9–12,14,17–19,22,24 We first highlight surprising negative theoretical results that provide limits on specification-guided RL in the infinite horizon setting. Our discussion of these results provides an insight into some key differences between logical specifications and reward functions in the context of RL (curious readers can find formal proofs and further details in the cited references). In practice, it often suffices to require that the agent completes the assigned task within a specified finite number of steps. This finite horizon setting offers opportunities for positive results and practical algorithms. As a representative example, we describe DIRL, an algorithm that combines conventional reward-based RL with high-level planning to efficiently learn policies for complex tasks. Rather than crafting a single reward function for the entire task, DIRL exploits the structure of the logical specification to automatically infer a set of subtasks. Each subtask is associated with its own reward function and represented as an edge in a task graph, which serves as the foundation for high-level planning. DIRL alternates between planning over the task graph and learning subtask policies. This decomposition allows DIRL to tackle complex tasks with significantly fewer environment interactions compared to baseline methods. Through this integration of symbolic structure and learning, DIRL highlights several advantages of specification-guided RL.
Learning Optimal Policies
The objective of RL is to train the agent to act in an environment to accomplish a specific task—for example, learn what action should be performed in any given state so that the robot will eventually reach a specific room. A key feature of RL is that the environment is assumed to be unknown. Hence, the policy that enables the agent to act must be learned by exploring the environment.
Markov decision processes. The environment is modeled as a Markov decision process (MDP). Figure 1a shows an MDP with states S = {s0, s1, s2} and actions A = {a, b}. Actions cause state transitions according to transition probabilities P(s′ | s, a). For example, taking action b from s0 leads to s1 or s2 with equal probability. The initial state distribution η describes the distribution of states the system initializes in. Here, η is the Dirac distribution at s0—that is, η(s0) = 1 and η(s) = 0 for all other states s. Mathematically, an MDP is a tuple consisting of these four componentsa—that is, M = (S, A, P, η). An infinite run of the MDP is an infinite sequence of state-action pairs describing the evolution of the system over time. A possible infinite run in Figure 1a is (s0, b)(s2, b)(s2, b)…, in which the agent reaches s2 from s0 via action b. Similarly, a finite run of length t describes the evolution of the system for t time steps—for example, (s0, a)(s0, a)(s1, a)(s1, a) is a finite run of length 4.
In RL, P is a priori unknown, and the agent must learn about the environment by taking actions and observing transitions. In Figure 1a, if the agent repeatedly takes action b in state s0, it observes that the system transitions to s1 approximately half the time. The goal is to learn a policy π mapping states to actions. This policy induces a distribution over runs, generated by sampling an initial state from η and iteratively using π to select actions and sample next states from P. We denote the distribution over infinite runs as D^π and over finite t-length runs as D^π_t.
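To make the setup concrete, here is a minimal Python sketch of an MDP and policy rollouts, under one plausible reading of the Figure 1a example (the exact transition structure is our assumption): states s0, s1, s2 with s1 and s2 absorbing, action a either staying in s0 or moving to s1, and action b moving to s1 or s2, each with probability 1/2.

```python
import random

# Hypothetical encoding of the Figure 1a example as we read it: states s0, s1, s2
# (s1 and s2 absorbing); in s0, action "a" stays in s0 or moves to s1 with
# probability 1/2 each, while action "b" moves to s1 or s2 with probability 1/2 each.
P = {
    ("s0", "a"): [("s0", 0.5), ("s1", 0.5)],
    ("s0", "b"): [("s1", 0.5), ("s2", 0.5)],
    ("s1", "a"): [("s1", 1.0)], ("s1", "b"): [("s1", 1.0)],
    ("s2", "a"): [("s2", 1.0)], ("s2", "b"): [("s2", 1.0)],
}

def step(state, action, rng):
    """Sample a successor from the transition distribution P(. | state, action)."""
    succs, probs = zip(*P[(state, action)])
    return rng.choices(succs, weights=probs)[0]

def rollout(policy, horizon, rng):
    """Generate a finite run (a sequence of state-action pairs) starting in s0."""
    run, state = [], "s0"
    for _ in range(horizon):
        action = policy(state)
        run.append((state, action))
        state = step(state, action, rng)
    return run

rng = random.Random(0)
pi_a = lambda s: "a"  # the policy that always chooses action a
runs = [rollout(pi_a, 20, rng) for _ in range(2000)]
reach_frac = sum(any(s == "s1" for s, _ in r) for r in runs) / len(runs)
```

Sampling many rollouts of the always-a policy shows it reaches s1 in essentially every run, which is exactly the kind of statistical knowledge an RL agent must gather, since P itself is hidden from it.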
Reward functions. Traditionally, tasks in RL are specified using a reward function R that assigns rewards to state-action pairs. For example, to reach s1 in Figure 1a, we might set R(s, a) = 1 if s = s1 and R(s, a) = 0 otherwise. For a run, this naturally generates a reward sequence (e.g., 0, 1, 1, 1, … for a run reaching s1 in one step). For an infinite horizon, RL algorithms typically maximize the expected value of the discounted-sum reward J(π) = E[Σ_{t≥0} γ^t R_t], where the random variable R_t denotes the reward obtained at time t, and the discount factor γ ∈ (0, 1) indicates the rate of diminishing returns. In Figure 1a, under the reward function R, the policy π_b that chooses action b in state s0 will achieve a discounted-sum reward of γ/(1−γ) with probability 0.5 (when it reaches s1) and 0 with probability 0.5 (when it reaches s2). Hence, J(π_b) = γ/(2(1−γ)). However, the policy π_a that chooses action a in s0 leads to a discounted-sum reward of γ^t/(1−γ) if s1 is reached in exactly t steps, which occurs with probabilityb 2^−t. Thus, the expected discounted-sum rewardc is J(π_a) = Σ_{t≥1} 2^−t γ^t/(1−γ) = γ/((2−γ)(1−γ)). If, instead, we are given a finite horizon H, then RL algorithms maximize the expected reward J_H(π) = E[Σ_{t<H} R_t]. Finally, the goal of RL is to learn an optimal policy π* that maximizes J(π) (or J_H(π)); we let J* (or J*_H) denote the best possible value.
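These discounted-return computations can be checked numerically. The sketch below assumes our reading of Figure 1a (action a stays in s0 with probability p = 1/2, otherwise moves to the absorbing goal s1, which yields reward 1 at every step; action b moves to s1 or s2 with equal probability); the closed-form expressions are our own derivation under those assumptions.

```python
import random

GAMMA = 0.9

def estimate_return(choose_a, p=0.5, horizon=200, episodes=4000, seed=0):
    """Monte Carlo estimate of the expected discounted-sum reward from s0,
    under our assumed Figure 1a dynamics (reward 1 whenever the agent is in s1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state = "s0"
        for t in range(horizon):
            if state == "s1":
                total += GAMMA ** t           # reward 1 in s1, discounted by gamma**t
            elif state == "s0":
                if choose_a:
                    state = "s0" if rng.random() < p else "s1"
                else:
                    state = "s1" if rng.random() < 0.5 else "s2"
            # state s2 is an absorbing sink: no reward, no transition needed
    return total / episodes

# Closed forms for p = 1/2: reaching s1 at step t earns gamma**t / (1 - gamma).
J_b = GAMMA / (2 * (1 - GAMMA))             # always play b
J_a = GAMMA / ((2 - GAMMA) * (1 - GAMMA))   # always play a

est_a = estimate_return(True)
est_b = estimate_return(False)
```

With γ = 0.9 both Monte Carlo estimates land close to the closed forms, and J(π_a) > J(π_b), matching the discussion above.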
Logical specifications. A specification φ is a logical formula encoding whether a finite or infinite run of the MDP solves the desired task. For example, the task of reaching s1 can be encoded by the specification φ = Eventually(s1). A run that starts in s0 and transitions immediately to s2 will never reach s1, and therefore will not satisfy φ. Logical specifications can be straightforwardly composed to express more complex tasks; for instance, the task of sequentially visiting two locations g1 and g2 can be simply expressed as (Eventually g1); (Eventually g2). The meaning of a specification is given by the semantics function (denoted ⟦φ⟧) that maps a run to true or false, indicating whether or not the run satisfies φ. The objective of specification-guided RL is to learn a policy that maximizes the probability of achieving the task—that is, J_φ(π) = Pr_{ρ∼D^π}[⟦φ⟧(ρ) = true]. In Figure 1a, the policy π_b that chooses action b in s0 assigns a probability of 0.5 to each of the two infinite runs it produces (one that visits s1 and the other that does not), so J_φ(π_b) = 0.5. On the other hand, the policy π_a that chooses action a in s0 results in infinitely many trajectories, but all of these trajectories eventually visit s1 with probability 1. Thus, J_φ(π_a) = 1. Clearly, π_a is optimal.
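A specification, in contrast to a reward function, is just a Boolean function of the run. A minimal sketch (the combinators and names are ours, not a full temporal logic):

```python
# A minimal sketch of Boolean specification semantics: a specification is a
# function from a (finite) run's state sequence to True/False. The combinators
# below are illustrative, not a full temporal logic.
def eventually(goal):
    """Satisfied iff the run ever visits the goal state."""
    return lambda states: any(s == goal for s in states)

def then(goal1, goal2):
    """Sequencing: reach goal1 first, then reach goal2 at or after that point."""
    def sem(states):
        for i, s in enumerate(states):
            if s == goal1:
                return any(t == goal2 for t in states[i:])
        return False
    return sem

phi = eventually("s1")
```

Estimating J_φ(π) then amounts to averaging this Boolean over sampled runs of the policy.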
Infinite Horizon Specifications
This section discusses results on RL from infinite-horizon specifications. These specifications are important as they can express recurring tasks and persistent behaviors that finite-horizon specifications cannot capture. Linear temporal logic is a prominent example of such specifications. For instance, tasks such as “repeatedly monitor an area” are naturally infinite-horizon properties.
Reductions to Discounted Rewards. A natural approach to solving an infinite-horizon specification is to first compile it into a reward function and then apply standard RL algorithms for discounted rewards that come with guarantees and have state-of-the-art implementations readily available. However, recent work2 has proven that without knowledge of the environment model, there are specifications φ for which no discounted reward function exists (for any discount factor γ) such that the policies maximizing the expected discounted-sum reward will also maximize the probability of satisfying φ. This impossibility result has profound implications: There are specifications that simply cannot be solved by the strategy of reduction to reward functions. This result arises due to a fundamental mismatch between how discounted rewards and logical specifications measure task completion—discounted rewards inherently care about when the task is completed since future rewards are time-discounted, whereas logical specifications such as Eventually(g) care only about the task eventually being completed.
This mismatch is illustrated by the specification φ = Eventually(s1) and the reward function R in Figure 1a. Recall that the policy π_a, which takes action a in state s0, is optimal with regard to φ. However, this policy can be made suboptimal with regard to R for any discount factor γ by choosing an appropriate value of the probability p with which action a keeps the agent in s0: as p approaches 1, reaching s1 is delayed so long that the discounted reward of π_a falls below that of π_b, even though π_a still satisfies φ with probability 1.
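Under our reading of Figure 1a (action a stays in s0 with probability p, else moves to the goal s1; action b reaches s1 immediately with probability 1/2), this crossover can be computed in closed form. The expressions below are our own derivation under those assumptions.

```python
# Spec-optimal vs. reward-optimal in the Figure 1a family of MDPs, with the
# self-loop probability of action a generalized from 1/2 to a parameter p.
# pi_a satisfies "eventually s1" with probability 1 for every p < 1, yet its
# discounted value can be driven below pi_b's by pushing p toward 1.
def J_a(p, gamma):
    # Reach s1 at step t with probability p**(t-1) * (1-p), earning
    # gamma**t / (1-gamma) from then on; summing the geometric series gives:
    return gamma * (1 - p) / ((1 - gamma) * (1 - p * gamma))

def J_b(gamma):
    # Action b reaches s1 immediately with probability 1/2.
    return gamma / (2 * (1 - gamma))
```

For γ = 0.9, π_a is reward-optimal at p = 1/2 but reward-suboptimal at p = 0.99, while its satisfaction probability is 1 in both cases.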
Recent work shows some promising directions under additional assumptions. For instance, Bozkurt et al.4 and Hahn et al.9 have shown that any LTL specification can be reduced to discounted rewards with an appropriate discount factor that depends on the MDP transitions. More recently, Alur et al.3 have studied the use of discounted LTL, which naturally prioritizes near-term events over distant ones, allowing a direct reduction to discounted rewards.
Theoretical guarantees. Even if it is impossible to compile logical specifications to discounted-sum rewards, we may still hope that we can devise algorithms that directly learn policies to solve infinite-horizon specifications. Our next result shows that the answer remains negative. To formalize this result, we extend the probably approximately correct (PAC) framework21 to specification-guided RL. Given a confidence level δ and an error tolerance ε, an RL algorithm is said to be PAC if, after sampling a sufficient number of transitions (as a function of δ, ε, |S|, and |A|) in an MDP with states S and actions A, the probability that the learned policy π is ε-close to optimal (i.e., J_φ(π) ≥ J*_φ − ε) is at least 1 − δ. Note that the required sample size does not depend on the transition probabilities, allowing agents to determine termination without knowing the true environment dynamics.
While algorithms with PAC guarantees exist for discounted-sum rewards, surprisingly, they are impossible for even simple reachability specifications, such as Eventually(g) (where g is a goal state). To understand why, consider the MDP shown in Figure 1b and the specification φ = Eventually(g): in state s0, action a moves the agent to g with probability p (and otherwise remains in s0), whereas action b moves the agent to g with probability q (and otherwise moves to a sink from which g is unreachable). When p > 0 and q = 0, the optimal policy chooses action a in state s0, since the probability of eventually reaching g from s0 is 1, whereas the probability of reaching g from s0 is 0 if the agent chooses action b in state s0. Similarly, when p = 0 and q > 0, the optimal action at state s0 is b. Note that altering the transition probabilities slightly can significantly impact the optimal policy. For example, if p = ε0 and q = 0 for an infinitesimally small ε0 > 0, altering the values to p = 0 and q = ε0 causes the optimal action at s0 to change from a to b. Furthermore, an optimal policy in the original MDP (with p = ε0) that reaches g with probability 1 will attain a zero probability of reaching g in the new MDP. This example illustrates that the specification Eventually(g) is not robust—small changes to the MDP can drastically affect the corresponding optimal policy. In contrast, discounted-sum rewards are robust in the sense that altering the transition probabilities by a small amount only impacts the value of a policy by a small amount (more precisely, J(π) is a continuous function of the MDP transitions and rewards).
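The two nearby scenarios can be tabulated directly. The sketch below encodes our reconstruction of the Figure 1b dynamics (action a retries from s0 forever, action b risks an absorbing sink), so the numbers are illustrative:

```python
# Our reconstruction of the Figure 1b thought experiment: in s0, action a moves
# to the goal g with probability p and otherwise stays in s0 (so it can retry);
# action b moves to g with probability q and otherwise falls into a sink.
def sat_prob(action, p, q):
    """Probability of eventually reaching g when playing the action forever."""
    if action == "a":
        return 1.0 if p > 0 else 0.0  # retry until success, unless p = 0
    return q                          # one shot: succeed now or sink forever

EPS = 1e-9
original = {a: sat_prob(a, EPS, 0.0) for a in "ab"}   # p = eps, q = 0
perturbed = {a: sat_prob(a, 0.0, EPS) for a in "ab"}  # p = 0,   q = eps
```

An infinitesimal perturbation flips which action is optimal, and the previously optimal action's satisfaction probability collapses from 1 to 0.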
This lack of robustness is the reason PAC algorithms do not exist for such specifications. For the sake of contradiction, suppose there is an RL algorithm with a PAC guarantee for φ = Eventually(g). We can run this algorithm on the above two scenarios with slightly different transition probabilities and the same values of δ and ε. Then, for a sufficiently small ε0, it is highly likely that the RL algorithm will see the same samples from the two different MDPs. Thus, the RL algorithm will output the same policy π in both cases, meaning π must attain a zero probability of reaching g in one of the two MDPs. As a result, J_φ(π) < J*_φ − ε for that MDP, which is a contradiction to the PAC property.
Furthermore, Yang et al.23 show that PAC guarantees are possible only for finitary specifications (a specification φ is finitary if there exists a fixed horizon H such that we can decide whether or not an infinite run of the system satisfies φ by only looking at the first H steps). PAC algorithms also exist under additional assumptions—for example, if a lower bound on all non-zero transition probabilities of the MDP is given.8
Finite Horizon Specifications
While the infinite horizon poses many theoretical challenges for specification-guided RL, in practice, it is often sufficient to consider finitary specifications. If we have a fixed finite horizon H and the satisfaction of a specification φ only depends on the first H steps of the MDP, then a straightforward approach is to define a reward function R_φ that assigns a reward of 1 after H steps if φ is satisfied and 0 otherwise. It is easy to see that R_φ faithfully represents φ, and a policy that maximizes the expected sum of rewards also maximizes the probability of satisfying φ. However, using such sparse reward functions that "rarely" assign non-zero rewards often requires a large number of samples from the environment. For example, consider the warehouse environment with interconnected rooms shown in Figure 3a. This is an environment with continuous state and action spaces. Here, a state is a pair of coordinates (x, y) indicating the location of the robot, and actions are velocities (vx, vy). The initial distribution over states is uniform over the region S (the purple circle). Consider the specification requiring the robot to reach the region A starting from any position in S within H steps. If we use the above approach, then R_φ assigns the agent a non-zero reward only after it reaches A. Thus, the agent receives no meaningful reward feedback until it has already completed the entire task. Using state-of-the-art RL algorithms, the agent might explore for a long time before receiving such a reward. This problem is exacerbated for more complex specifications. A common solution is to use dense reward functions that assign non-zero rewards more frequently to guide the agent towards solving the task. For example, we can define a reward function measuring the (negative) distance between the agent's current state and the goal region A, which encourages the robot to move toward A.
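The contrast between the two kinds of reward can be sketched as follows; the circular goal region and its coordinates are placeholders rather than the actual Figure 3a geometry:

```python
import math

# Illustrative sparse vs. dense rewards for a reachability subtask "reach
# region A"; the goal region's center, radius, and shape are made up.
GOAL_CENTER, GOAL_RADIUS = (8.0, 8.0), 1.0

def in_goal(state):
    """Is the (x, y) state inside the circular goal region?"""
    return math.dist(state, GOAL_CENTER) <= GOAL_RADIUS

def sparse_reward(run):
    # Reward granted once, at the end: 1 iff the run ever entered the goal region.
    return 1.0 if any(in_goal(s) for s in run) else 0.0

def dense_reward(state):
    # Shaped feedback at every step: negative distance to the goal region.
    return -max(0.0, math.dist(state, GOAL_CENTER) - GOAL_RADIUS)
```

The sparse variant gives the agent no gradient of feedback until the task is already solved, whereas the dense variant strictly increases as the robot approaches A.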
Another challenge with encoding specifications using reward functions is that specifications can be non-Markovian, or history dependent. However, most existing RL algorithms do not allow rewards to be history dependent—that is, the reward at the current step must be a function of the current state s_t and the action a_t, and should be independent of the states visited and the actions taken prior to the current step. For example, suppose the specification requires the robot (starting from a state in S) to visit A first, go back to S, and then visit B. In this case, the reward provided to the agent should depend on whether the robot has previously visited A in the current run. If the robot has not yet visited A in the current run, then the reward should be higher if s_t is in A than if it is in B, since the robot should not be encouraged to go directly to B without visiting A. Conversely, if the robot has already visited A, then the reward should be higher if s_t is in S than if it is in A, since the robot should not be encouraged to visit A again or to stay in A indefinitely. While such history-dependent tasks are easy to encode using logical specifications, they cannot be directly represented as rewards.
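A common workaround is to track the needed history in extra state. Here is a minimal sketch for the first two legs of such a task (visit A, then return to S); the region predicates are illustrative stand-ins, and the single memory bit plays the role of a tiny reward machine:

```python
# Sketch: make a history-dependent reward Markovian by tracking one extra bit
# of state ("have we visited A yet?"). Region membership tests are stand-ins.
def make_two_leg_reward(in_A, in_S):
    mode = {"visited_A": False}   # the machine's single bit of memory
    def reward(state):
        if not mode["visited_A"]:
            r = 1.0 if in_A(state) else 0.0   # first leg: encourage reaching A
            if in_A(state):
                mode["visited_A"] = True      # flip the mode on first visit to A
            return r
        return 1.0 if in_S(state) else 0.0    # second leg: encourage returning to S
    return reward
```

The same state earns different rewards depending on the history, which a memoryless reward function over raw states cannot express.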
There has been a lot of recent work on designing practical specification-guided RL algorithms.1,5,6,10,11,18,22,24 Next, we describe one such algorithm, called DIRL, which is based on the SPECTRL language.
The SPECTRL language. SPECTRL is a simple specification language that lets users encode tasks such as the ones we discussed in the context of robot navigation.15 Its syntax is given by the grammar
φ ::= achieve b | φ ensuring b | φ; φ | φ or φ
where b is a predicate representing a subset of the states. In our warehouse example, the region A corresponds to such a predicate. Next, achieve b denotes eventually (within H steps) reaching a state satisfying b, and φ ensuring b denotes performing φ while remaining in the "safe" set b at every step. Furthermore, sequencing φ1; φ2 denotes first performing φ1 and then φ2, and disjunction φ1 or φ2 gives the agent the choice of either performing φ1 or φ2.
In our warehouse example, the task of visiting A while avoiding the obstacle region O (the union of all regions marked in red) can be encoded in SPECTRL as achieve A ensuring ¬O, where ¬O denotes the set of all states that do not belong to O. The task of visiting A first and then returning to S before visiting B can be encoded as achieve A; achieve S; achieve B. Finally, ((achieve A or achieve C); achieve B) ensuring ¬O specifies that the robot must first visit either A or C and then visit B, all while avoiding O.
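Such specifications can be represented programmatically as small syntax trees. The following sketch defines a minimal AST for the SPECTRL fragment above (class and field names are our own):

```python
from dataclasses import dataclass
from typing import Any

# A minimal AST for the SPECTRL fragment discussed above; names are ours.
@dataclass
class Achieve:
    pred: str

@dataclass
class Ensuring:
    spec: Any
    pred: str

@dataclass
class Seq:
    first: Any
    second: Any

@dataclass
class Choice:
    left: Any
    right: Any

# "visit A while avoiding O":
visit_A = Ensuring(Achieve("A"), "not O")
# "visit A, return to S, then visit B":
round_trip = Seq(Seq(Achieve("A"), Achieve("S")), Achieve("B"))
# "visit A or C, then visit B, all while avoiding O":
either_then_B = Ensuring(Seq(Choice(Achieve("A"), Achieve("C")), Achieve("B")), "not O")
```

Representing specifications as trees is what later allows an algorithm to analyze their structure, for example to extract subtasks.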
A specification-guided RL algorithm. Early work on specification-guided RL focused on compiling the specification to a reward function,15,18,22 including techniques for reward shaping and handling non-Markovian specifications. One effective strategy for reward shaping is to use quantitative semantics for LTL specifications, which assign numerical values to runs of the system that capture how "robustly" the specification is satisfied (while ensuring that runs satisfying the specification are assigned higher values than those that do not). To handle history-dependent tasks, one approach is to augment the state space with relevant information about the history—for example, in the form of a finite state machine called a reward machine.12 However, some tasks can be challenging to learn with these approaches since, even with reward shaping, RL algorithms tend to be myopic, focusing on immediate rewards over long-term ones. Consider our example specification of visiting A or C and then visiting B; a typical reward function implementing this task would provide similar rewards for visiting A and C, since these two cannot be distinguished without understanding the structure of the MDP. In the warehouse environment, there is no direct way to reach B from C, so A is in fact preferable to C. Thus, if by chance an RL algorithm learns to solve the initial subtask by visiting C instead of A, then it might struggle to reach B.
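Quantitative semantics of the kind mentioned above can be sketched in a few lines; the operators and the one-dimensional predicates below are simplified illustrations of the idea, not SPECTRL's actual definitions:

```python
# A hedged sketch of quantitative semantics: a specification maps a run (here,
# a list of states) to a real-valued robustness score, positive iff the run
# satisfies it; predicates map individual states to reals.
def achieve(pred):
    """Best predicate value attained anywhere along the run."""
    return lambda run: max(pred(s) for s in run)

def ensuring(spec, pred):
    """Robustness of spec, capped by the worst safety-predicate value on the run."""
    return lambda run: min(spec(run), min(pred(s) for s in run))

# One-dimensional illustration: "get within 1 of position 10, while staying at
# least 1 away from an obstacle at position 5".
near_goal = lambda s: 1.0 - abs(s - 10)
clear_of_obstacle = lambda s: abs(s - 5) - 1.0
phi = ensuring(achieve(near_goal), clear_of_obstacle)
```

A run that skirts the obstacle and ends at the goal scores positively, while a run that passes through the obstacle scores negatively, giving the learner a graded signal instead of a bare true/false.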
We describe a more recent specification-guided RL algorithm based on SPECTRL called Dijkstra guided RL (DIRL)16 that addresses this issue by leveraging the structure of the specification in conjunction with knowledge about the MDP gained during training. DIRL consists of two components: using Dijkstra’s algorithm to search for a high-level sequence of subtasks to perform (a.k.a. planning), and using traditional RL algorithms to learn to perform subtasks (a.k.a. learning). The planning component involves computing a shortest path in a directed graph in which edges represent subtasks. The learning component involves, for each subtask, constructing a reward function and using an off-the-shelf RL algorithm to learn a policy to maximize the expected sum of rewards. Instead of performing learning and then planning, DIRL interleaves the two, incorporating information obtained during learning into planning and vice versa.
Figure 2a provides an overview of the algorithm. First, DIRL automatically constructs an abstract graph G representing the high-level structure in the given SPECTRL specification.d For our example specification ((achieve A or achieve C); achieve B) ensuring ¬O, the corresponding abstract graph constructed by DIRL is shown in Figure 2b. Each vertex represents a set of states in the MDP. In this example, the abstract graph has four vertices representing the four regions S, A, B, and C of our warehouse environment. Each directed edge represents a reachability subtask—namely, the subtask of reaching the set of states represented by the target vertex. For instance, the edge from S to A denotes the subtask of reaching A from S. Each edge is optionally labeled with a path constraint—in this example, all edges are labeled with the constraint of avoiding the red obstacle regions. Our example graph has two paths from S to B representing the two ways of solving the specification: (i) first go to A and then go to B, or (ii) first go to C and then go to B. If the agent learns how to perform each subtask along a path in this graph, then the agent can perform the sequence of subtasks dictated by this path to achieve its objective. Thus, the overall problem reduces to two sub-problems corresponding to the two components of DIRL: (i) computing the "best" path in G from the initial vertex to a goal vertex (achieved using Dijkstra's algorithme), and (ii) learning to perform the subtask corresponding to each edge in that path (achieved using RL).
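Our rendering of this abstract graph as a plain adjacency map, with a helper that enumerates the two candidate paths (region names follow the running example; the structure is our reconstruction of Figure 2b):

```python
# Vertices are regions; each edge is a reachability subtask (all edges carry
# the path constraint "avoid O", which we leave implicit here).
abstract_graph = {
    "S": ["A", "C"],  # from the initial region, first reach A or reach C
    "A": ["B"],
    "C": ["B"],
    "B": [],          # goal vertex
}

def paths(graph, src, dst):
    """Enumerate all paths from src to dst in a DAG, as lists of vertices."""
    if src == dst:
        return [[dst]]
    return [[src] + rest for v in graph[src] for rest in paths(graph, v, dst)]
```

Enumerating `paths(abstract_graph, "S", "B")` recovers exactly the two high-level plans described in the text.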
To apply Dijkstra’s algorithm to compute the “best” path, we need to associate a cost to each edge in the graph; ideally, this assignment should ensure that the resulting shortest path corresponds to the high-level plan most likely to solve . For intuition, suppose that the probability of completing a subtask corresponding to an edge is independent of the starting state (that is, the specific state in the set of states represented by ), and that the agent has already learned how to complete the subtask corresponding to each edge in the graph. Then, letting be the probability of completing the subtask corresponding to edge , we can define the cost of to be . Given a path , the probability of successfully completing all corresponding subtasks in order is . Thus, minimizing the sum of edge costs (that is, computing the shortest path) is equivalent to maximizing the probability of completing .
This strategy could be implemented by first running learning for each subtask and then running Dijkstra's algorithm, with no feedback between the two. In practice, this approach does not work because our assumption that p_e is independent of the starting state does not hold. DIRL uses two ideas to address this issue. First, note that we do not need the costs of all edges a priori. Thus, DIRL can estimate edge costs lazily—that is, it estimates the cost of an edge e only when it is needed by Dijkstra's algorithm. Whenever an edge cost needs to be estimated, DIRL constructs a (non-sparse) reward function for the corresponding subtask (using quantitative semantics), uses an off-the-shelf RL algorithm to train a policy to perform that subtask, and then estimates p_e using simulations.
Second, a key property of Dijkstra's algorithm is that when the cost of an edge e = u → v is needed, the algorithm has already computed a shortest path from the initial vertex to u; this also implies that DIRL has learned policies to solve the subtasks along this path. Thus, we can simulate the environment using these policies to obtain samples of states in u (assuming the agent successfully completes the path). This approach not only provides a way to sample initial states for training a subtask policy for e; this distribution of initial states also corresponds to the distribution of states encountered when executing the learned subtask policies, provided the final path computed by DIRL contains e.
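Putting the pieces together, the interleaving can be sketched as a lazy Dijkstra loop in which a stub stands in for subtask training; the extra region D is hypothetical and exists only to show that edges beyond the goal are never trained:

```python
import heapq, math

# A heavily simplified DIRL loop (our sketch): Dijkstra over the abstract
# graph, estimating an edge's success probability lazily, the first time the
# edge is relaxed. train_and_estimate is a stub standing in for "construct a
# shaped reward, train a subtask policy with RL, estimate its success rate".
def train_and_estimate(edge):
    return {("S", "A"): 0.9, ("S", "C"): 0.95, ("A", "B"): 0.8, ("C", "B"): 0.5}[edge]

def dirl(graph, src, dst):
    cache = {}  # edge -> estimated success probability (filled lazily)
    frontier, done = [(0.0, src, [src])], set()
    while frontier:
        cost, u, path = heapq.heappop(frontier)
        if u == dst:
            return path, math.exp(-cost), cache
        if u in done:
            continue
        done.add(u)
        for v in graph.get(u, []):
            if (u, v) not in cache:
                cache[(u, v)] = train_and_estimate((u, v))  # lazy "learning" step
            heapq.heappush(frontier, (cost - math.log(cache[(u, v)]), v, path + [v]))
    return None, 0.0, cache

# "D" is a hypothetical region past the goal; Dijkstra never needs its edge.
graph = {"S": ["A", "C"], "A": ["B"], "C": ["B"], "B": ["D"], "D": []}
best_path, success_prob, trained = dirl(graph, "S", "B")
```

The loop returns the S → A → B plan with estimated success probability 0.72, and the cache shows that the edge into D was never trained, illustrating the laziness.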
DIRL has been shown to scale well to complex specifications, including realistic applications such as robotic Pick-and-Place environments, where it outperforms prior approaches while requiring significantly fewer samples. Figures 3b–3e show learning curves of DIRL for some complex specifications in the warehouse environment. Recent work has expanded SPECTRL-based approaches to multi-agent scenarios7,17 and to learning verified policies.13,25
Conclusion
In this article, we presented a tutorial-style introduction to recent research on using logical specifications to encode RL tasks. In particular, we discussed both theoretical limitations and practical solutions. Using examples, we illustrated key ideas and concepts such as the differences between reward functions and logical specifications, specification robustness, graph representations of logical specifications, and the integration of high-level planning with learning low-level control actions. We also demonstrated that one can leverage the structure of a logical specification to improve learning. One main caveat is that the user needs to provide an appropriate specification for the task at hand, and the benefits of specification-guided RL are only fully realized when the task has a natural logical structure. For example, if the task is simply to reach a single goal region that is challenging to navigate to (or requires complex maneuvers), specification-guided RL provides limited added benefit compared to using reward functions, since the high-level task is easily encoded using rewards. Nonetheless, specification-guided RL shows tremendous promise for many applications of RL with complex task structures. Several promising directions exist for future work, such as handling raw perceptual inputs (including the challenge of evaluating predicates under partial observability), leveraging LLMs to generate formal specifications from natural-language descriptions, and zero-shot generalization to new tasks encoded using logical specifications.



