Specifications from Demonstrations

Prediction, Learning, and Teaching

Image

Marcell Vazquez-Chanlatte

PhD Candidate at UC Berkeley

Committee: Sanjit Seshia, Shankar Sastry, Anca Dragan, Steven Piantadosi
2022-05-09

mjvc.me/phd

What goals & assumptions are implicit in the way we act?

Three types of motivating applications.

Share road with many actors

Image
Illustration: Bryan Christie Design

Three types of motivating applications.

  1. Inference via demonstrations
  2. Pedagogic example generation for user on-boarding
  3. Human AI Collaboration

Detecting Mode Confusion + Unused Features

Image Image
Saifan, et al. "Using formal methods for test case generation according to transition-based coverage criteria." 2015.

Three types of motivating applications.

  1. Inference via demonstrations
  2. Pedagogic example generation for user on-boarding
  3. Human AI Collaboration

Nobody reads the manual...

Image Image
Saifan, et al. "Using formal methods for test case generation according to transition-based coverage criteria." 2015.

Three types of motivating applications.

  1. Inference via demonstrations
  2. Pedagogic example generation for user on-boarding
  3. Human AI Collaboration

Specializing on the fly

Image
Overcooked cooking simulator

Three types of motivating applications.

  1. Inference via demonstrations
  2. Pedagogic example generation for user on-boarding
  3. Human AI Collaboration

Research Stack

Teaching tasks using demonstrations

Learning tasks from demonstrations

Maximum Entropy Planning

Image
"A Model Counter's Guide to Probabilistic Systems.". (preprint)
"Model Checking Finite-Horizon Markov Chains with Probabilistic Inference", CAV `21
"Maximum Causal Entropy Specification Inference from Demonstrations." CAV `20
"Entropy Guided Control Improvisation", RSS `21

Research Stack

Teaching tasks using demonstrations

Learning tasks from demonstrations

Maximum Entropy Planning

Image
"Learning Task Specifications from Demonstrations." NeurIPS `18.
"Demonstration Informed Specification Search." In submission.

Research Stack

Teaching tasks using demonstrations

Learning tasks from demonstrations

Maximum Entropy Planning

Image
"Communicating Compositional and Temporal Specifications by Demonstrations", `18

How to infer a task?

Image

Many different signals to use for inference

Image

Decades of research in learning rewards from demonstrations

Image

Decades of research in learning rewards from demonstrations

Image

Inverse Reinforcement Learning

Given a series of demonstrations, what reward, $r(s)$, best explains the behavior? (Abbeel and Ng 2004)

Consider an agent acting in the following stochastic grid world.

Image

Can try to move up, down, left, right

Image

May slip due to wind

Image

What is the agent trying to do?

Image

Probably trying to reach yellow tiles

Image

Although these actions are surprising under that hypothesis

Image

And isn't it easier to go to this yellow tile any way?

Image

A lot of information from a single incomplete demonstration

Image Image Image

Communication through actions

Image
  1. Essential for interpreting other signals: (Language, Disengagments, Other agents)
    Goodman, et al. "Pragmatic language interpretation as probabilistic inference." TiCS `16
    McPherson, et al. "Modeling supervisor safe sets for improving collaboration in human-robot teams." IROS `18
    Afolabi, et al. "People as sensors: Imputing maps from human actions." IROS `18.

Communication through actions

Image
  1. Essential for interpreting other signals: (Language, Disengagments, Other agents)
    Goodman, et al. "Pragmatic language interpretation as probabilistic inference." TiCS `16
    McPherson, et al. "Modeling supervisor safe sets for improving collaboration in human-robot teams." IROS `18
    Afolabi, et al. "People as sensors: Imputing maps from human actions." IROS `18.
  2. Actions are incredibly diagnostic.
    Dragan, et al, "Legibility and predictability of robot motion." HRI `13.
    Ho, et al, "Showing versus doing: Teaching by demonstration". NIPS `16
    Sadigh, et al. "Planning for autonomous cars that leverage effects on human actions." RSS `16.

How should we represent learned tasks?

Image

Desired Properties

  1. Decouple task from dynamics.
    • Prefer sparse rewards.
    • Support describing temporal tasks.
      Abel, et al. "On the Expressivity of Markov Reward.", NeurIPS `21.
  2. Support composition.

Markovian Rewards couple environment with task

Image

Reach yellow. Avoid red.

"Learning Task Specifications from Demonstrations." NeurIPS `18.

Taking away top left yellow causes error

Image

Reach yellow. Avoid red.

"Learning Task Specifications from Demonstrations." NeurIPS `18.

Solution: Sparse reward + memory

Image

Reach yellow. Avoid red.

"Learning Task Specifications from Demonstrations." NeurIPS `18.

Solution: Sparse reward + memory

Image

Reach yellow. Avoid red.

Key question: What memory?

Solution: Sparse reward + memory

Image

Reach yellow. Avoid red.

Key question: What memory?

But first: How to represent task?

Proposal: Tasks as Boolean Specifications

  1. A (Boolean) specification, $\varphi$, is a set of traces.
    Image
  2. We say $\xi$ satisfies $\varphi$, if $\xi \in \varphi$.
  3. $[\xi \in \varphi]$ is a sparse objective.

Task Specifications derived from Formal Logic, Automata, etc

Image

Will call a collection of task specifications a concept class.

Image

Support incremental learning

Image

Incrementally Learn smaller/simpler rules

Image

Explictly support learning memory

Image

Specifications have our desired properties

Image

Desired Properties

  1. Decouple task from dynamics.
    • Prefer sparse rewards.
    • Support describing temporal tasks.
      Abel, et al. "On the Expressivity of Markov Reward.", NeurIPS `21.
  2. Support composition.

Research Stack

Teaching tasks using demonstrations

Learning tasks from demonstrations

Maximum Entropy Planning

Dynamics and Tasks

Image
"Learning Task Specifications from Demonstrations." NeurIPS `18.
"Demonstration Informed Specification Search." In submission.

recipe

Image

frame as inverse reinforcement learning

Image

place holder

Image

frame as inverse reinforcement learning

Image

frame as inverse reinforcement learning

Image
\[ \max_{\pi} \big(\mathbb{E}_{s_{1:\tau}}\big(\sum_{i=1}^\tau r(s_i)~|~\pi\big)\big) \]
where \[\pi(a~|~s) = \Pr(a~|~s)\]

given a series of demonstrations, what reward, $r(s)$, best explains the behavior? (Abbeel and Ng 2004)

No unique explanatory reward!

\[ \max_{\pi} \Big(\mathbb{E}_{s_{1:\tau}}\big(\sum_{i=1}^\tau r(s_i)~|~\pi\big)\Big) \]

Maximize Causal Entropy

\[ H(A_{1:\tau}~||~S_{1:\tau}) = \sum_{t=1}^T H\Big(A_{t} \mid S_{1:t}\Big) \]

subject to feature statistics.

(Ziebart, et al. 2010)

Focus on reward matching

Maximize Causal Entropy

\[ H(A_{1:\tau}~||~S_{1:\tau}) = \sum_{t=1}^T H\Big(A_{t} \mid S_{1:t}\Big) \]

subject to $\mathbb{E}[r]$

  • Intuition: Don't over commit to any particular strategy.

    \[\Pr(\pi) \propto e^{\lambda\cdot \mathbb{E}[r(\xi)~\mid~ \pi]}\]

  • Formally: Minimize worst-case excess bits to encode episode.

What should the reward be?

Satisfaction is a sparse objective.

Proposal: Use indicator.

\[ r(\xi) \triangleq \begin{cases} 1 & \text{if } \xi \in \varphi\\ 0 & \text{otherwise} \end{cases} \]

\[\mathbb{E}[r] = \Pr(\xi \in \varphi)\]

Assigns a Demonstration Likelihood

Image
\begin{equation} \Pr(\pi) \propto e^{\lambda \cdot \mathbb{E}[ \Pr(\xi \in \varphi)~~\mid~\pi~]} \end{equation}

Proxy policy exponentially favors high value prefixes

Image
\begin{equation} \ln\pi_\lambda(a~|~s_{1:t}) = Q_\lambda(s_{1:t}, a) - V_\lambda(s_{1:t}) \end{equation}

Will revisit.

Reward Learning offers many ways to rank specifications

Image
  1. Inverse Optimal Control
    • Daniel Kasenberg, Matthias Scheutz. "Interpretable apprenticeship learning with temporal logic specifications." CDC `17
    • Glen Chou, Necmiye Ozay, Dmitry Berenson. "Explaining Multi-stage Tasks by Learning Temporal Logic Formulas from Suboptimal Demonstrations." RSS `22
  2. Bayesian Inference
    • Ankit Shah, Pritish Kamath, Julie A. Shah, Shen Li. "Bayesian Inference of Temporal Task Specifications from Demonstrations." NeurIPS `18
    • Hansol Yoon, Sriram Sankaranarayanan. "Predictive Runtime Monitoring for Mobile Robots using Logic-Based Bayesian Intent Inference." ICRA `21
  3. Maximum Entropy Inverse Reinforcement Learning
    • Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K. Ho, and Sanjit A. Seshia. "Learning Task Specifications from Demonstrations." NeurIPS. `18.
    • Marcell Vazquez-Chanlatte, and Sanjit A. Seshia. "Maximum Causal Entropy Specification Inference from Demonstrations." CAV `20

Key question: What memory?

Problem: No gradient to guide search over discrete structure.

Literature focused on Naïve syntactic search

Literature focused on Naïve syntactic search

Image
Vazquez-Chanlatte, et al, "Learning Task Specifications from Demonstrations." NeurIPs`18.

Problems with syntatic search

Image
Vazquez-Chanlatte, et al, "Learning Task Specifications from Demonstrations." NeurIPs`18.
  1. Conflates inductive bias with search efficiency.
  2. When teaching even harder to justify.
  3. Want generic strategy that works with less structured concept classes.
  4. Ignores the demonstrations!

Solution: Sparse reward + memory

Image

Reach yellow. Avoid red.

Key question: What memory?

"Learning Task Specifications from Demonstrations." NeurIPS `18.

Enough memory to seperate good vs bad

Image Image Image
Image

Results in walk through labeled example space

Image Image Image
Image

Guide search by minimizing surprise

Image
\[ h(\varphi) \triangleq \# \text{ of nats to describe demonstrations assuming } \varphi. \]

Guide search by minimizing surprise

Image
\[ h(\varphi) \triangleq -\sum_{i=1}^n \ln \Pr(\text{demo}_i~|~\varphi, \text{dynamics}) \]

Will see that $h$ factors through $R^n$

Image

Will try to pull back changes prescribed by gradient.

Image

Key idea 1: Reframe choices along demonstration

Image
Image

Many ways to deviate from demonstration

Image
Image

How valuable is each deviation?

Image
\begin{equation} \ln\pi_\lambda(a~|~s_{1:t}) = Q_\lambda(s_{1:t}, a) - V_\lambda(s_{1:t}) \end{equation}

Maximum Causal Entropy Policy

\begin{equation} \ln\pi_{\mathbf{\lambda}}(a~|~s_{1:t}) = Q_{\mathbf{\lambda}}(s_{1:t}, a) - V_{\mathbf{\lambda}}(s_{1:t}) \end{equation}
where
\[ Q_{\mathbf{\lambda}}(s_{1:t}, a) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\lambda}}(s_{1:t+1})~|~s_{1:t}, a\right] \]
\[ V_{\mathbf{\lambda}}(s_{1:t}) \triangleq \begin{cases} \ln \sum_{a} e^{Q_{\mathbf{\lambda}}(s_{1:t}, a)} & \text{if } t \neq \tau,\\ \lambda\cdot [s_{1:\tau} \in \varphi] & \text{otherwise}. \end{cases} \]

Find $\lambda$ to match $\Pr(\xi \in \varphi)$.

$\ln\sum_x e^x$ ↦ smax.

How valuable is each deviation?

Image

Each deviation's value summarized by $V_\lambda, Q_\lambda$.

\[ Q_{\mathbf{\lambda}}(s_{1:t}, a) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\lambda}}(s_{1:t+1})~|~s_{1:t}, a\right] \]
\[ V_{\mathbf{\lambda}}(s_{1:t}) \triangleq \begin{cases} \text{smax}_{a} Q(s_{1:t}, a) & \text{if } t \neq \tau,\\ \lambda\cdot [s_{1:\tau} \in \varphi] & \text{otherwise}. \end{cases} \]

Reframe as conforming or deviating

Image

$\mathbb{E}$ and $\text{smax}$ are associative and commutative

\[ Q_{\mathbf{\lambda}}(s_{1:t}, a) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\lambda}}(s_{1:t+1})~|~s_{1:t}, a\right] \]
\[ V_{\mathbf{\lambda}}(s_{1:t}) \triangleq \begin{cases} \text{smax}_{a} Q(s_{1:t}, a) & \text{if } t \neq \tau,\\ \lambda\cdot [s_{1:\tau} \in \varphi] & \text{otherwise}. \end{cases} \]

Two choices along a demonstration

Image

$\mathbb{E}$ and $\text{smax}$ are associative and commutative

\[ Q_{\mathbf{\lambda}}(s_{1:t}, a) \triangleq \mathbb{E}_{s_{1:t+1}}\left[ V_{\mathbf{\lambda}}(s_{1:t+1})~|~s_{1:t}, a\right] \]
\[ V_{\mathbf{\lambda}}(s_{1:t}) \triangleq \begin{cases} \text{smax}_{a} Q(s_{1:t}, a) & \text{if } t \neq \tau,\\ \lambda\cdot [s_{1:\tau} \in \varphi] & \text{otherwise}. \end{cases} \]

Two choices along a demonstration

Image
Image

Two choices along a demonstration

Image
Image
  1. Independent of number of actions and state
  2. Only need transition probabilities in demonstrations.

Key idea 2: View demonstration as a computation tree

Image

Environment nodes average

Image

Agent nodes smoothmax

Image

Demonstrations map to computation graph

Image

Surprise determined by pivot values.

Image
Image Image

Surprise factors through a function over pivot values.

Image
\[ \begin{split} &\widehat{h} : \mathbb{R}^{d} \to \mathbb{R}\\ &h(\varphi) = \widehat{h}(\vec{V}_\varphi) \end{split} \]

Gradient of proxy surprise easy to compute.

Image
\[ \frac{\partial \widehat{h}}{\partial V_k}(\vec{V}) = \sum_{\substack{(i, j) \in E\\i \text{ is ego}}} \#_{(i, j)} \cdot \bigg(\Pr(i \rightsquigarrow k\mid \vec{V}) - \Pr(j \rightsquigarrow k\mid \vec{V})\bigg) \]

Key idea 3: Toggle trace labels to manipulate pivot values.

Each pivot's subtree summarized by V

Image

Each path given binary label by specification

Image

Flipping value monotonically changes subtree value

Image
$V_3 < V_3'$

Can sample from $\pi$ to generate high weight paths.

Flipping value monotonically changes subtree value

Image
Image

Flipping value monotonically changes subtree value

Image
Image

Surprise Guided Sampler

  1. Fix a candidate spec and compute proxy gradient.
  2. Define the pivot distribution, $D$ = softmax $\left (\frac{|\nabla \widehat{h}|}{\beta}\right)$.
  3. Sample a pivot, $k \sim D$ and a path, $\xi$, using the $\pi_\varphi$ such that:
    1. $\xi$ pivots at $k$.
    2. $\nabla_k \widehat{h} > 0 \iff \xi \in \varphi$.
  4. New label given by $\nabla_k \widehat{h} < 0$.
Image

Candidate sampler can be any exact learner

Image
  • DFAs: Heule, Marijn JH, and Sicco Verwer. "Exact DFA identification using SAT solvers." 2010.
  • LTL: Neider, Daniel, and Ivan Gavran. "Learning linear temporal properties." 2018.

Lifted in simulated annealing using example buffer

Image

Lifted in simulated annealing using example buffer

Image

Results in walk through labeled example space

Image Image

Two Experiments

  1. Incremental learning using 1 incomplete unlabeled demo Image
  2. Monolithic learning using 2 complete unlabeled demos. Image

DISS able to quickly find probable task specs

Image Image

Most probable DFA almost matches ground truth.

  1. Incremental learning using 1 incomplete unlabeled demo Image
  2. Monolithic learning using 2 complete unlabeled demos. Image

Research Stack

Teaching tasks using demonstrations

Image

Learning tasks from demonstrations

Maximum Entropy Planning

Dynamics and Tasks

Joint work with Tom Griffiths, Mark Ho, Ameesh Shah, and Sanjit.
Ho, Mark K., et al. "Showing versus doing: Teaching by demonstration." NeurIPs`16

Can generate pedagogic demonstrations

Image

Helps Teach Conditional Rules

Image Image

Recipe to generate pedagogic paths

Tasks induce a description length on demonstrations

Find likely paths given ground truth that have:

  1. Short description length given learned task distribution.
  2. Long description length given ground truth task.
Image

Research Stack

Teaching tasks using demonstrations

Learning tasks from demonstrations

Maximum Entropy Planning

Image
"A Model Counter's Guide to Probabilistic Systems.". (preprint)
"Model Checking Finite-Horizon Markov Chains with Probabilistic Inference", CAV `21
"Maximum Causal Entropy Specification Inference from Demonstrations." CAV `20
"Entropy Guided Control Improvisation", RSS `21
Time Check! Click to skip.

Summary

Image

Random Bit Model

Idea: Model Markov Decision Process as deterministic transition system with access to $n_c$ coin flips.

Image

Note: Principle of maximum causal entropy + finite horizon together are robust to small dynamics mismatches.

Random Bit Model

Next: Assume \(\#(\text{Actions}) = 2^{n_a} \)

Image

Random Bit Model

Next: Assume \(\#(\text{Actions}) = 2^{n_a} \)

Image
\[ \text{Dynamics} : S \times {\{0, 1\}}^{n_a + n_c} \to S \]

Random Bit Model

Unrolling \(\tau\) steps and composing with specification results in a predicate.

Image
\[ \psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\} \]

Random Bit Model

\[ \psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\} \]

Proposal: Represent \(\psi\) as a Binary Decision Diagram with bits in causal order.

Image

Random Bit Model

\[ \psi : {\{0, 1\}}^{\tau\cdot (n_a + n_c)} \to \{0, 1\} \]

Proposal: Represent \(\psi\) as a Binary Decision Diagram with bits in causal order.

Image

Maximum Causal Entropy and BDDs

Q: Can Maximum Entropy Causal Policy be computed on causally ordered BDDs?

Maximum Causal Entropy and BDDs

Q: Can Maximum Entropy Causal Policy be computed on causally ordered BDDs?

A: Yes! Due to:

  1. Associativity of \(\text{smax}\) and \(\mathbb{E}\).
  2. \(\text{smax}(\alpha, \alpha) = \alpha + \ln(2)\)
  3. \(\text{E}(\alpha, \alpha) = \alpha\)

Maximum Causal Entropy and BDDs

  1. Associativity of \(\text{smax}\) and \(\mathbb{E}\).
  2. \(\text{smax}(\alpha, \alpha) = \alpha + \ln(2)\)
  3. \(\text{E}(\alpha, \alpha) = \alpha\)
Image

Maximum Causal Entropy and BDDs

  1. Associativity of \(\text{smax}\) and \(\mathbb{E}\).
  2. \(\text{smax}(\alpha, \alpha) = \alpha + \ln(2)\)
  3. \(\text{E}(\alpha, \alpha) = \alpha\)
Image

Size Bounds

Q: How big can these Causal BDDs be?

\[ |BDD| \leq \overbrace{\underbrace{\tau}_{\text{horizon}} \cdot \big( \log(|A|) + \text{#coins} \big)}^{\text{# inputs}} \cdot \big( \underbrace{|S \times S_\varphi \times A|}_{\text{composed automaton}} \cdot 2^{\text{#coins}} \big) \]

Size Bounds

\[ |BDD| \leq \overbrace{\underbrace{\tau}_{\text{horizon}} \cdot \big( \log(|A|) + \text{#coins} \big)}^{\text{# inputs}} \cdot \big( \underbrace{|S \times S_\varphi \times A|}_{\text{composed automaton}} \cdot 2^{\text{#coins}} \big) \]

Linear in horizon!

Note: Using function composition, can build BDD in polynomial time.

Summary

Image

Talk was biased towards Learning Contributions

  1. Learn task specifications from (un)labeled and (in)complete demonstrations in a MDP.
    Image Image Image
  2. Support incremental and monolithic learning.
    Image Image
  3. Only needs blackbox access to a MaxEnt planner and Concept Identifer.
    Image Image
  • Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K. Ho, and Sanjit A. Seshia. "Learning Task Specifications from Demonstrations." NeurIPS. `18.
  • Marcell Vazquez-Chanlatte, and Sanjit A. Seshia. "Maximum Causal Entropy Specification Inference from Demonstrations." CAV `20
  • Marcell Vazquez-Chanlatte, Ameesh Shah, Gil Lederman, and Sanjit A. Seshia. "Demonstration Informed Specification Search" In submission.

Contributions

  1. Learn task specifications from (un)labeled and (in)complete demonstrations in a MDP.
  2. Support incremental and monolithic learning.
  3. Only needs blackbox access to a MaxEnt planner and Concept Identifer.
  4. Can automatically generate pedagogic demonstrations.
    Image
  5. Efficient MaxEnt planning in symbolic Stochastic Games.
    • Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K. Ho, and Sanjit A. Seshia. "Learning Task Specifications from Demonstrations." NeurIPS. `18.
    • Marcell Vazquez-Chanlatte, and Sanjit A. Seshia. "Maximum Causal Entropy Specification Inference from Demonstrations." CAV `20
    • Marcell Vazquez-Chanlatte, Ameesh Shah, Gil Lederman, and Sanjit A. Seshia. "Demonstration Informed Specification Search" In submission.
    Image Image Image
  6. Exact Probablistic Model Checking of Finite Markov Chains
    • Sebastian Junges, Steven Holtzen, Marcell Vazquez-Chanlatte, Todd Millstein, Guy Van den Broerk, Sanjit A. Seshia. "Model Checking Finite-Horizon Markov Chains with Probabilistic Inference", CAV `21

By no means solved

Image

Clear path to scaling up

Image
  1. Need estimate of policy on prefix tree.
  2. Need a way to sample likely paths from policy.

Plays nicely with approximate methods

Image
  1. Monte Carlo Tree Search (e.g., Smooth Cruiser [1])
  2. Function Approximation (e.g., Soft Actor Critic [2])

Promising preliminary results using Graph Neural Networks

[1] Grill, et al. "Planning in entropy-regularized Markov decision processes and games." NeurIPs`19.
[2] Haarnoja, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." PMLR, `18.

Works with any supervised learner

Image

Can in principle do Decision Trees and Symbolic Automata.

Works with any supervised learner

Image

Could use natural language or other signals for priors.

Working on variant for constraints

Image

Fix the reward and infer constraints on behavior.

Multi-agent (inferred) assume guarantee reasoning underexplored

\[ A \implies G\]

Image

Maximum causal entropy correlated equillibria seem like an interesting model
Ziebart, et al., "Maximum causal entropy correlated equilibria for Markov games." AAAI `10.

Thank you

Image
Committee, Mentors, and Co-Authors (code and papers)

Thank you

Image
Friends and Family

Thank you

Image

Slides: mjvc.me/phd