<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Patrick Emami</title>
    <description>Research Scientist at the National Renewable Energy Lab
</description>
    <link>https://pemami4911.github.io/</link>
    <atom:link href="https://pemami4911.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 11 Jun 2025 14:56:26 +0000</pubDate>
    <lastBuildDate>Wed, 11 Jun 2025 14:56:26 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Plug &amp; Play Directed Evolution of Proteins with Gradient-based Discrete MCMC</title>
        <description>&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;a href=&quot;https://pemami4911.github.io&quot;&gt;&lt;strong&gt;Patrick Emami&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://dblp.org/pid/230/7966.html&quot;&gt;Aidan Perreault&lt;/a&gt;, &lt;a href=&quot;https://scholar.google.com/citations?user=N4r7gmAAAAAJ&amp;amp;hl=en&amp;amp;oi=ao&quot;&gt;Jeff Law&lt;/a&gt;, &lt;a href=&quot;https://scholar.google.com/citations?user=971GyxQAAAAJ&amp;amp;hl=en&amp;amp;oi=sra&quot;&gt;David Biagioni&lt;/a&gt;, &lt;a href=&quot;https://pcstj.com/cv/&quot;&gt;Peter C. St. John&lt;/a&gt;.
Machine Learning: Science and Technology, 2023. 
Presented at &lt;a href=&quot;https://www.mlsb.io/&quot;&gt;NeurIPS’22 Workshop on Machine Learning in Structural Biology&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://iopscience.iop.org/article/10.1088/2632-2153/accacd&quot;&gt;[DOI]&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/abs/2212.09925&quot;&gt;[arxiv]&lt;/a&gt; &lt;a href=&quot;https://github.com/pemami4911/ppde&quot;&gt;[Paper code]&lt;/a&gt;   &lt;a href=&quot;https://github.com/NREL/EvoProtGrad&quot;&gt;[EvoProtGrad library]&lt;/a&gt;&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/PPDE/Proteins_qual_1row.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;Visualized on the 3D structure of the wild type, the marginal frequency of each sequence position is the fraction of protein variants in the sampled variant pool that carry a mutation at that position. For example, nearly 100% of discovered UBE4B variants have a mutation at position 1145.&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;announcing-evoprotgrad&quot;&gt;Announcing EvoProtGrad&lt;/h2&gt;

&lt;p&gt;7/21/23: We have released a Python package accompanying this work, with Hugging Face integration to support a wide range of protein language models. Check it out on &lt;a href=&quot;https://github.com/NREL/EvoProtGrad&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;evo_prot_grad
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;quick-demo&quot;&gt;Quick demo&lt;/h3&gt;

&lt;p&gt;Evolve the avGFP protein with &lt;a href=&quot;https://huggingface.co/Rostlab/prot_bert&quot;&gt;Rostlab/prot_bert&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;evo_prot_grad&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;prot_bert_expert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;evo_prot_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_expert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;bert&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;variants&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;evo_prot_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DirectedEvolution&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;wt_fasta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;test/gfp.fasta&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# path to wild type fasta file
&lt;/span&gt;                   &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;best&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;# return best, last, all variants    
&lt;/span&gt;                   &lt;span class=&quot;n&quot;&gt;experts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prot_bert_expert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# list of experts to compose
&lt;/span&gt;                   &lt;span class=&quot;n&quot;&gt;parallel_chains&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;            &lt;span class=&quot;c1&quot;&gt;# number of parallel chains to run
&lt;/span&gt;                   &lt;span class=&quot;n&quot;&gt;n_steps&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                   &lt;span class=&quot;c1&quot;&gt;# number of MCMC steps per chain
&lt;/span&gt;                   &lt;span class=&quot;n&quot;&gt;max_mutations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;             &lt;span class=&quot;c1&quot;&gt;# maximum number of mutations per variant
&lt;/span&gt;                   &lt;span class=&quot;n&quot;&gt;verbose&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;                  &lt;span class=&quot;c1&quot;&gt;# print debug info to command line
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Machine-learning-based directed evolution can be used to engineer new proteins that have improved productivity or catalyze new reactions. Starting from a known protein sequence, directed evolution proceeds by identifying candidate mutations, experimentally verifying them, and then using the gained information to search for better mutations. &lt;a href=&quot;http://www.cheme.caltech.edu/groups/fha/old_website_2011_06_08/Arnold_ACR_1998.pdf&quot;&gt;The intention is for this iterative process to resemble natural evolution&lt;/a&gt;. One way ML can help streamline this design loop is by optimizing the candidate identification step.&lt;/p&gt;
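&lt;p&gt;The design loop above can be sketched in a few lines of Python. Here &lt;code&gt;propose_mutations&lt;/code&gt; and the &lt;code&gt;score&lt;/code&gt; function are hypothetical placeholders for the candidate identification and the wet-lab (or surrogate-model) verification steps:&lt;/p&gt;

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutations(seq, n_candidates=8, rng=random.Random(0)):
    """Hypothetical candidate-identification step: random single substitutions."""
    candidates = []
    for _ in range(n_candidates):
        pos = rng.randrange(len(seq))
        candidates.append(seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:])
    return candidates

def directed_evolution(wild_type, score, n_rounds=5):
    """Greedy directed-evolution loop: keep the best-scoring variant each round."""
    parent = wild_type
    for _ in range(n_rounds):
        candidates = propose_mutations(parent)
        # "experimentally verify" the candidates, then reuse what was learned
        parent = max(candidates + [parent], key=score)
    return parent

# Toy fitness: count of tryptophans, so evolution should enrich for 'W'
evolved = directed_evolution("MKTVRQ", score=lambda s: s.count("W"))
```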

&lt;p&gt;In this paper, we introduce a plug-and-play protein sampler for this step. The sampler mixes and matches &lt;em&gt;unsupervised&lt;/em&gt; evolutionary sequence models (e.g., protein language models) with &lt;em&gt;supervised&lt;/em&gt; sequence models that map sequence to function, without needing any model re-training or fine-tuning. Sampling is performed in sequence space to keep things as “black-box” as possible. Plug-and-play samplers have been successful in other domains, such as for controlled &lt;a href=&quot;https://openaccess.thecvf.com/content_cvpr_2017/html/Nguyen_Plug__Play_CVPR_2017_paper.html&quot;&gt;image&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/1912.02164&quot;&gt;text&lt;/a&gt; generation.&lt;/p&gt;
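&lt;p&gt;Composing experts without retraining amounts to sampling from a product of experts, i.e., summing temperature-weighted log-scores. A minimal sketch, where both expert functions are hypothetical stand-ins:&lt;/p&gt;

```python
def product_of_experts_log_score(seq, experts, temperatures):
    """Unnormalized log-probability of a sequence under a product of experts:
    multiplying expert densities corresponds to adding their log-scores."""
    return sum(t * expert(seq) for expert, t in zip(experts, temperatures))

# Hypothetical stand-ins: an "unsupervised" expert rewarding naturalness and
# a "supervised" expert rewarding predicted function.
unsupervised = lambda s: -0.1 * s.count("P")  # e.g., a language-model log-likelihood
supervised = lambda s: 1.0 * s.count("W")     # e.g., a fitness regressor's prediction

score = product_of_experts_log_score("MWPW", [unsupervised, supervised], [1.0, 2.0])
```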

&lt;p&gt;Searching &lt;em&gt;efficiently&lt;/em&gt; through protein sequence space, which is high-dimensional and discrete, is notoriously difficult. We address this in our MCMC-based sampler by using the &lt;em&gt;gradient&lt;/em&gt; of the (differentiable) target function to bias a mutation proposal distribution towards favorable mutations. We show that this approach outperforms classic random search.&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/PPDE/main.gif&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;
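&lt;p&gt;To illustrate the general idea (a first-order sketch, not the exact sampler from the paper), the gradient at the current one-hot sequence gives a Taylor estimate of how the target would change under every single-site mutation, which can be turned into a softmax proposal distribution. The linear target here is a hypothetical stand-in for a differentiable fitness model:&lt;/p&gt;

```python
import numpy as np

def gradient_proposal(x_onehot, grad, temp=1.0):
    """Bias single-site mutation proposals with a first-order Taylor estimate.
    x_onehot: (L, V) one-hot sequence; grad: (L, V) gradient of the target at x.
    The estimated change from mutating position i to token v is
    grad[i, v] - grad[i, current token at i]."""
    current = (grad * x_onehot).sum(axis=1, keepdims=True)
    delta = (grad - current) / temp            # (L, V) estimated improvements
    delta = delta - delta.max()                # numerical stability
    probs = np.exp(delta) * (1.0 - x_onehot)   # never propose the current token
    return probs / probs.sum()

# Hypothetical linear target f(x) = (W * x).sum(), whose gradient is simply W
rng = np.random.default_rng(0)
L, V = 4, 5
W = rng.normal(size=(L, V))
x = np.eye(V)[rng.integers(V, size=L)]        # random one-hot sequence
probs = gradient_proposal(x, grad=W)
```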

&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Improving the quality of candidate protein variants reduces the time spent conducting costly experiments in the wet lab and accelerates the engineering of new proteins.&lt;/p&gt;

&lt;h2 id=&quot;checking-intuitions-with-a-toy-binary-mnist-experiment&quot;&gt;Checking intuitions with a toy binary MNIST experiment&lt;/h2&gt;

&lt;p&gt;The task is to evolve the MNIST digit on the right, initially a 1, to maximize the sum of the two digits &lt;em&gt;solely by flipping binary pixels&lt;/em&gt;. Since the digit on the left is a 9, the initial sum of the digits is 10. The maximum is 18 (9 + 9).&lt;/p&gt;
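&lt;p&gt;Concretely, a move in this search space flips one binary pixel of the right digit, and the objective is the predicted sum. A toy sketch, with a hypothetical &lt;code&gt;predict_digit&lt;/code&gt; standing in for the trained regressor:&lt;/p&gt;

```python
import numpy as np

def predicted_sum(right_digit, left_value, predict_digit):
    """Objective for the toy task: the left digit is fixed (here a 9);
    only the right digit's binary pixels may be flipped."""
    return left_value + predict_digit(right_digit)

def flip_pixel(image, i, j):
    """One move in the search space: flip a single binary pixel."""
    out = image.copy()
    out[i, j] = 1 - out[i, j]
    return out

# Hypothetical stand-in for the ConvNet ensemble's prediction
predict_digit = lambda img: min(9.0, img.sum() / 10.0)

img = np.zeros((28, 28), dtype=np.int64)
flipped = flip_pixel(img, 0, 0)
score = predicted_sum(flipped, left_value=9, predict_digit=predict_digit)
```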

&lt;p&gt;Here’s our sampler applied to a product of experts distribution that combines an unsupervised energy-based model (EBM) with a supervised ensemble of ConvNets trained to regress the sum.&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/PPDE/PPDE_DAE_20fps.gif&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;If we remove the EBM from the product of experts, the search gets trapped in &lt;a href=&quot;https://arxiv.org/abs/1312.6199&quot;&gt;adversarial optima&lt;/a&gt;—white-noise images. This is because, in this case, gradient-based discrete MCMC is only using the gradient of the ConvNet ensemble to steer search. The EBM acts as a soft constraint that keeps search away from such optima.&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/PPDE/PPDE_None_20fps.gif&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;If we replace gradient-based discrete MCMC with random-search-based simulated annealing, nearly every randomly sampled mutation is rejected as it does not increase the predicted sum.&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/PPDE/simulated_annealing_20fps.gif&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;
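&lt;p&gt;For reference, the random-search baseline proposes mutations uniformly and filters them with the standard Metropolis acceptance rule under a cooling schedule. A generic simulated-annealing sketch (not the paper’s exact configuration):&lt;/p&gt;

```python
import math
import random

def simulated_annealing_step(x, score, propose, temp, rng):
    """One step of random search with the Metropolis rule: accept a proposal
    with probability min(1, exp((score(x_new) - score(x)) / temp))."""
    x_new = propose(x, rng)
    delta = score(x_new) - score(x)
    if delta >= 0 or rng.random() < math.exp(delta / temp):
        return x_new
    return x

def anneal(x, score, propose, n_steps=200, t_start=1.0, t_end=0.01):
    rng = random.Random(0)
    for i in range(n_steps):
        temp = t_start * (t_end / t_start) ** (i / (n_steps - 1))  # geometric cooling
        x = simulated_annealing_step(x, score, propose, temp, rng)
    return x

def flip_one_bit(bits, rng):
    """Uniformly random proposal: flip a single bit."""
    j = rng.randrange(len(bits))
    return bits[:j] + [1 - bits[j]] + bits[j + 1:]

# Toy example on bit strings: maximize the number of ones
best = anneal([0] * 12, score=sum, propose=flip_one_bit)
```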

&lt;h2 id=&quot;in-silico-directed-evolution-experiments&quot;&gt;In-silico directed evolution experiments&lt;/h2&gt;

&lt;p&gt;We conduct in-silico directed evolution experiments on 3 proteins with partially characterized wide fitness landscapes (2 to 15 mutations from wild type): &lt;a href=&quot;https://www.uniprot.org/uniprot/P04147&quot;&gt;PABP_YEAST&lt;/a&gt;, &lt;a href=&quot;https://www.uniprot.org/uniprotkb/Q9ES00/entry&quot;&gt;UBE4B_MOUSE&lt;/a&gt;, and &lt;a href=&quot;https://www.uniprot.org/uniprotkb/P42212/entry&quot;&gt;GFP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Compared to random-walk-based black-box samplers (simulated annealing, Metropolis-adjusted Langevin Algorithm (MALA)), random sampling, and an evolutionary method (CMA-ES), our sampler explores sequence space most efficiently (higher $\log p(x)$ is better).&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/PPDE/sampler_lineplots.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We show that we can plug in different &lt;a href=&quot;https://www.biorxiv.org/content/10.1101/2022.07.20.500902v3&quot;&gt;protein language models&lt;/a&gt;: ESM2-35M, ESM2-150M, and ESM2-650M. We also combine ESM2-150M with a Potts model, which, surprisingly, achieves the best performance (red arrow).&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/PPDE/PPDE_priors.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;See the paper and code for more details and results.&lt;/p&gt;
</description>
        <pubDate>Thu, 05 Jan 2023 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/blog/2023/01/05/ppde.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/blog/2023/01/05/ppde.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Part 1: Deploying a PyTorch MobileNetV2 Classifier on the Intel Neural Compute Stick 2</title>
        <description>&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;This is the first part of a three-part tutorial on using the Intel Neural Compute Stick 2 (NCS2) for vehicle tracking at traffic intersections. The goal of this first part is to get familiar with the NCS2 by walking through how to convert and run a PyTorch MobileNetV2 model on the NCS2 on Windows 10. With minor modifications, the guide should also work on supported Linux distros such as Ubuntu 18.04.x and 20.04.0, as well as macOS 10.15.x.&lt;/p&gt;

&lt;h3 id=&quot;getting-started&quot;&gt;Getting started&lt;/h3&gt;

&lt;p&gt;I bought my NCS2 off Amazon for about 80 USD.&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/NCS2/IMG_5455.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We’ll start by setting it up on a Windows 10 PC. My PC has a 6-core Intel i5-9600K processor.&lt;/p&gt;

&lt;p&gt;To begin, plug the NCS2 into a USB port on your machine. I then followed the steps from the &lt;a href=&quot;https://docs.openvinotoolkit.org/latest/openvino_docs_install_guides_installing_openvino_windows.html&quot;&gt;Intel Windows guide&lt;/a&gt; to install the OpenVINO toolkit.
For the dependencies, using Anaconda for Windows to get Python 3.8 was easy and painless. Make sure to also follow the steps to install Microsoft Visual Studio 2019 with MSBuild and CMake. The only optional step in the guide worth following is adding Python and CMake to your PATH environment variable.&lt;/p&gt;

&lt;p&gt;Once you finish, to test the installation, run the following demo in the Windows CLI:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd &quot;C:\Program Files (x86)\Intel\openvino_2021\deployment_tools\demo&quot;
.\demo_squeezenet_download_convert_run.bat -d MYRIAD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You should see predicted scores:&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/NCS2/squeezenet_demo.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;the-pytorch-to-onnx-conversion&quot;&gt;The PyTorch to ONNX Conversion&lt;/h3&gt;

&lt;p&gt;Next, we’ll try to port a pre-trained MobileNetV2 PyTorch model to the ONNX format based on &lt;a href=&quot;https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html&quot;&gt;this tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Install PyTorch (cpu-only is fine) following the instructions &lt;a href=&quot;https://pytorch.org/&quot;&gt;here&lt;/a&gt; and ONNX with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install onnx onnxruntime&lt;/code&gt;. If you are using a clean Python 3.8 conda environment, you may also want to install jupyter at this time.&lt;/p&gt;

&lt;p&gt;You can run the code snippets shown throughout the remainder of this guide by downloading and launching &lt;a href=&quot;/notebooks/ncs2/NCS2_MobileNetV2_PyTorch_Demo.ipynb&quot;&gt;this&lt;/a&gt; Jupyter notebook. See the ONNX tutorial linked above for detailed explanations of each cell.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Some standard imports
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch.onnx&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torchvision.models&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;models&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;onnx&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;onnxruntime&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can grab a pre-trained MobileNetV2 model from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torchvision&lt;/code&gt; model zoo.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;models&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mobilenet_v2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pretrained&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;eval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Export the model to the ONNX format using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.onnx.export&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Input to the model
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;torch_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Export the model
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;onnx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;export&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;# model being run
&lt;/span&gt;                  &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                         &lt;span class=&quot;c1&quot;&gt;# model input (or a tuple for multiple inputs)
&lt;/span&gt;                  &lt;span class=&quot;s&quot;&gt;&quot;mobilenet_v2.onnx&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;# where to save the model (can be a file or file-like object)
&lt;/span&gt;                  &lt;span class=&quot;n&quot;&gt;export_params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# store the trained parameter weights inside the model file
&lt;/span&gt;                  &lt;span class=&quot;n&quot;&gt;opset_version&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;          &lt;span class=&quot;c1&quot;&gt;# the ONNX version to export the model to
&lt;/span&gt;                  &lt;span class=&quot;n&quot;&gt;do_constant_folding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# whether to execute constant folding for optimization
&lt;/span&gt;                  &lt;span class=&quot;n&quot;&gt;input_names&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;input&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# the model&apos;s input names
&lt;/span&gt;                  &lt;span class=&quot;n&quot;&gt;output_names&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;output&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# the model&apos;s output names
&lt;/span&gt;                  &lt;span class=&quot;n&quot;&gt;dynamic_axes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;input&apos;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;batch_size&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# variable length axes
&lt;/span&gt;                                &lt;span class=&quot;s&quot;&gt;&apos;output&apos;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;batch_size&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will create a 14MB file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mobilenet_v2.onnx&lt;/code&gt;. Let’s check its validity using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;onnx&lt;/code&gt; library.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;onnx_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;onnx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;mobilenet_v2.onnx&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;onnx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;checker&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;check_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;onnx_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, let’s run the model using the ONNX Runtime in an inference session to compare its results with the PyTorch results.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;ort_session&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;onnxruntime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InferenceSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;mobilenet_v2.onnx&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;to_numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;detach&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# compute ONNX Runtime output prediction
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ort_inputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ort_session&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to_numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ort_outs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ort_session&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ort_inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# compare ONNX Runtime and PyTorch results
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;testing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;assert_allclose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch_out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ort_outs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rtol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;atol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Exported model has been tested with ONNXRuntime, and the result looks good!&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;the-onnx-to-openvino-conversion&quot;&gt;The ONNX to OpenVINO Conversion&lt;/h3&gt;

&lt;p&gt;Next we need to pass &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mobilenet_v2.onnx&lt;/code&gt; through the OpenVINO Model Optimizer to convert it into the proper intermediate representation (IR).&lt;/p&gt;

&lt;p&gt;I followed the steps listed &lt;a href=&quot;https://docs.openvinotoolkit.org/latest/openvino_docs_MO_DG_prepare_model_convert_model_Convert_Model_From_ONNX.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Go to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;OPENVINO_INSTALL_DIR&amp;gt;/deployment_tools/model_optimizer&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python mo.py --input_model &amp;lt;PATH_TO_INPUT_MODEL&amp;gt;.onnx --output_dir &amp;lt;OUTPUT_MODEL_DIR&amp;gt; --input_shape [1,3,224,224]&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This creates an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.xml&lt;/code&gt; file, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.bin&lt;/code&gt; file, and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.mapping&lt;/code&gt; file, which together make up the IR for running on the NCS2.&lt;/p&gt;
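&lt;p&gt;A quick sanity check that the conversion produced all three files; the directory and model name here are hypothetical placeholders for your own --output_dir and input model:&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical paths: substitute your own --output_dir and model name.
ir_dir = Path("ir_output")
expected = [ir_dir / ("mobilenet_v2" + ext) for ext in (".xml", ".bin", ".mapping")]
missing = [p.name for p in expected if not p.exists()]
print("missing IR files:", missing)  # an empty list means the IR is complete
```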

&lt;h3 id=&quot;running-the-mobilenetv2-ir-on-the-ncs2&quot;&gt;Running the MobileNetV2 IR on the NCS2&lt;/h3&gt;

&lt;p&gt;Follow along again with &lt;a href=&quot;/notebooks/ncs2/NCS2_MobileNetV2_PyTorch_Demo.ipynb&quot;&gt;the same&lt;/a&gt; Jupyter notebook. Download the &lt;a href=&quot;/notebooks/ncs2/mobilenet_v2.labels&quot;&gt;ImageNet label file&lt;/a&gt; and this &lt;a href=&quot;/notebooks/ncs2/car.png&quot;&gt;test image&lt;/a&gt; and place them in the same directory as the notebook. In the Windows CLI, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call &amp;lt;OpenVINO_INSTALL_DIR&amp;gt;\bin\setupvars.bat&lt;/code&gt; to initialize the Inference Engine Python API, which is needed to import the OpenVINO python libraries (you may need to restart the notebook).&lt;/p&gt;

&lt;p&gt;I had to manually edit L38 of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;OPENVINO_INSTALL_DIR&amp;gt;\python\python3.8\ngraph\impl\__init__.py&lt;/code&gt; to read &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (3, 8) &amp;gt; sys.version_info&lt;/code&gt; to fix an error. Similarly for L28 of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;OPENVINO_INSTALL_DIR&amp;gt;\python\python3.8\openvino\inference_engine\__init__.py&lt;/code&gt;. Note that this seems to be because I am using Python 3.8; it may be fixed in a subsequent version of OpenVINO.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;__future__&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;print_function&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;cv2&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;logging&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# These are only available after running setupvars.bat
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;openvino.inference_engine&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IECore&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;ngraph&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ng&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then we set some variables and initialize the Inference Engine.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;PATH_TO&amp;gt;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;mobilenet_v2.xml&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;MYRIAD&apos;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Intel NCS2 device name
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;basicConfig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;[ %(levelname)s ] %(message)s&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INFO&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stdout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Loading Inference Engine&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ie&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IECore&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Inference Engine
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Read the model in OpenVINO Intermediate Representation (.xml).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Loading network:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\t&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ie&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_network&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Query the Inference Engine plugin for the target device and print its version info.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Device info:&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;versions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ie&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_versions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos; &apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos; &apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Plugin version ......... &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;versions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;major&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;versions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minor&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos; &apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Build ........... &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;versions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build_number&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Read and preprocess input, following the preprocessing steps for torchvision pre-trained models.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;car.png&apos;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_key&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;layout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;
        
&lt;span class=&quot;n&quot;&gt;images&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;images_hw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;imread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ih&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;images_hw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ih&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;File was added: &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;        &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ih&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;iw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;warning&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Image &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; is resized from &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; to &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;resize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# BGR to RGB
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvtColor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;COLOR_BGR2RGB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Change data layout from HWC to CHW 
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;  
    &lt;span class=&quot;c1&quot;&gt;# Change to (0,1)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;255.&lt;/span&gt; 
    &lt;span class=&quot;c1&quot;&gt;# Subtract mean
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.485&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.456&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.406&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Divide by std 
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.229&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.225&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;images&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
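&lt;p&gt;The same normalization can be sanity-checked without OpenCV on a synthetic mid-gray image; this sketch assumes only NumPy and the standard ImageNet mean/std used above:&lt;/p&gt;

```python
import numpy as np

# Synthetic 224x224 mid-gray image in HWC layout, mimicking cv2.imread output.
image = np.full((224, 224, 3), 128, dtype=np.uint8)
image = image[..., ::-1]              # BGR to RGB (a no-op for gray, shown for parity)
image = image.transpose((2, 0, 1))    # HWC to CHW
image = image / 255.0                 # scale to [0, 1]; also casts to float
image -= np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)  # subtract mean per channel
image /= np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)  # divide by std per channel
print(image.shape)  # (3, 224, 224)
```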

&lt;p&gt;Prepare input and output data. We use the default FP32 precision, but for future use, we could explore models that use lower precisions for faster runtimes.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Preparing input blobs&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sample supports topologies only with 1 or 2 inputs&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out_blob&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Just a single output, the class prediction
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_key&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;input_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_key&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;precision&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;FP32&apos;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;images&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Load the model to the NCS2 and do a single forward pass.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Loading model to the device&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;exec_net&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ie&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load_network&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;network&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;device_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Creating infer request and starting inference&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exec_net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;infer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Display the top 10 predictions.&lt;/p&gt;
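&lt;p&gt;The top-k selection relies on an argsort-then-reverse idiom; here is a tiny standalone illustration on a dummy probability vector, with k=3 instead of 10:&lt;/p&gt;

```python
import numpy as np

# np.argsort sorts ascending; take the last k indices and reverse them
# to get the top-k classes by descending probability.
probs = np.array([0.05, 0.6, 0.1, 0.2, 0.05])
k = 3
top_ind = np.argsort(probs)[-k:][::-1]
print(top_ind.tolist())  # [1, 3, 2]: indices in descending probability order
```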

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Processing output blob&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Top 10 results: &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;labels_map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;mobilenet_v2.labels&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;r&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;labels_map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;classid_str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;label&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;probability_str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;probability&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out_blob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;squeeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;top_ind&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argsort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:][::&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Image &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;classid_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probability_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;-&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;classid_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;-&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;probability_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top_ind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;det_label&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;labels_map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;labels_map&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;label_length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;det_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;space_num_before&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;label_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;//&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;space_num_after&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;space_num_before&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;label_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;det_label&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; ... &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; %&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/notebooks/ncs2/car.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;car.png&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;label probability
----- -----------
convertible ... 11.8281250 %
beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon ... 11.5546875 %
sports car, sport car ... 11.4296875 %
pickup, pickup truck ... 11.2734375 %
car wheel ... 10.7968750 %
grille, radiator grille ... 10.3515625 %
Model T ... 8.8203125 %
minivan ... 8.7578125 %
cab, hack, taxi, taxicab ... 8.5625000 %
racer, race car, racing car ... 8.1250000 %
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Great! In part two of this tutorial, we’ll step through the process of running MobileNetV2-SSDLite for object detection on the NCS2 with a Raspberry Pi.&lt;/p&gt;

&lt;h3 id=&quot;references&quot;&gt;References&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. “Mobilenetv2: Inverted residuals and linear bottlenecks.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510-4520. 2018.&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Fri, 09 Jul 2021 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/blog/2021/07/09/part-1-neural-compute-stick-2.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/blog/2021/07/09/part-1-neural-compute-stick-2.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>A Symmetric and Object-centric World Model for Stochastic Environments</title>
        <description>&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Patrick Emami&lt;/strong&gt;, Pan He, Anand Rangarajan, Sanjay Ranka. &lt;br /&gt;
NeurIPS Object Representations for Learning and Reasoning Workshop. 2020. (&lt;em&gt;Spotlight presentation&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;&lt;span style=&quot;color:blue&quot;&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;:&lt;/span&gt; We identify limitations of applying certain object-centric world models to stochastic environments and propose a state space model and training objective that achieves better stochastic future prediction.&lt;/p&gt;

&lt;h2 id=&quot;abstract&quot;&gt;Abstract&lt;/h2&gt;

&lt;p&gt;Object-centric world models learn useful representations for planning and control but have so far only been applied to synthetic and deterministic environments. We introduce a perceptual-grouping-based world model for the dual task of extracting object-centric representations and modeling stochastic dynamics in visually complex and noisy video environments. The world model is built upon a novel latent state space model that learns the variance for object discovery and dynamics separately. This design is motivated by the disparity in available information that exists between the discovery component, which takes a provided video frame and decomposes it into objects, and the dynamics component, which predicts representations for future video frames conditioned only on past frames. To learn the dynamics variance, we introduce a best-of-many-rollouts objective. We show that the world model successfully learns accurate and diverse rollouts in a real-world robotic manipulation environment with noisy actions while learning interpretable object-centric representations.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/orlrworkshop/orlrworkshop.github.io/blob/master/pdf/ORLR_3.pdf&quot;&gt;[paper]&lt;/a&gt; &lt;a href=&quot;https://github.com/pemami4911/symmetric-and-object-centric-world-models&quot;&gt;[code]&lt;/a&gt; &lt;a href=&quot;/pdfs/Workshop_poster_HD.pdf&quot;&gt;[poster]&lt;/a&gt;&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/orlr/orlr_main.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;demos---bair-towel-pick-30k-with-noisy-actions&quot;&gt;Demos - BAIR Towel Pick 30K with noisy actions&lt;/h2&gt;

&lt;!-- 

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot; &gt;
	
	&lt;img src=&quot;/img/orlr/BAIR_Ours-video-1-rollout-0.gif&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot;/&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;



&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot; &gt;
	
	&lt;img src=&quot;/img/orlr/BAIR_VRNN-video-1-rollout-0.gif&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot;/&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;/div&gt;
&lt;/div&gt;
 --&gt;
&lt;p&gt;From left to right: ground truth, &lt;strong&gt;Ours&lt;/strong&gt;, OP3, VRNN:&lt;/p&gt;

&lt;div class=&quot;image-wrapper&quot;&gt;
    &lt;img src=&quot;/img/orlr/BAIR_gt-video-1-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-video-1-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_OP3-video-1-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_VRNN-video-1-rollout-0.gif&quot; /&gt;
&lt;/div&gt;

&lt;div class=&quot;image-wrapper&quot;&gt;
    &lt;img src=&quot;/img/orlr/BAIR_gt-video-2-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-video-2-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_OP3-video-2-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_VRNN-video-2-rollout-0.gif&quot; /&gt;
&lt;/div&gt;

&lt;div class=&quot;image-wrapper&quot;&gt;
    &lt;img src=&quot;/img/orlr/BAIR_gt-video-3-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-video-3-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_OP3-video-3-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_VRNN-video-3-rollout-0.gif&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Object decompositions, &lt;strong&gt;Ours&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;image-wrapper&quot;&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-slot-video-1-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-slot-video-3-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-slot-video-4-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-slot-video-5-rollout-0.gif&quot; /&gt;
    &lt;img src=&quot;/img/orlr/BAIR_Ours-slot-video-9-rollout-0.gif&quot; /&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 08 Dec 2020 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/blog/2020/12/08/symmetric-and-object-centric-world-models.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/blog/2020/12/08/symmetric-and-object-centric-world-models.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>What Can Neural Networks Reason About?</title>
        <description>&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://openreview.net/forum?id=rJxbJeHFPS&quot;&gt;Xu, Li, Zhang, Du, Kawarabayashi, Jegelka, 2020&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;This work proposes a theory they call “algorithmic alignment” to explain why some classes of neural net architectures generalize much better than others on certain reasoning problems. They use PAC learning to derive sample complexity bounds that show that the number of samples needed to achieve a desired amount of generalization increases when certain subproblems are “hard” to learn.&lt;/p&gt;

&lt;p&gt;For example, they empirically show that DeepSets can easily learn summary statistics for a set of numbers, like “max” or “min”. Since a single MLP has to learn the aggregation function (an easy subproblem) plus the for loop over all elements (a harder subproblem), their theory suggests that the number of samples required to achieve good generalization at test time is much higher for the MLP, which their experiments confirm. Interestingly, they explain why graph neural nets (GNNs) align well with dynamic programming (DP) problems (because of their iterative message-passing-style updates), but then also explain why they do not align well with NP-Hard problems. They provide further experimental evidence on a shortest-paths DP problem and a Subset-Sum NP-Hard problem to verify this.&lt;/p&gt;
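&lt;p&gt;To make this sum-decomposition structure concrete, here is a toy, hand-constructed DeepSets-style function (my own sketch, not from the paper): a per-element encoder $\phi$, a sum aggregation, and a decoder $\rho$ together compute the mean of a set, and the sum makes the output invariant to the ordering of the elements.&lt;/p&gt;

```python
import numpy as np

def phi(x):
    # per-element encoder: embed each element as (value, count)
    return np.array([x, 1.0])

def rho(s):
    # decoder applied to the summed embeddings: recovers the mean
    return s[0] / s[1]

def deepsets_mean(xs):
    # sum aggregation over elements makes the function permutation-invariant
    return rho(sum(phi(x) for x in xs))

result = deepsets_mean([3.0, 1.0, 2.0])
reversed_result = deepsets_mean([2.0, 1.0, 3.0])
```

&lt;p&gt;A learned DeepSets model replaces $\phi$ and $\rho$ with MLPs; the point above is that only the (easy) aggregation needs to be learned, not the for loop itself.&lt;/p&gt;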

&lt;h2 id=&quot;what-about-transformers&quot;&gt;What About Transformers?&lt;/h2&gt;

&lt;p&gt;The paper doesn’t discuss Transformers, so as a simple exercise, I thought about how they fit into this framework. &lt;a href=&quot;https://graphdeeplearning.github.io/post/transformers-are-gnns/&quot;&gt;Transformers map readily onto fully connected GNNs&lt;/a&gt;, which suggests that they should “align algorithmically” with DP problems but not NP-Hard problems. Note that for a set of $K$ objects, a multi-head attention Transformer/GNN performs an $O(K^2 d)$ operation. This highlights one limitation of Transformers: they can easily reason about object-object relations, but will struggle to generalize when faced with higher-order relations such as object-object-object. Relational reasoning over $k$-partite graphs, $k &amp;gt; 2$, shows up in certain NP-Hard problems like &lt;a href=&quot;https://pubsonline.informs.org/doi/pdf/10.1287/opre.16.2.422&quot;&gt;Multidimensional Assignment&lt;/a&gt;. Algorithmic alignment should prove useful for future research on designing neural net architectures for NP-Hard problems.&lt;/p&gt;
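&lt;p&gt;To make the $O(K^2 d)$ cost concrete, here is a minimal single-head self-attention sketch (my own, not from the paper) that materializes the $K \times K$ matrix of pairwise object-object scores:&lt;/p&gt;

```python
import numpy as np

def self_attention(X):
    # X: (K, d) array of K objects; scores are pairwise, hence O(K^2 d)
    K, d = X.shape
    scores = (X @ X.T) / np.sqrt(d)            # (K, K) object-object relations
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)       # row-wise softmax
    return A @ X                               # (K, d) updated representations

X = np.random.default_rng(0).normal(size=(5, 3))
out = self_attention(X)
```

&lt;p&gt;Every entry of the score matrix relates exactly two objects, which is why this architecture captures second-order but not higher-order relations directly.&lt;/p&gt;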
</description>
        <pubDate>Sun, 22 Mar 2020 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/paper-summaries/deep-learning-theory/2020/03/22/what-can-neural-nets-reason-about.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/paper-summaries/deep-learning-theory/2020/03/22/what-can-neural-nets-reason-about.html</guid>
        
        
        <category>paper-summaries</category>
        
        <category>deep-learning-theory</category>
        
      </item>
    
      <item>
        <title>MLSS 2019</title>
        <description>&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;Now that the &lt;a href=&quot;https://sites.google.com/view/mlss-2019&quot;&gt;2019 London MLSS&lt;/a&gt; is over, I thought I’d share a couple things about my experience and summarize some of the fascinating technical content I learned over the 10 days of lectures. This will not only help improve my recall of the material, but will also make it easy to share some of what I learned with my lab and with others that weren’t able to attend. If you find any mistakes, please feel free to reach out so I can make corrections. :)&lt;/p&gt;

&lt;p&gt;First and foremost, however, let me send a huge thank you to the organizers, Arthur Gretton and Marc Deisenroth, for putting this amazing event together. It was an incredibly rewarding experience for me. Also, thank you to all the lecturers and volunteers as well!&lt;/p&gt;

&lt;p&gt;Before I jump into discussing technical highlights…&lt;/p&gt;

&lt;h2 id=&quot;wait-why-are-phd-students-going-to-summer-school&quot;&gt;Wait, why are PhD students going to summer school?&lt;/h2&gt;

&lt;p&gt;For those of you who may not know what a summer school is: summer schools are typically focused on a specific topic, have daily lectures and practicals, and can last from a few days to two weeks. They are usually held somewhere exciting, so that students can explore the area together. There’s also usually a poster session for students to share their ongoing work. At this MLSS, days were structured as two 2-hour lectures followed by a practical that involved writing code for some exercises.&lt;/p&gt;

&lt;p&gt;From what I’ve been told and from what I experienced, one of the best reasons to attend a summer school is to make friends with other highly motivated students from around the world. We were lucky that Arthur and Marc assembled an amazingly diverse cohort: out of the ~200 students that attended, 53 nationalities were represented!&lt;/p&gt;

&lt;h2 id=&quot;it-wasnt-all-work&quot;&gt;It wasn’t all work&lt;/h2&gt;

&lt;p&gt;I was surrounded by a ton of brilliant people from the moment I arrived in London. Here are some highlights from the fun times we had:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Experienced the &lt;a href=&quot;https://www.bbc.com/news/uk-49157898&quot;&gt;hottest day in UK history&lt;/a&gt;. OK so this wasn’t “fun” but our joint suffering was a great bonding experience. I’m used to this type of heat back in Florida, but in London most buildings don’t have air conditioning.&lt;/li&gt;
  &lt;li&gt;On Saturday afternoon, we bought roughly half the food at a local grocery store to feed a large group of us having a picnic in Green Park near Buckingham Palace.&lt;/li&gt;
  &lt;li&gt;A group of us saw &lt;em&gt;A Midsummer Night’s Dream&lt;/em&gt; at the Open Air Theatre in Regent’s Park.&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/MLSS/mlss2019.jpeg&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;After our tour of the Tower of London&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;technical-highlights-from-mlss&quot;&gt;Technical highlights from MLSS&lt;/h2&gt;
&lt;p&gt;I’ll only be discussing some of the lectures in this post, but all of the slides and lectures are available at the &lt;a href=&quot;https://github.com/mlss-2019&quot;&gt;MLSS Github page&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;optimization&quot;&gt;Optimization&lt;/h2&gt;

&lt;p&gt;Lecturer: John Duchi [&lt;a href=&quot;https://github.com/mlss-2019/slides/tree/master/optimisation&quot;&gt;Slides&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/d6i9-ZvjDmY&quot;&gt;Video 1&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/iNrBhcEjZbc&quot;&gt;Video 2&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;John first introduced us to the main ideas of convex optimization and then covered some special topics. A big takeaway from his lecture is to stop and think about each particular optimization problem for a bit instead of just tossing SGD at it (supposedly we shouldn’t even say “SGD”, because the trajectories taken by stochastic gradient methods don’t strictly “descend”; rather, they wander around quite a bit). His point is that we can do a lot better.&lt;/p&gt;

&lt;p&gt;The concepts of subgradients and a subdifferential were new to me. A vector $g$ is a &lt;strong&gt;subgradient&lt;/strong&gt; of a function $f$ at $x$ if 
\(f(y) \geq f(x) + \langle g, y - x \rangle.\)&lt;/p&gt;

&lt;p&gt;In a picture:&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/MLSS/IMG_DD39FE1AF60A-1.jpeg&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;A subgradient of the absolute value function&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The subdifferential is the set of all vectors $g$ that satisfy the above inequality. These are useful because they can be computed so long as $f$ can be evaluated, which means we can derive optimization algorithms in terms of subgradients to make them more broadly applicable.&lt;/p&gt;
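&lt;p&gt;One such algorithm is the subgradient method. As a tiny illustration (mine, not from the lecture), applying it to $f(x) = |x|$, which is not differentiable at $0$: taking $g = \mathrm{sign}(x)$ as a subgradient and using a diminishing step size drives the iterates toward the minimizer.&lt;/p&gt;

```python
import numpy as np

def subgradient_method(x0, steps=500):
    x = x0
    for t in range(steps):
        g = np.sign(x)                 # a valid subgradient of abs(x); 0 at x = 0
        x = x - (1.0 / (t + 1)) * g    # diminishing step size
    return x

x_final = subgradient_method(5.0)
# x_final ends up close to the minimizer x = 0
```

&lt;p&gt;Once the iterate crosses the kink at $0$, it oscillates with amplitude bounded by the shrinking step size, which is why diminishing steps are essential for nonsmooth functions.&lt;/p&gt;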

&lt;p&gt;Another new concept was the notion of $\rho$-weakly convex functions: a function $f$ is $\rho$-weakly convex if adding a large enough quadratic to it, i.e., forming $f(x) + \frac{\rho}{2}\|x\|^2$, yields a convex function.&lt;/p&gt;
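&lt;p&gt;For instance (my own example, not from the lecture), $\sin(x)$ is $1$-weakly convex, since its second derivative is bounded below by $-1$; adding $\frac{1}{2}x^2$ therefore yields a convex function, which we can check numerically with discrete second differences:&lt;/p&gt;

```python
import numpy as np

# sin(x) is 1-weakly convex: sin(x) + 0.5 * x**2 should have nonnegative curvature
x = np.linspace(-10.0, 10.0, 2001)
f = np.sin(x) + 0.5 * x**2
second_diff = f[2:] - 2.0 * f[1:-1] + f[:-2]   # discrete second derivative, up to h**2
min_curvature = second_diff.min()
```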

&lt;p&gt;Ultimately, John gave us three steps to follow:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Make a local approximation to your function&lt;/li&gt;
  &lt;li&gt;Add regularization to “avoid bouncing around like an idiot”&lt;/li&gt;
  &lt;li&gt;Optimize approximately&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;additional-resources&quot;&gt;Additional Resources&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://web.stanford.edu/~boyd/cvxbook/&quot;&gt;Convex Optimization - Boyd and Vandenberghe&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1810.05633&quot;&gt;Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity - Asi, Duchi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;interpretability&quot;&gt;Interpretability&lt;/h2&gt;

&lt;p&gt;Lecturer: Sanmi Koyejo [&lt;a href=&quot;https://github.com/mlss-2019/slides/tree/master/interpretability&quot;&gt;Slides&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/dMzqU9aaTFQ&quot;&gt;Video 1&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/KO8rdIaij8s&quot;&gt;Video 2&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;When building an ML system for a task where the repercussions of false positives/false negatives are unknown, it is usually best to defer decision-making to someone downstream&lt;/li&gt;
  &lt;li&gt;Monotonicity as an interpretability criterion: if feature $X$ “increases”, the output should “increase” as well&lt;/li&gt;
  &lt;li&gt;A standard definition of what it means for a model to be “interpretable” is still missing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;additional-resources-1&quot;&gt;Additional Resources&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/jphall663/awesome-machine-learning-interpretability&quot;&gt;Awesome-ML-Interpretability&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://distill.pub/2017/feature-visualization/&quot;&gt;Distill Pub’s Feature Viz article&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime&quot;&gt;Introduction to LIME&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;gaussian-processes&quot;&gt;Gaussian Processes&lt;/h2&gt;

&lt;p&gt;Lecturer: James Hensman [&lt;a href=&quot;https://github.com/mlss-2019/slides/tree/master/gaussian_processes&quot;&gt;Slides&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/GxZWMgRydoM&quot;&gt;Video 1&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/uuKyVS5K8F0&quot;&gt;Video 2&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;GPs have always been a bit mind-boggling to me. Good thing MLSS was held in the UK, where there seems to be a high concentration of GP/Bayesian ML researchers. Also, James’ slides contain amazing visuals.&lt;/p&gt;

&lt;p&gt;What’s a GP? As MacKay puts it, GPs are smoothing techniques that involve placing a Gaussian prior over the &lt;em&gt;functions&lt;/em&gt; in the hypothesis class.&lt;/p&gt;

&lt;p&gt;I highly recommend checking out Lab 0 and Lab 1 from the GP practical &lt;a href=&quot;https://github.com/mlss-2019/tutorials/tree/master/gaussian_processes&quot;&gt;here&lt;/a&gt; which were great for building the intuitions.&lt;/p&gt;

&lt;h3 id=&quot;multivariate-normals&quot;&gt;Multivariate normals&lt;/h3&gt;

&lt;p&gt;GPs heavily make use of &lt;a href=&quot;https://en.wikipedia.org/wiki/Multivariate_normal_distribution&quot;&gt;multivariate normal distributions&lt;/a&gt; (MVN).
In fact, GPs are based on infinite-dimensional MVNs… It took a while for me to wrap my head around this, but these plots from James’ slides really helped:&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/MLSS/GP-1.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;samples from a 2D MVN. See slides for GIF versions.&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The plot on the RHS is the important one. The x-axis represents the two dimensions of this MVN; there are no units for the x-axis, it simply depicts each of the N dimensions of an MVN from 1 to N (the order matters!). The y-axis depicts the values of a random variate sampled from this MVN. A line is drawn connecting $f(x_1)$ to $f(x_2)$. For a 6D MVN, there would be 6 ticks on the x-axis and a line connecting all 6 function values. The trick with GPs is that we (theoretically) let the number of dimensions go to infinity, so the x-axis becomes a continuum. But in practice, we only have a finite number of data points, and we assume the $(x_i, f(x_i))$ pairs are observations of this unknown function.&lt;/p&gt;

&lt;p&gt;Confused? Let’s take a step back. In Lab 0, we build some intuition by looking at Linear Regression and then the extension to Bayesian Linear Regression. For GP regression, we make the usual assumption that some unknown target function generated our data, except now our hypothesis class is much more expressive. GPs place a prior over the functions in the hypothesis class. They do this by defining a (Gaussian) stochastic process indexed by the data points:&lt;/p&gt;

\[p\Biggl(\begin{bmatrix} f(x_1)\\ f(x_2) \\ \cdots \\ f(x_n) \end{bmatrix} \Biggr) = \mathscr{N} \Biggl( \begin{bmatrix} \mu(x_1)\\ \mu(x_2) \\ \cdots \\ \mu(x_n) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) &amp;amp; \cdots &amp;amp; k(x_1,x_n)\\
                                \vdots  &amp;amp; &amp;amp;  \vdots \\
                                k(x_n, x_1) &amp;amp;  \cdots &amp;amp; k(x_n, x_n)\end{bmatrix} \Biggr).\]
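&lt;p&gt;This finite-dimensional picture is easy to reproduce numerically. A sketch (mine, assuming a zero mean function and an RBF kernel) that jointly samples function values from the MVN above on a grid of inputs:&lt;/p&gt;

```python
import numpy as np

def rbf_kernel(xs, lengthscale=1.0, variance=1.0):
    # k(x, y) = variance * exp(-0.5 * (x - y)**2 / lengthscale**2)
    d = xs[:, None] - xs[None, :]
    return variance * np.exp(-0.5 * d**2 / lengthscale**2)

xs = np.linspace(0.0, 5.0, 50)
K = rbf_kernel(xs) + 1e-8 * np.eye(len(xs))   # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
# each row of samples is one function drawn from the GP prior
```

&lt;p&gt;Plotting each row against the grid gives exactly the connected-line picture described above, with smoothness controlled by the lengthscale.&lt;/p&gt;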

&lt;h3 id=&quot;covariance-functions&quot;&gt;Covariance functions&lt;/h3&gt;
&lt;p&gt;Note that we can usually transform our data to be zero-mean, so the mean function $\mathbf{\mu}$ is generally assumed to be $0$ and our main concern is the kernel matrix $K$, also known as the Gram matrix. $K$ is built by evaluating the &lt;strong&gt;covariance function&lt;/strong&gt; of the GP at all pairs of inputs. Unlike regular (Bayesian) Linear Regression, GPs have a certain useful structure that comes with formulating the problem as a (Gaussian) stochastic process. That is, the kernel matrix $K$ enables us to incorporate prior knowledge about how we expect the data points to interact. In the next section, I’ll talk about kernels and the kernel matrix $K$ in more detail.&lt;/p&gt;

&lt;p&gt;Once we have the basic machinery of GPs down, we can do some cool stuff like compute the conditional distribution of the function that fits a test point $x^*$ given a training set of point pairs $(x_i, f(x_i))$.&lt;/p&gt;
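&lt;p&gt;That conditional is available in closed form. Here is a minimal sketch (mine, not from the lecture) of the standard GP conditioning formulas, assuming a zero prior mean, an RBF kernel, and noise-free observations:&lt;/p&gt;

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * d**2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, jitter=1e-8):
    # standard GP conditioning: posterior mean and covariance at the test inputs
    K = rbf(x_train, x_train) + jitter * np.eye(len(x_train))
    K_s = rbf(x_test, x_train)
    K_ss = rbf(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

x_train = np.array([0.0, 1.0, 3.0])
y_train = np.sin(x_train)
mean, cov = gp_posterior(x_train, y_train, np.array([0.0, 2.0]))
# with noise-free data, the posterior mean interpolates the training points
```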

&lt;p&gt;We also covered model selection using the marginal likelihood of a GP. Kernels are typically parameterized by a lengthscale and variance parameter, and the marginal likelihood objective for selecting these kernel parameters nicely trades off model complexity with data fit.&lt;/p&gt;

&lt;p&gt;Finally, I’ll briefly mention something we touched on at the end of the GP lecture that was interesting. Since GPs require inverting $K$ to do inference, &lt;em&gt;sparse&lt;/em&gt; GPs cleverly select a small set of “pseudo-inputs” that approximate the original GP well and allow for using GPs on much larger datasets.&lt;/p&gt;

&lt;p&gt;GPs are useful in modern ML because many problems are concerned with interpolating a function given noisy observations using a Bayesian approach.&lt;/p&gt;

&lt;h3 id=&quot;additional-resources-2&quot;&gt;Additional Resources&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.gaussianprocess.org/gpml/chapters/&quot;&gt;Gaussian Processes for Machine Learning - Rasmussen and Williams&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://papers.nips.cc/paper/6877-convolutional-gaussian-processes&quot;&gt;Convolutional Gaussian Processes - van der Wilk, Rasmussen, Hensman&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://proceedings.mlr.press/v97/burt19a.html&quot;&gt;Rates of Convergence for Sparse Variational Gaussian Process Regression - Burt, Rasmussen, van der Wilk&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;kernels&quot;&gt;Kernels&lt;/h2&gt;

&lt;p&gt;Lecturer: Lorenzo Rosasco [&lt;a href=&quot;https://github.com/mlss-2019/slides/tree/master/kernels&quot;&gt;Slides&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/6bUEdtUmh_4&quot;&gt;Video 1&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/uHPi7q0QuY0&quot;&gt;Video 2&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;You may have heard of the &lt;em&gt;kernel trick&lt;/em&gt;, which involves replacing all inner products of features in an algorithm with a kernel evaluation, e.g., as in &lt;a href=&quot;https://sites.google.com/site/dataclusteringalgorithms/kernel-k-means-clustering-algorithm&quot;&gt;Kernel K-Means&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;more-than-just-a-trick&quot;&gt;More than just a trick&lt;/h4&gt;

&lt;p&gt;Lorenzo showed us some recent work he’s been doing on scaling up kernel methods to the sizes of datasets commonly consumed by deep learning. To do this type of research, it is important to understand the theory of kernel methods. We covered some of this during the lecture, like Reproducing Kernel Hilbert Spaces (RKHSs). I think this lecture was one of the most difficult of the entire MLSS to digest in the time allotted.&lt;/p&gt;

&lt;h4 id=&quot;rkhs&quot;&gt;RKHS&lt;/h4&gt;

&lt;p&gt;Here is my attempt to summarize my understanding of RKHSs. Please see Arthur’s notes (linked below), especially Section 4.1, for a great explanation.&lt;/p&gt;

&lt;p&gt;An RKHS is an abstract mathematical space closely intertwined with the notion of kernels. An RKHS is an extension of a Hilbert space; that is, it has all of the properties of a Hilbert space (it’s a vector space, has an inner product, and is complete), plus two other key properties. Note that RKHSs are &lt;em&gt;function spaces&lt;/em&gt;; the primitive object that lives in an RKHS is a function.&lt;/p&gt;

&lt;p&gt;In fact, &lt;em&gt;evaluating&lt;/em&gt; a function in the RKHS maps points from $\mathscr{X}$ to $\mathbb{R}$, where $\mathscr{X}$ is the data space. However, the “representation” of a function in the RKHS could be an infinite-dimensional vector, if the feature maps corresponding to a particular kernel are infinite dimensional. Since we only ever need inner products (evaluations) of the functions in the RKHS, we never actually have to deal with these individual, potentially infinite-dimensional representations. This took me a long time to wrap my head around (see Lemma 9 in Arthur’s notes).&lt;/p&gt;

&lt;p&gt;To my understanding, the fact that certain kernels (e.g., the RBF kernel) are said to lift your original feature space to an “infinite-dimensional feature space” comes from the fact that such a kernel can be written as an &lt;em&gt;infinite series&lt;/em&gt;, i.e., an inner product between infinite-dimensional feature maps. This is valid when the series converges (the assumption here is that each feature map is in $\ell^2$).&lt;/p&gt;

&lt;p&gt;If we are given an RKHS, then the key tool we can leverage, which comes for free with the RKHS, is the reproducing property of said RKHS. The reproducing property of the RKHS says that the evaluation of any &lt;em&gt;function&lt;/em&gt; in the RKHS at a point $x \in \mathscr{X}$ is the inner product of the &lt;em&gt;function&lt;/em&gt; with the corresponding &lt;em&gt;reproducing kernel&lt;/em&gt; of the RKHS at the point. The Riesz representation theorem guarantees for us that such a reproducing kernel exists.&lt;/p&gt;

&lt;p&gt;Lorenzo showed us how we can start with feature maps and arrive at RKHSs. That is, if you aren’t given an RKHS, you can instead start by defining a kernel as the inner product $k(x,y) := \langle \phi(x), \phi(y) \rangle$ between feature maps. Equivalently, we can define a kernel as a symmetric, positive definite function $k : \mathscr{X} \times \mathscr{X} \rightarrow \mathbb{R}$, and a theorem (Moore-Aronszajn) tells us that there’s a corresponding RKHS for which $k$ is a reproducing kernel. However, in the latter case, there are infinitely many feature maps corresponding to the symmetric, positive definite function we started with.&lt;/p&gt;
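&lt;p&gt;A tiny numpy check of the “kernel as inner product of feature maps” view (this example is mine, not from the lecture): the homogeneous quadratic kernel on $\mathbb{R}^2$ has an explicit 3-dimensional feature map.&lt;/p&gt;

```python
import numpy as np

def poly_kernel(x, y):
    """Homogeneous quadratic kernel k(x, y) = (x . y)^2 on R^2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """One explicit feature map whose inner products reproduce the kernel."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
k_val = poly_kernel(x, y)               # kernel evaluated directly
ip_val = float(np.dot(phi(x), phi(y)))  # inner product of feature maps
```

&lt;p&gt;Both routes give the same number, and for a kernel like the RBF the analogous expansion is an infinite series rather than a 3-dimensional vector.&lt;/p&gt;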

&lt;h3 id=&quot;additional-resources-3&quot;&gt;Additional Resources&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/lecture4_introToRKHS.pdf&quot;&gt;Arthur Gretton’s course notes on RKHS&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.cs.toronto.edu/~duvenaud/cookbook/&quot;&gt;David Duvenaud’s notes on kernels&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;mcmc&quot;&gt;MCMC&lt;/h2&gt;

&lt;p&gt;Lecturer: Michael Betancourt [&lt;a href=&quot;https://github.com/mlss-2019/slides/tree/master/mcmc&quot;&gt;Slides&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/UzcLe-kpMDQ&quot;&gt;Video 1&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/drCwg49Ba_U&quot;&gt;Video 2&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;What is Bayesian computation?&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Specify a model&lt;/li&gt;
  &lt;li&gt;Compute posterior expectation values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first part of our MCMC lecture was all about having us update our mental priors about high-dimensional statistics. For example, although we may think that the mode of a probability density contributes most to an integral when integrating over parameter space, this falls apart quickly in higher dimensions. This is because the relative amount of volume the mode occupies in high dimensions rapidly disappears ($d\theta \approx 0$), so wherever the majority of the probability mass sits while occupying a non-trivial amount of volume ($d\theta \gg 0$) contributes most to the integral. We learned that in higher dimensions probability mass concentrates on a thin, hard-to-find hypersurface called the &lt;em&gt;typical set&lt;/em&gt;.&lt;/p&gt;
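&lt;p&gt;A quick numpy experiment (my own) makes the typical set tangible: draw samples from a standard Gaussian in $d$ dimensions and look at their distance from the mode.&lt;/p&gt;

```python
import numpy as np

def shell_stats(d, n=5000, seed=0):
    """Scaled mean radius and relative spread of N(0, I_d) samples."""
    rng = np.random.default_rng(seed)
    radii = np.linalg.norm(rng.standard_normal((n, d)), axis=1)
    return radii.mean() / np.sqrt(d), radii.std() / radii.mean()

# As d grows, samples concentrate in a thin shell at radius ~sqrt(d):
# the mode (radius 0) is nowhere near where the mass is.
scaled_mean, rel_spread = shell_stats(1000)
```

&lt;p&gt;For $d=1000$ the mean radius is almost exactly $\sqrt{d}$ and the relative spread is only a few percent: nearly all of the mass sits far from the mode, on the typical set.&lt;/p&gt;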

&lt;h3 id=&quot;what-about-variational-methods&quot;&gt;What about variational methods?&lt;/h3&gt;

&lt;p&gt;Variational inference, which replaces integration with an optimization problem and makes simplistic assumptions about the variational posterior, underestimates the variance and favors solutions that land inside the target typical set. Yikes!&lt;/p&gt;

&lt;h3 id=&quot;markov-chains&quot;&gt;Markov chains&lt;/h3&gt;

&lt;p&gt;Monte Carlo methods use samples from the exact target typical set. Markov chains are great because a Markov transition that targets a particular distribution will naturally concentrate towards its probability mass. In practice, Michael showed us that the chain quickly converges onto the typical set. The main challenge of MCMC methods seems to be handling various failure modes that cause divergences or cases where the chain gets stuck somewhere within the typical set.&lt;/p&gt;

&lt;h3 id=&quot;diagnosing-your-mcmc&quot;&gt;Diagnosing your MCMC&lt;/h3&gt;

&lt;p&gt;From the law of large numbers, we get that Markov chains are asymptotically consistent estimators. However, due to various pathologies, we need more sophisticated diagnostic tools and algorithms to get good finite time behavior. One diagnostic tool is the effective sample size (ESS). The more anti-correlated the samples from the Markov chain are, the higher the ESS.&lt;/p&gt;

&lt;p&gt;A robust strategy is to run multiple chains from various initial states and compare the expectations. Another strategy is to check the potential scale reduction factor, or R-hat, which is similar to an ANOVA between multiple chains.&lt;/p&gt;
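&lt;p&gt;Here is a crude sketch (my own, not the estimator from the lecture or from Stan) of an autocorrelation-based ESS for a single chain; a strongly positively correlated chain should report an ESS far below the raw number of samples:&lt;/p&gt;

```python
import numpy as np

def effective_sample_size(chain):
    """Crude ESS: n divided by (1 + 2 * sum of positive autocorrelations)."""
    n = len(chain)
    x = chain - chain.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
    tau = 1.0
    for rho in acf[1:]:
        # Truncate at the first non-positive autocorrelation (a common heuristic).
        if np.less_equal(rho, 0.0):
            break
        tau += 2.0 * rho
    return n / tau

# An AR(1) chain with strong positive correlation: its ESS should be
# far smaller than the number of raw samples.
rng = np.random.default_rng(1)
n, phi = 5000, 0.9
chain = np.empty(n)
chain[0] = 0.0
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + rng.standard_normal()
ess = effective_sample_size(chain)
```

&lt;p&gt;Production implementations (e.g., in Stan) use more careful windowing and combine multiple chains, but the idea is the same.&lt;/p&gt;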

&lt;h3 id=&quot;hamiltonian-monte-carlo&quot;&gt;Hamiltonian Monte Carlo&lt;/h3&gt;

&lt;p&gt;HMC is an alternative to Metropolis-Hastings MCMC. One issue with MH is that a naive Normal proposal density will almost always jump outside of the typical set, because most of the volume lies outside. Lowering the std dev of the proposal to avoid this means convergence will be too slow.&lt;/p&gt;

&lt;p&gt;We want to efficiently explore the typical set without leaving it. Insight: align a vector field to the typical set and then integrate along the field. HMC is the formal procedure for adding just enough &lt;em&gt;momentum&lt;/em&gt; to the gradient vector field to align the gradient field with the typical set. The analogy Michael used is that we need to add the momentum needed for a satellite to enter orbit around the earth.&lt;/p&gt;

&lt;p&gt;How to do this? Use the measure-preserving Hamiltonian flow generated by a scalar function $H(p,q)$ of the parameters $q$ and momentum $p$, where $H(p,q) = -\log \pi(p|q) - \log \pi(q)$. Here, $\pi(q)$ is the target density, which is specified by the problem. $\pi(p|q)$ is a conditional density for the momentum, which we choose. According to Michael, this is the only way of producing the gradient vector field we want to integrate along that actually works (this is a magic, hand-wavy thing).&lt;/p&gt;

&lt;p&gt;The Markov transitions now consist of randomly sampling a &lt;em&gt;Hamiltonian trajectory&lt;/em&gt;. Motion is governed by the above Hamiltonian, which has a kinetic and potential energy term. From the current location on the typical set, we randomly sample the momentum and the trajectory length, and then use an ODE numerical integrator to compute where we end up. I picture this like a marble being released with some initial momentum inside a bowl-shaped surface; wherever the marble finally rests after rolling around for a while is analogous to the next sample from the posterior we draw and is the next state of the chain.&lt;/p&gt;

&lt;p&gt;Numerical integration introduces some errors, so we have to be careful about the type of integration we use and how we correct for the introduced bias in our estimator. Use &lt;a href=&quot;https://mc-stan.org/&quot;&gt;Stan&lt;/a&gt; in practice for your MCMC needs.&lt;/p&gt;
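&lt;p&gt;To make the procedure concrete, here is a toy HMC transition with a leapfrog integrator (all names and settings are my own choices; use Stan rather than anything like this in practice):&lt;/p&gt;

```python
import numpy as np

def hmc_step(q, grad_log_prob, log_prob, step_size=0.1, n_leapfrog=20, rng=None):
    """One HMC transition targeting pi(q), with unit-mass Gaussian momentum.

    Sample a momentum, simulate Hamiltonian dynamics with the leapfrog
    integrator, then Metropolis-correct for the numerical integration error.
    """
    if rng is None:
        rng = np.random.default_rng()
    p = rng.standard_normal(q.shape)
    q_new, p_new = q.copy(), p.copy()
    # Leapfrog integration of dq/dt = p, dp/dt = grad log pi(q).
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new
        p_new += step_size * grad_log_prob(q_new)
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    # Accept or reject based on the change in total energy H = -log pi(q) + |p|^2 / 2.
    h_old = -log_prob(q) + 0.5 * p @ p
    h_new = -log_prob(q_new) + 0.5 * p_new @ p_new
    if np.less(np.log(rng.random()), h_old - h_new):
        return q_new
    return q

# Target: a standard 2-d Gaussian, so log pi(q) = -|q|^2 / 2 up to a constant.
rng = np.random.default_rng(2)
q = np.zeros(2)
samples = []
for _ in range(2000):
    q = hmc_step(q, lambda x: -x, lambda x: -0.5 * x @ x, rng=rng)
    samples.append(q.copy())
samples = np.array(samples)
```

&lt;p&gt;Sampling a standard 2-d Gaussian this way, the empirical moments of the chain should match the target closely after a couple thousand transitions.&lt;/p&gt;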

&lt;h3 id=&quot;additional-resources-4&quot;&gt;Additional Resources&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1701.02434&quot;&gt;A Conceptual Introduction to Hamiltonian Monte Carlo - Betancourt&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://betanalpha.github.io/writing/&quot;&gt;Michael Betancourt’s blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;fairness&quot;&gt;Fairness&lt;/h2&gt;

&lt;p&gt;Lecturer: Timnit Gebru [&lt;a href=&quot;https://github.com/mlss-2019/slides/tree/master/fairness&quot;&gt;Slides&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/7uV_VohAwnw&quot;&gt;Video&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;Timnit gave an incredibly engaging lecture, which also doubled as a nice break from brain-bending equations. Here are some key takeaways:&lt;/p&gt;

&lt;p&gt;Timnit talked about “Fairwashing”, which is also discussed by Zachary Lipton in his &lt;a href=&quot;https://twimlai.com/twiml-talk-285-fairwashing-and-the-folly-of-ml-solutionism-with-zachary-lipton/&quot;&gt;recent TWiML interview&lt;/a&gt;. Basically, it is not enough to take a contentious research topic and simply try to “make it fair”. The idea of “make X fair” misses the point: we need to step back and critically consider the implications of doing the research in the first place. Ask: should we even be building this technology? And then the follow-up questions: if we don’t, will someone else build it anyways? If so, what do we do?&lt;/p&gt;

&lt;p&gt;We learned about her PhD project where they took Google street view images and ran computer vision data mining on images of cars to extract patterns about the US population. Since I grew up in Jacksonville, FL, I thought I’d include this slide showing that it is the 10th most segregated city (measured by disparity in average car prices) in the US. Yikes!&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/MLSS/jacksonville.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;The red in the center is San Marco/Riverside. Duuuval&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;When the data is biased (and data collected out in the real world reflects the current state of the society from which it was extracted), model predictions will be biased. Some examples we saw: using ML to predict who to hire, placing current ML techniques in the hands of ICE, predictive policing, gender classification, and facial recognition.&lt;/p&gt;

&lt;p&gt;Science is inherently political. We all have a responsibility to be cognizant of the &lt;a href=&quot;https://openai.com/blog/preparing-for-malicious-uses-of-ai/&quot;&gt;dual-use nature&lt;/a&gt; of the technologies we create. Even people working on “theory” must adhere to this. The people working on the foundations of our modern statistics (&lt;a href=&quot;https://en.wikipedia.org/wiki/Ronald_Fisher#Eugenics&quot;&gt;like Fisher&lt;/a&gt;) were eugenicists!&lt;/p&gt;

&lt;h3 id=&quot;additional-resources-5&quot;&gt;Additional Resources&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1803.09010&quot;&gt;Datasheets for Datasets - Gebru et al.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1810.03993&quot;&gt;Model Cards for Model Reporting - Mitchell et al.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://mitpress.mit.edu/books/sorting-things-out&quot;&gt;Sorting Things Out: Classification and its Consequences - Bowker and Star&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ali-alkhatib.com/blog/anthropological-intelligence&quot;&gt;Anthropological/AI &amp;amp; the HAI - Ali Alkhatib&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.americaunderwatch.com/&quot;&gt;America Under Watch - Garvie, Moy&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;submodularity&quot;&gt;Submodularity&lt;/h2&gt;

&lt;p&gt;Lecturer: Stefanie Jegelka [&lt;a href=&quot;https://github.com/mlss-2019/slides/tree/master/submodular&quot;&gt;Slides&lt;/a&gt;] [&lt;a href=&quot;https://videoken.com/embed/P2mcH0RoPD0&quot;&gt;Video&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;This was a fairly niche topic and most people hadn’t heard of it. Essentially, it is a technique in combinatorial optimization for optimal subset selection. Subset selection is clearly relevant to ML, e.g., for active learning, coordinate selection, compression, and so on. However, optimizing a &lt;em&gt;set function&lt;/em&gt; $F(S)$ over subsets $S \subseteq \mathscr{V}$ of a ground set $\mathscr{V}$ is non-trivial; sets are discrete objects, and a ground set of $n$ elements has $2^n$ possible subsets!&lt;/p&gt;

&lt;p&gt;It turns out there is a specific class of functions called &lt;em&gt;submodular functions&lt;/em&gt; that can be optimized in a reasonable amount of time. These functions have a property called &lt;em&gt;submodularity&lt;/em&gt;. In words: if you have subsets $T \subseteq S$, then adding a new element $a \notin S$ to the “smaller” subset $T$ increases $F$ by at least as much as adding it to $S$. This is also referred to as diminishing returns. Written out, it is&lt;/p&gt;

\[F(T \cup \{a\}) - F(T) \geq F(S \cup \{a\}) - F(S), \quad T \subseteq S, \; a \notin S.\]

&lt;p&gt;You can also think about $F$ like a concave function, such that its “discrete derivative” is non-increasing.&lt;/p&gt;

&lt;p&gt;An example of a submodular function is the coverage function: given $A_1, \dots, A_n \subseteq U$,&lt;/p&gt;

\[F(S) = \Bigl | \bigcup_{j \in S} A_j \Bigr |.\]

&lt;p&gt;As the coverage $F(S)$ increases, the increase in the size of the union achieved by adding another index $j’$ to $S$ gets smaller.&lt;/p&gt;
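&lt;p&gt;A toy Python check of diminishing returns for the coverage function, with made-up sets $A_j$:&lt;/p&gt;

```python
# Hypothetical sets A_j over a small universe U.
A = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {1, 6}}

def coverage(S):
    """F(S) = number of elements covered by the union of the chosen A_j."""
    covered = set()
    for j in S:
        covered.update(A[j])
    return len(covered)

def marginal_gain(S, j):
    return coverage(S | {j}) - coverage(S)

# Diminishing returns: adding j = 3 to the smaller set T helps at least
# as much as adding it to the larger set S (here T is a subset of S).
T, S = {1}, {1, 2}
gain_T = marginal_gain(T, 3)
gain_S = marginal_gain(S, 3)
```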

&lt;h3 id=&quot;convex-or-concave&quot;&gt;Convex or concave?&lt;/h3&gt;

&lt;p&gt;It is not obvious whether submodular functions are more like convex or concave functions, as they share similarities with both. The fact that there are similarities hints at why they are interesting, since convex and concave functions are “nice” to optimize in continuous optimization. We already mentioned the non-increasing discrete derivative property, which is similar to concave functions. The similarity to convexity comes from the fact that it is easier to &lt;em&gt;minimize&lt;/em&gt; a submodular function than it is to maximize one (maximization, e.g., Max Cut, is NP-hard).&lt;/p&gt;

&lt;h3 id=&quot;greedy-algorithms&quot;&gt;Greedy algorithms&lt;/h3&gt;

&lt;p&gt;We discussed greedy algorithms for maximizing submodular functions during the lecture. The basic algorithm for finding&lt;/p&gt;

&lt;p&gt;\(\max_{S} F(S) \texttt{ s.t. } |S| \leq k\)
 is as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;$S_0 = \emptyset$&lt;/li&gt;
  &lt;li&gt;$\texttt{for } i=0,\dots,k-1$
    &lt;ol&gt;
      &lt;li&gt;$e^* = \texttt{arg max}_{e \in \mathscr{V} \setminus S_i} F(S_i \cup \{e\})$&lt;/li&gt;
      &lt;li&gt;$S_{i+1} = S_i \cup \{e^*\}$&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works because $F$ is submodular: the marginal improvement of adding a single item gives information about the global value. Furthermore, $F$ is assumed to be monotone, in that adding items can never decrease $F$. Lots more on this in the slides.&lt;/p&gt;
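&lt;p&gt;The steps above can be sketched in a few lines of Python (the coverage function and sets $A_j$ here are hypothetical):&lt;/p&gt;

```python
def greedy_max(F, ground_set, k):
    """Greedy maximization of a monotone submodular F, subject to a size-k budget."""
    S = set()
    for _ in range(k):
        # Pick the element with the largest marginal gain F(S + e) - F(S).
        best = max(ground_set - S, key=lambda e: F(S | {e}) - F(S))
        S = S | {best}
    return S

# Hypothetical coverage function: F(S) is the size of the union of the A_j.
A = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {1, 6}}
F = lambda S: len(set().union(*(A[j] for j in S))) if S else 0
chosen = greedy_max(F, set(A), 2)
```

&lt;p&gt;For monotone submodular $F$, this greedy solution is guaranteed to be within a factor $1 - 1/e$ of the optimum (the classic Nemhauser, Wolsey, and Fisher result), which is why such a simple algorithm is so popular.&lt;/p&gt;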

&lt;h3 id=&quot;continuous-relaxation-for-minimization&quot;&gt;Continuous relaxation for minimization&lt;/h3&gt;

&lt;p&gt;What if we want to minimize a submodular function? We can consider a continuous relaxation of $F$ to a convex function. The Lovasz extension provides a method for relaxing a function defined on the Boolean hypercube to $[0,1]^N$. The result is a piecewise-linear convex function $f(z)$ that can be optimized using subgradient methods, and a final result can be obtained by rounding.&lt;/p&gt;
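&lt;p&gt;The Lovasz extension itself is easy to compute: sort the coordinates of $z$ in decreasing order and take a weighted sum of discrete derivatives of $F$ along that ordering. A small sketch (my own toy example):&lt;/p&gt;

```python
import numpy as np

def lovasz_extension(F, z):
    """f(z): sort coordinates of z in decreasing order and accumulate a weighted
    sum of the discrete derivatives of F along that ordering."""
    order = np.argsort(-z)
    total, prev, S = 0.0, F(frozenset()), set()
    for i in order:
        S.add(int(i))
        curr = F(frozenset(S))
        total += z[i] * (curr - prev)
        prev = curr
    return total

# Hypothetical coverage function; on 0/1 indicator vectors the extension
# must agree with F itself.
A = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}}
F = lambda S: len(set().union(*(A[j] for j in S))) if S else 0
z = np.array([1.0, 0.0, 1.0])  # indicator vector of the subset {0, 2}
val = lovasz_extension(F, z)
```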

&lt;h3 id=&quot;additional-resources-6&quot;&gt;Additional Resources&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://theory.stanford.edu/~jvondrak/data/submod-tutorial-1.pdf&quot;&gt;Other intro tutorial slides - Vondrak&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://theory.stanford.edu/~jvondrak/CS369P/lec17.pdf&quot;&gt;More on the Lovasz Extension - Vondrak&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://melodi.ee.washington.edu/~bilmes/mypubs/iyer2015-spps.pdf&quot;&gt;Submodular Point Processes with Applications to Machine Learning - Iyer, Bilmes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1701.08939&quot;&gt;Deep Submodular Functions - Bilmes, Bai&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Thu, 15 Aug 2019 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/blog/2019/08/15/mlss-2019.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/blog/2019/08/15/mlss-2019.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Learning From Demonstrations in the Wild</title>
        <description>&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1811.03516v1&quot;&gt;Behbahani et al., 2018&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;The motivation behind this work is to develop an automated process for learning the behaviors of road users from large amounts of unlabeled video data. A generative model (trained policy) of road user behavior could be used within a larger traffic scene understanding pipeline. In this paper, they propose Horizon-GAIL, an imitation-learning algorithm based on GAIL that stabilizes learning from demonstration (LfD) over long horizons. Expert policy demonstrations are provided by a slightly improved Deep SORT tracker, and they use PPO as the “student” RL algorithm. The Unity game engine is used to build an RL env that mimics the scene from the real-world environment to roll out their PPO Horizon-GAIL agent. Their experiments are on 850 minutes of traffic camera data of a large roundabout. By using a curriculum where the episode horizon is extended by 1 timestep each training epoch, they demonstrated that Horizon-GAIL can match the expert policy’s state/action distribution much more closely than GAIL, PS-GAIL, and behavior cloning while also improving on training stability.&lt;/p&gt;

&lt;h2 id=&quot;observations&quot;&gt;Observations&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The ability to auto-generate the Unity env from Google Maps would be crucial to scaling this technique up. maps2sim?&lt;/li&gt;
  &lt;li&gt;They provided an empirical comparison of DeepSORT with ViBe’s vision tracker, and showed that running the Kalman Filter in 3D space improved Deep SORT’s performance in multiple multi-object tracking metrics by a few percentage points&lt;/li&gt;
  &lt;li&gt;Each road user is modeled independently, i.e., the policy does not explicitly account for other agents in the environment. It looks like the same policy is used for learning both vehicle and pedestrian behavior, although because of the Mask R-CNN detector, they are able to differentiate between the two classes. In scenarios where the behaviors exhibited by road users can be highly unpredictable and diverse (e.g., a busy traffic intersection with heavy pedestrian presence), a hierarchical policy that conditions on the inferred object class could perhaps be useful.&lt;/li&gt;
  &lt;li&gt;Interesting future work might include incorporating multi-agent modeling in the RL framework for more complex traffic scenarios.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 26 Dec 2018 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/paper-summaries/computer-vision/2018/12/26/LfD-in-the-wild.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/paper-summaries/computer-vision/2018/12/26/LfD-in-the-wild.html</guid>
        
        
        <category>paper-summaries</category>
        
        <category>computer-vision</category>
        
      </item>
    
      <item>
        <title>Addressing Function Approximation Error in Actor-Critic Methods &amp; Discriminator-Actor-Critic- Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning</title>
        <description>&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1802.09477&quot;&gt;Fujimoto et al., 2018&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/1809.02925v2&quot;&gt;Kostrikov et al., 2018&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;I discuss two recent related papers in the Deep RL literature in this post. The first paper, by Fujimoto et al., introduces techniques for reducing bias and variance in a popular actor-critic method, Deep Deterministic Policy Gradient (DDPG). The second paper, by Kostrikov et al., makes a similar contribution by evaluating and addressing bias and variance in inverse RL. Both of these papers take widely used Deep RL algorithms, empirically and theoretically demonstrate specific weaknesses, and suggest reasonable improvements. These are valuable studies that help develop a better understanding of Deep RL.&lt;/p&gt;

&lt;h3 id=&quot;addressing-function-approx-error-in-ac-methods&quot;&gt;Addressing Function Approx. Error in AC Methods&lt;/h3&gt;

&lt;p&gt;If you are unfamiliar with DDPG, you can check out &lt;a href=&quot;https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html&quot;&gt;my blog post&lt;/a&gt; on the algorithm. The most important thing to know is that the success of the whole algorithm relies on having a critic network that can accurately estimate $Q$-values. The only signal the actor network gets in its gradient to help it achieve higher rewards comes from the gradient of the critic wrt the actions selected by the actor. If the critic gradient is biased, then the actor will fail to learn anything!&lt;/p&gt;

&lt;p&gt;In Section 4, the authors begin by empirically demonstrating the overestimation bias present in the critic network (action-value estimator) in DDPG. They show that the overestimation bias essentially stems from the fact that DPG algorithms have to approximate both the policy and the value functions, and the approximate policy is maximized in the gradient direction provided by the approximate value function (rather than the true value function). Then, inspired by Double Q-Learning, they introduce a technique they call “clipped Double Q-Learning in AC” for achieving the same idea. Basically, the critic target becomes
\(y_1 = r + \gamma \min_{i=1,2} Q_{\theta&apos;_i} (s&apos;, \pi_{\theta_1}(s&apos;)).\)
This requires introducing a second critic. The min makes it so that it is possible to underestimate Q-values, but this is preferable to overestimation.&lt;/p&gt;
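&lt;p&gt;The clipped target is simple to write down in code; here is a numpy sketch (the &lt;code&gt;done&lt;/code&gt; masking is standard practice, not something spelled out above):&lt;/p&gt;

```python
import numpy as np

def clipped_double_q_target(r, gamma, q1_next, q2_next, done):
    """TD3-style critic target: take the min over two target-critic estimates.

    q1_next and q2_next play the role of the two target critics evaluated at
    (s', pi(s')); the done flag zeroes out the bootstrap term at episode ends.
    """
    q_min = np.minimum(q1_next, q2_next)
    return r + gamma * (1.0 - done) * q_min

# If one critic overestimates (12.5 vs. 10.0), the min keeps the target conservative.
y = clipped_double_q_target(r=1.0, gamma=0.99, q1_next=10.0, q2_next=12.5, done=0.0)
```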

&lt;p&gt;Then, to help with variance reduction, they suggest:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Delay updating the actor network until the critic network has almost converged&lt;/li&gt;
  &lt;li&gt;Add some noise to the actions selected by the actor network when updating the critic to help regularize the critic, reminiscent of Expected SARSA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their experimental results on MuJoCo (they call their algorithm TD3) suggest these improvements are very effective, outperforming PPO, TRPO, ACKTR, and others.&lt;/p&gt;

&lt;h3 id=&quot;discriminator-actor-critic-addressing-sample-inefficiency-and-reward-bias-in-adversarial-imitation-learning&quot;&gt;Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning&lt;/h3&gt;
&lt;p&gt;EDIT: The title of the paper was previously “Addressing Sample Inefficiency and Reward Bias in Inverse RL”&lt;/p&gt;

&lt;p&gt;Seemingly inspired by the former, this paper recently came out exploring inverse RL—specifically, generative adversarial imitation learning (GAIL) and behavioral cloning. In GAIL, the discriminator learns to distinguish between transitions sampled from an expert and those from a trained policy. The policy is rewarded for confusing the discriminator. However, GAIL is typically quite sample inefficient.&lt;/p&gt;

&lt;p&gt;One way the authors suggest to help with the sample inefficiency of GAIL is by using off-policy RL instead of on-policy RL. They modify the GAIL objective to be 
\(\max_D \mathcal{E}_\mathscr{R} [\log(D(s,a))] + \mathcal{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi).\)
Here, $\pi_E$ is the expert policy, from which expert trajectories are sampled, and $\mathscr{R}$ is the replay buffer, which holds transitions from roughly all previous policies. They ignore the importance sampling term in practice. Since TD3 is technically a deterministic policy gradient algorithm, I’m assuming one way to implement this importance sampling would be to have the actor output the mean of a multivariate Gaussian—this Gaussian could then be used to define the entropy term of the policy and the importance sampling ratio. This is fairly common for continuous control tasks like MuJoCo. The authors note that the importance sampling wasn’t used in practice, however.&lt;/p&gt;

&lt;p&gt;They further analyzed different reward functions for GAIL, and show that certain GAIL reward functions can actually &lt;em&gt;inhibit&lt;/em&gt; learning depending on the particular MDP (e.g., if the environment has a survival bonus or penalty). To create a more robust reward function that will learn the expert policies, &lt;em&gt;they suggest explicitly learning rewards for absorbing states of the MDP&lt;/em&gt;. They implement this by adding an indicator to these particular states so that the GAIL discriminator can identify whether reaching an absorbing state is desirable from the perspective of the expert. In the &lt;a href=&quot;https://openreview.net/pdf?id=Hk4fpoA5Km&quot;&gt;OpenReview thread&lt;/a&gt;, one reviewer makes sure to point out that the problems with inverse RL algorithms highlighted in the paper are due to incorrect implementations of the MDP, rather than shortcomings of the algorithms themselves (&lt;a href=&quot;https://openreview.net/forum?id=Hk4fpoA5Km&amp;amp;noteId=B1gGZfAq6Q&quot;&gt;see this comment in particular&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Very interestingly, they used VR to generate expert trajectories of gripping blocks with a Kuka arm. This environment has a per-step penalty, and the normal GAIL reward fails to learn the expert policy due to the learning inhibition caused by the reward function bias. The proposed method learns to imitate the expert quickly due to its added reward for absorbing states.&lt;/p&gt;

&lt;p&gt;It would be great to investigate the effects of using off-policy samples in the objective more carefully (why exactly does importance sampling not matter?). The absorbing-state reward being so useful is surprising, and it should be helpful in future applications where GAIL is used for inverse RL.&lt;/p&gt;
</description>
        <pubDate>Thu, 13 Sep 2018 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/paper-summaries/deep-rl/2018/09/13/addressing-challenges-in-deep-rl.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/paper-summaries/deep-rl/2018/09/13/addressing-challenges-in-deep-rl.html</guid>
        
        
        <category>paper-summaries</category>
        
        <category>deep-RL</category>
        
      </item>
    
      <item>
        <title>RUDDER: Return Decomposition for Delayed Rewards</title>
        <description>&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1806.07857v1&quot;&gt;Arjona-Medina, Gillhofer, et al., 2018&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;intro&quot;&gt;Intro&lt;/h2&gt;

&lt;p&gt;Delayed rewards are one of the fundamental challenges of reinforcement learning. This paper proposes an algorithm for converting an MDP with delayed rewards into a different MDP with equivalent optimal policies where the delayed reward is converted into immediate rewards. They accomplish this with “reward redistribution”—a potential solution to the RL credit assignment problem.&lt;/p&gt;

&lt;p&gt;Basically, they argue that reward redistribution enables Monte Carlo (MC) and temporal-difference (TD) methods to be applied under delayed rewards without high variance or biased Q-values. This is important because TD methods are the most commonly used RL approach, and it is proven in this paper that with delayed rewards, TD methods take time exponential in the length of the delay to remove the bias.&lt;/p&gt;

&lt;h3 id=&quot;some-details-about-rudder&quot;&gt;Some details about RUDDER&lt;/h3&gt;

&lt;p&gt;One of the main obstacles addressed in Section 3 is proving that the new MDP with redistributed rewards has the same optimal policies as the original MDP. I found this section (and the previous one) unnecessarily hard to read, but here is my take on the key points:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Because the delayed reward MDP obeys the Markov property, the transformed MDP with equivalent optimal policies but immediate rewards needs an altered set of states. This altered set of states obeys the Markov property for predicting the immediate rewards but, crucially, not for the delayed reward. This is accomplished by using differences between states $\triangle (s, s')$ in the delayed reward MDP as the new set of states&lt;/li&gt;
  &lt;li&gt;The return of the delayed reward MDP at the end of an episode should be predicted by a function $g$ (the LSTM) that can be decomposed into a sum over the state, action pairs of the episode&lt;/li&gt;
  &lt;li&gt;It should hold that the partial sums that all together sum to $g$, $\sum_{\tau = 0}^{t} h(a_{\tau}, \triangle(s_{\tau}, s_{\tau+1}))$, equal the Q-values at time $t$ for the reward redistribution to be optimal&lt;/li&gt;
  &lt;li&gt;This approach requires strong exploration that can actually uncover the delayed reward. If a robot gets a +1 for doing the dishes correctly and +0 reward otherwise, it may &lt;em&gt;never&lt;/em&gt; even see the +1 in the first place!&lt;/li&gt;
  &lt;li&gt;To help with the above, they use a “lessons replay buffer” to store episodes where the delayed reward was observed, to sample from when computing gradients for improved data efficiency&lt;/li&gt;
  &lt;li&gt;An LSTM is used for the return prediction, and techniques for credit assignment in deep learning like layer-wise relevance propagation and integrated gradients are used for “backwards analysis” to accomplish the reward redistribution. This gets pretty complex, and the appendix goes into detail about how they modified the LSTM cell for this (something called a “monotonic LSTM”)&lt;/li&gt;
  &lt;li&gt;They have to introduce other auxiliary tasks to encourage the return prediction LSTM network to learn the optimal reward redistribution and get around the Markov property issue mentioned earlier. These auxiliary tasks are to predict the Q-values, the reward over the next 10 steps, and the reward accumulated since time 0&lt;/li&gt;
&lt;/ul&gt;
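&lt;p&gt;My reading of the decomposition in the second and third bullets, as a minimal sketch: if $g_t$ is the network’s running prediction of the final return after step $t$, the redistributed immediate reward is the difference of consecutive predictions, so the partial sums recover the predictions (and, for an optimal redistribution, the Q-values):&lt;/p&gt;

```python
# Sketch of return decomposition via consecutive prediction differences.
# g[t] is a (hypothetical) running prediction of the episode's final return
# after step t; the redistributed reward at step t is g[t] - g[t-1].
def redistribute_from_predictions(g):
    rewards, prev = [], 0.0
    for g_t in g:
        rewards.append(g_t - prev)
        prev = g_t
    return rewards

g = [0.0, 2.0, 2.0, 10.0]             # running return predictions
r = redistribute_from_predictions(g)  # [0.0, 2.0, 0.0, 8.0]
assert sum(r) == g[-1]                # partial sums recover the predictions
```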

&lt;h2 id=&quot;experimental-results&quot;&gt;Experimental results&lt;/h2&gt;

&lt;p&gt;The authors note that RUDDER is aimed at RL problems with (i) delayed rewards, (ii) no distractions from other rewards or changing characteristics of the environment, and (iii) no skills that must be learned to receive the delayed reward. (iii) seems to imply that RUDDER could be combined with hierarchical RL. (ii) suggests that when there are sporadic rewards throughout an episode, the improvements from RUDDER might be diminished. Hence, the environments they experiment with all have heavily delayed rewards observed only at the end of the episode, and their results are naturally impressive. I understand the difficulty (high computational/$$$ cost) of evaluating on all 50-something Atari games, but it would be really nice to see how it fares on games that aren’t perfectly set up for RUDDER.&lt;/p&gt;

&lt;p&gt;They also note that human demonstrations could be used to fill the lessons buffer, which might further improve recent &lt;a href=&quot;https://arxiv.org/abs/1805.11592&quot;&gt;impressive imitation learning results&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;grid-world&quot;&gt;Grid world&lt;/h3&gt;

&lt;p&gt;They compute an optimal policy for a simple grid world with a delayed reward MDP. Then they compute the bias and variance, via the mean squared error (MSE) of estimated vs. optimal Q-values, for MC and TD estimators under policy evaluation. By varying the delay, they demonstrate the exponential dependency of the MC and TD methods on the reward delay length. RUDDER shows favorable reductions in both the bias and the variance of the estimated Q-values at all delay lengths.&lt;/p&gt;
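&lt;p&gt;For reference, curves like these can be computed along the following lines (a generic sketch of the decomposition, not their exact setup): repeat policy evaluation many times, then decompose the MSE of the Q estimates against the known optimal Q-values:&lt;/p&gt;

```python
# Bias/variance decomposition of Q-value estimates against a known optimum:
# MSE = bias^2 + variance. "estimates" would come from repeated runs of
# MC or TD policy evaluation on the grid world.
def bias_variance(estimates, q_true):
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - q_true
    var = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - q_true) ** 2 for e in estimates) / n
    return bias, var, mse

bias, var, mse = bias_variance([0.8, 1.2, 1.0, 1.4], q_true=1.0)
assert round(mse, 9) == round(bias ** 2 + var, 9)
```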

&lt;h3 id=&quot;charge-discharge&quot;&gt;Charge-discharge&lt;/h3&gt;

&lt;p&gt;A simple 2-state, 2-action MDP with episodes of variable length and a delayed reward obtained only at the end of the episode. RUDDER augmenting a Q-learning agent is much more data efficient than pure Q-learning, an MC agent, and MCTS.&lt;/p&gt;

&lt;h2 id=&quot;atari-evaluation&quot;&gt;Atari evaluation&lt;/h2&gt;

&lt;p&gt;The most impressive results of the paper are the scores for the Venture and Bowling Atari games; they set a new SOTA on Venture.
They augment &lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;PPO&lt;/a&gt; with the return prediction network, paired with integrated gradients for backwards analysis. The lessons buffer is used to train both the return prediction network and the actor-critic (for PPO). I’m not sure about that, because PPO is an on-policy algorithm; wouldn’t using episodes from the lessons buffer bias the policy gradient? As input, they use the difference between two consecutive frames for $\triangle(s, s')$, plus the current frame (for seeing static objects).&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/rudder-arch.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;NN architectures for RUDDER augmenting PPO&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/rudder.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;Visual depiction of the reward redistribution&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;They show that RUDDER is much more data efficient than APE-X, DDQN, Noisy-DQN, Rainbow, and other SOTA Deep RL approaches on these two games. &lt;a href=&quot;https://goo.gl/EQerZV&quot;&gt;Video&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;closing-thoughts&quot;&gt;Closing thoughts&lt;/h2&gt;

&lt;p&gt;I think this is an exciting RL paper that could have a major impact, mainly because I can see lots of future research refining its ideas and then applying them to a variety of domains, especially robotics. The paper could use polishing (some parts are not easy to read) and more experiments showing how it performs on MDPs with both delayed rewards and other “distractor” rewards. I think that setting is the most common, as it’s usually possible to do a little reward shaping to augment sparse/delayed-reward MDPs. The appendix goes into the engineering details of how they got this to work, and it seems like that process, plus how the hyperparameters, auxiliary tasks, and choice of method for backwards analysis will need to change across environments, will have to be cleaned up before this approach is widely useful.&lt;/p&gt;
</description>
        <pubDate>Fri, 22 Jun 2018 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/paper-summaries/reinforcement-learning-theory/2018/06/22/rudder.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/paper-summaries/reinforcement-learning-theory/2018/06/22/rudder.html</guid>
        
        
        <category>paper-summaries</category>
        
        <category>reinforcement-learning-theory</category>
        
      </item>
    
      <item>
        <title>Z-Forcing: Training Stochastic Recurrent Networks</title>
        <description>&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;http://papers.nips.cc/paper/7248-z-forcing-training-stochastic-recurrent-networks&quot;&gt;Goyal, et al., 2017&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;intro&quot;&gt;Intro&lt;/h2&gt;

&lt;p&gt;A new training procedure for recurrent VAEs is proposed. Recall that for VAEs, we model a joint distribution over observations $x$ and latent variables $z$, and assume that $z$ is involved in the generation of $x$. This distribution is parameterized by $\theta$. Maximizing the marginal log-likelihood $\log p_{\theta}(x)$ with respect to $\theta$ is intractable because it requires integrating over $z$. Instead, we introduce a variational distribution $q_{\phi}(z|x)$ and maximize a lower bound on the marginal log-likelihood: the evidence lower bound (ELBO).&lt;/p&gt;
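&lt;p&gt;Concretely, for a diagonal Gaussian posterior and a standard normal prior, the ELBO is the expected log-likelihood minus an analytic KL term; a minimal numeric sketch (mine, not from the paper):&lt;/p&gt;

```python
import math

# ELBO = E_q[log p(x|z)] - KL( q(z|x) || p(z) ) for a diagonal Gaussian
# posterior q = N(mu, sigma^2) and prior p(z) = N(0, I), using the
# closed-form KL divergence between the two Gaussians.
def elbo(expected_log_likelihood, mu, log_var):
    kl = 0.5 * sum(math.exp(lv) + m ** 2 - 1.0 - lv
                   for m, lv in zip(mu, log_var))
    return expected_log_likelihood - kl

# When q(z|x) equals the prior, the KL term vanishes.
assert elbo(-10.0, mu=[0.0, 0.0], log_var=[0.0, 0.0]) == -10.0
```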

&lt;h3 id=&quot;stochastic-recurrent-networks&quot;&gt;Stochastic recurrent networks&lt;/h3&gt;

&lt;p&gt;When applying VAEs to sequences, it has been proposed to use recurrent networks for the recognition network (aka inference network, aka variational posterior) and the generation network (aka decoder, aka the conditional probability of the next observation given previous observations and latents). These probabilistic models can be autoregressive (in this paper, they use LSTMs with MLPs to predict the parameters of Gaussian distributions). It is common to model these conditional distributions with Gaussians for continuous variables or categoricals for discrete variables.&lt;/p&gt;

&lt;p&gt;Usually, the prior over latent variables is also learned with a parametric model.&lt;/p&gt;

&lt;p&gt;If I’m not mistaken, learning the parameters of these parametric models on a training data set and then using them at test time for fast inference is referred to as &lt;em&gt;amortized variational inference&lt;/em&gt;, which appears to have &lt;a href=&quot;http://gershmanlab.webfactional.com/pubs/GershmanGoodman14.pdf&quot;&gt;correlates in our cognition&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;z-forcing&quot;&gt;Z-forcing&lt;/h3&gt;

&lt;p&gt;Strong autoregressive decoders overpower the latent variables $z$, preventing the conditional distributions from learning complex multi-modal structure. To mitigate this, the authors add an auxiliary cost to the training objective: an extra parametric model $p_{\eta}(b | z)$ that “forces” the latents to be predictive of the hidden states $b$ of the “backward network” (the inference network).&lt;/p&gt;
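&lt;p&gt;Schematically (my reading, assuming a fixed-variance Gaussian for $p_{\eta}(b|z)$), the auxiliary cost is just the negative log-likelihood of the backward states under a prediction made from $z$, added to the usual ELBO terms:&lt;/p&gt;

```python
# Schematic z-forcing auxiliary cost: negative log-likelihood of the backward
# network's hidden state b under N(b; f(z), I), up to additive constants.
# f(z) (here `mean_from_z`) stands in for a small network mapping z to b-space.
def aux_loss(b, mean_from_z):
    return 0.5 * sum((bi - mi) ** 2 for bi, mi in zip(b, mean_from_z))

b = [0.5, -1.0]
assert aux_loss(b, b) == 0.0  # a perfect prediction incurs no auxiliary cost
assert aux_loss(b, [0.0, 0.0]) == 0.625
```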

&lt;h2 id=&quot;experiments&quot;&gt;Experiments&lt;/h2&gt;

&lt;p&gt;They validate the approach on speech modeling (TIMIT, Blizzard) and language modeling. The metric is average log-likelihood. On Sequential MNIST, z-forcing is competitive with “deeper” recurrent generative models like PixelRNN.&lt;/p&gt;

&lt;div&gt;
	&lt;div class=&quot;image-wrapper&quot;&gt;
	
	&lt;img src=&quot;/img/z-forcing-interp.png&quot; alt=&quot;&quot; width=&quot;&quot; height=&quot;&quot; /&gt;
	
	&lt;/div&gt;
	&lt;div&gt;
	
	&lt;p class=&quot;image-caption&quot;&gt;Some fun language modeling results interpolating the latent space&lt;/p&gt;
	
	&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;p&gt;It’s always worth asking whether increasing the complexity of an approach (here, an extra network and an auxiliary cost) is worth the effort compared to simpler approaches that get almost the same performance. The results on TIMIT and Blizzard are pretty convincing. The authors also suggest incorporating the auxiliary loss into PixelRNN/CNN in future work.&lt;/p&gt;

</description>
        <pubDate>Fri, 22 Jun 2018 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/paper-summaries/natural-language-processing/2018/06/22/z-forcing.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/paper-summaries/natural-language-processing/2018/06/22/z-forcing.html</guid>
        
        
        <category>paper-summaries</category>
        
        <category>natural-language-processing</category>
        
      </item>
    
      <item>
        <title>Tracking Occluded Objects and Recovering Incomplete Trajectories by Reasoning about Containment Relations and Human Actions</title>
        <description>&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } },
  tex2jax: {inlineMath: [[&apos;$&apos;,&apos;$&apos;], [&apos;\\(&apos;,&apos;\\)&apos;]]}
});
&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;
&lt;/script&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://pdfs.semanticscholar.org/4170/0dc1e60f5c8eaef409ef014f37c8b9e1b8cd.pdf&quot;&gt;Liang, Zhu, Zhu, 2018&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;This paper looks at tracking severely occluded objects in long video sequences.&lt;/p&gt;

&lt;p&gt;I like this passage:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Although some recent work adopted deep neural networks
to extract contexts for object detection and tracking, these
data-driven feedforward methods have well-known problems:
i) They are black-box models that cannot be explained
and only applicable with supervised training by fitting the
typical context of the object, thus difficult to generalize
to new tasks. ii) Lacking explicit representation to handle
occlusions, low resolution, and lighting variations—there
are millions of ways to occlude an object in a given image
 (Wang et al. 2017), making it impossible to have enough
data for training and testing such black box models. In this
paper, we go beyond passive recognition by reasoning about
time-varying containment relations.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;They look at &lt;em&gt;containment relations&lt;/em&gt; induced by human activity. A containment relation occurs when an object contains or holds another object, obscuring it from view. Contained objects have the same trajectories as the container.&lt;/p&gt;

&lt;p&gt;They use an idea that I have thought about as well and think is powerful: rather than relying only on detections for next-state hypotheses, they propose alternative hypotheses based on occlusion reasoning. These come from containment and blocked relations: if an object is contained, it inherits the track state of its container; if an object is blocked, its track state is considered stationary (not always true!).&lt;/p&gt;
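&lt;p&gt;A minimal sketch of that hypothesis generation as I understand it (my own formalization, not the paper’s code):&lt;/p&gt;

```python
# Occlusion-aware next-state hypotheses: a contained object inherits its
# container's track state; a blocked object is hypothesized stationary;
# otherwise fall back to the detector's output.
def next_state_hypothesis(obj, relations, tracks):
    contained_by = relations.get("contained_by", {})
    if obj in contained_by:
        return tracks[contained_by[obj]]  # inherit the container's state
    if obj in relations.get("blocked", set()):
        return tracks[obj]                # assume stationary
    return None                           # use detections instead

tracks = {"cup": (4.0, 2.0), "ball": (1.0, 1.0)}
rel = {"contained_by": {"ball": "cup"}, "blocked": set()}
assert next_state_hypothesis("ball", rel, tracks) == (4.0, 2.0)
```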

&lt;p&gt;This tracking scenario is unique because they only consider human action as the source of state change (i.e., this approach doesn’t apply to general pedestrian or vehicle tracking).&lt;/p&gt;

&lt;p&gt;They use state-of-the-art detection and activity recognition to carry out object tracking. They use the network flow approach for occlusion-aware data association in a sliding window. I think their algorithm has to decide between containment and blocked occlusion events via the activity recognition.&lt;/p&gt;

&lt;p&gt;They note that their approach is limited by how good object detection is. I think an important direction to consider is when external forces other than people can act on the object. This requires a lot more complex modeling of what an occluded object might be doing. Also, we need better object detection!&lt;/p&gt;

&lt;h2 id=&quot;interesting-related-works&quot;&gt;Interesting related works&lt;/h2&gt;

&lt;p&gt;In &lt;a href=&quot;https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=4587584&quot;&gt;Zhang, Li, and Nevatia 2008&lt;/a&gt;, they use an Explicit Occlusion Model (EOM) in the network flow data association model. Good potential baseline.&lt;/p&gt;
</description>
        <pubDate>Sun, 03 Jun 2018 00:00:00 +0000</pubDate>
        <link>https://pemami4911.github.io/paper-summaries/computer-vision/2018/06/03/tracking-occluded-objects-by-reasoning-about-containment.html</link>
        <guid isPermaLink="true">https://pemami4911.github.io/paper-summaries/computer-vision/2018/06/03/tracking-occluded-objects-by-reasoning-about-containment.html</guid>
        
        
        <category>paper-summaries</category>
        
        <category>computer-vision</category>
        
      </item>
    
  </channel>
</rss>
