<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Ryan Rudes on Medium]]></title>
        <description><![CDATA[Stories by Ryan Rudes on Medium]]></description>
        <link>https://medium.com/@ryanrudes?source=rss-7b60671f7b73------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*YbQyX3DQOyYhzsDezm_Lvg.png</url>
            <title>Stories by Ryan Rudes on Medium</title>
            <link>https://medium.com/@ryanrudes?source=rss-7b60671f7b73------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 07 Apr 2026 00:23:40 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ryanrudes/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Rendering OpenAI Gym Environments in Google Colab]]></title>
            <link>https://medium.com/analytics-vidhya/rendering-openai-gym-environments-in-google-colab-9df4e7d6f99f?source=rss-7b60671f7b73------2</link>
            <guid isPermaLink="false">https://medium.com/p/9df4e7d6f99f</guid>
            <category><![CDATA[openai-gym]]></category>
            <category><![CDATA[openai]]></category>
            <category><![CDATA[google-colab]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <dc:creator><![CDATA[Ryan Rudes]]></dc:creator>
            <pubDate>Mon, 08 Feb 2021 16:31:11 GMT</pubDate>
            <atom:updated>2023-12-21T00:03:23.620Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zVwt85az4kZEWPnFq-5ReA.jpeg" /><figcaption>Rendering Breakout-v0 in Google Colab with colabgymrender</figcaption></figure><p><strong>UPDATE</strong>: This package has been updated for compatibility with the new gymnasium library and is now called renderlab. Get it <a href="https://github.com/ryanrudes/renderlab">here</a>.</p><p>I’ve released a module for rendering your <a href="https://gym.openai.com/envs/#classic_control">gym</a> environments in <a href="https://colab.research.google.com/">Google Colab</a>. Since Colab runs on a VM instance, which doesn’t include any sort of a display, rendering in the notebook is difficult. After looking through the various approaches, I found that using the <a href="https://pypi.org/project/moviepy/">moviepy</a> library was best for rendering video in Colab. So I built a wrapper class for this purpose, called <a href="https://github.com/Ryan-Rudes/colabgymrender">colabgymrender</a>.</p><h3>Installation</h3><pre>apt-get install -y xvfb python-opengl ffmpeg &gt; /dev/null 2&gt;&amp;1<br>pip install -U colabgymrender</pre><h3>Example</h3><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/85ba1fb1653e1915b03e647a5878ca93/href">https://medium.com/media/85ba1fb1653e1915b03e647a5878ca93/href</a></iframe><h3>Output</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fnv2dU_9oZJ0%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dnv2dU_9oZJ0&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fnv2dU_9oZJ0%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a 
href="https://medium.com/media/2cae3d32695106716e9a2d062e69a8a2/href">https://medium.com/media/2cae3d32695106716e9a2d062e69a8a2/href</a></iframe><h3>Usage</h3><p>Wrap a gym environment in the Recorder object.</p><pre>env = gym.make(&quot;CartPole-v0&quot;)<br>env = Recorder(env, &lt;directory&gt;, &lt;fps&gt;)</pre><p>If you specify a frame rate via &lt;fps&gt;, the videos saved to &lt;directory&gt; will use that frame rate. Otherwise, the wrapper falls back to the default frame rate the environment specifies in env.metadata[&#39;video.frames_per_second&#39;]. If neither is found, the frame rate will default to 30.</p><p>You can pause or resume recording with env.pause() and env.resume(), but make sure to call env.resume() and record at least one frame before calling env.play(). Otherwise, you’ll get an error for trying to play a video in which no frames were recorded.</p><p>While recording, each time the environment reaches a terminal state, the videos will automatically be released to &lt;directory&gt;; there is no need to release recordings manually or otherwise manage the recorder yourself. 
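</p><p>The frame-rate fallback described under Usage can be sketched as a plain function (a paraphrase of the behavior described here, not the library’s actual source):</p>

```python
def resolve_fps(metadata, fps=None, default=30):
    """Pick a frame rate: an explicit fps wins, then the environment's
    metadata entry, then a default of 30."""
    if fps is not None:
        return fps
    return metadata.get("video.frames_per_second", default)

resolve_fps({}, fps=60)                       # explicit fps wins
resolve_fps({"video.frames_per_second": 50})  # environment's default
resolve_fps({})                               # global default of 30
```

<p>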
Simply record an episode (or part of an episode), then play the content.</p><h3>More Examples</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fm_S206-IV_Y%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dm_S206-IV_Y&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fm_S206-IV_Y%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/901eac865e73186edb41d4f159915a10/href">https://medium.com/media/901eac865e73186edb41d4f159915a10/href</a></iframe><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F3GzP4oUtU1g%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D3GzP4oUtU1g&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F3GzP4oUtU1g%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/dc9b7802544d5c511169bc56f022a448/href">https://medium.com/media/dc9b7802544d5c511169bc56f022a448/href</a></iframe><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fma4Oj775jo0%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dma4Oj775jo0&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fma4Oj775jo0%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/b7de4525dede2dfffbf34a5caaca8ae1/href">https://medium.com/media/b7de4525dede2dfffbf34a5caaca8ae1/href</a></iframe><p>Links:</p><ul><li><a href="https://github.com/Ryan-Rudes/colabgymrender/tree/v1.0.3">GitHub</a></li><li><a 
href="https://pypi.org/project/colabgymrender/">PyPI</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9df4e7d6f99f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/rendering-openai-gym-environments-in-google-colab-9df4e7d6f99f">Rendering OpenAI Gym Environments in Google Colab</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Controlling a Mouse With Your Eyes]]></title>
            <link>https://medium.com/data-science/controlling-a-mouse-with-your-eyes-f1097e7cf2e9?source=rss-7b60671f7b73------2</link>
            <guid isPermaLink="false">https://medium.com/p/f1097e7cf2e9</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[humancomputer-interaction]]></category>
            <dc:creator><![CDATA[Ryan Rudes]]></dc:creator>
            <pubDate>Mon, 21 Sep 2020 14:38:19 GMT</pubDate>
            <atom:updated>2020-09-21T15:11:58.654Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*15ZuFhSoN1JLG5zWJCf73g.png" /><figcaption>Mouse automatically navigating to a coordinate according to eye position (Image by author)</figcaption></figure><h4>A Machine Learning approach to eye pose estimation from just a single front-facing perspective as input</h4><p>In this project, we’ll write code to crop images of your eyes each time you click the mouse. Using this data, we can train a model in reverse, predicting the position of the mouse from your eyes.</p><p>We’ll need a few libraries:</p><pre># For monitoring the web camera and performing image manipulations<br>import cv2</pre><pre># For performing array operations<br>import numpy as np</pre><pre># For creating and removing directories<br>import os<br>import shutil</pre><pre># For recognizing and performing actions on mouse presses<br>from pynput.mouse import Listener</pre><p>Let’s first learn how pynput’s Listener works.</p><p>pynput.mouse.Listener creates a background thread that records mouse movements and mouse clicks. Here’s a simplified example that, upon a mouse press, prints the coordinates of the mouse:</p><pre>from pynput.mouse import Listener</pre><pre>def on_click(x, y, button, pressed):<br>  &quot;&quot;&quot;<br>  Args:<br>    x: the x-coordinate of the mouse<br>    y: the y-coordinate of the mouse<br>    button: the Button that was clicked (e.g. Button.left or Button.right)<br>    pressed: True if the mouse was pressed, False if it was released<br>  &quot;&quot;&quot;<br>  if pressed:<br>    print (x, y)</pre><pre>with Listener(on_click = on_click) as listener:<br>  listener.join()</pre><p>Now, let’s expand this framework for our purposes. However, we first need to write the code that crops the bounding box of your eyes. 
We’ll call this function from within the on_click function later.</p><p>We use <a href="https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html">Haar cascade object detection</a> to determine the bounding box of the user’s eyes. You can download the detector file <a href="https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_eye.xml">here</a>. Let’s make a simple demonstration to show how this works:</p><pre>import cv2</pre><pre># Load the cascade classifier detection object<br>cascade = cv2.CascadeClassifier(&quot;haarcascade_eye.xml&quot;)</pre><pre># Turn on the web camera<br>video_capture = cv2.VideoCapture(0)</pre><pre># Read data from the web camera (get the frame)<br>_, frame = video_capture.read()</pre><pre># Convert the image to grayscale<br>gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)</pre><pre># Predict the bounding box of the eyes<br>boxes = cascade.detectMultiScale(gray, 1.3, 10)</pre><pre># Filter out images taken from a bad angle with errors<br># We want to make sure both eyes were detected, and nothing else<br>if len(boxes) == 2:<br>  eyes = []<br>  for box in boxes:<br>    # Get the rectangle parameters for the detected eye<br>    x, y, w, h = box<br>    # Crop the bounding box from the frame<br>    eye = frame[y:y + h, x:x + w]<br>    # Resize the crop to 32x32<br>    eye = cv2.resize(eye, (32, 32))<br>    # Normalize<br>    eye = (eye - eye.min()) / (eye.max() - eye.min())<br>    # Further crop to just around the eyeball<br>    eye = eye[10:-10, 5:-5]<br>    # Scale between [0, 255] and convert to int datatype<br>    eye = (eye * 255).astype(np.uint8)<br>    # Add the current eye to the list of 2 eyes<br>    eyes.append(eye)</pre><pre>  # Concatenate the two eye images into one<br>  eyes = np.hstack(eyes)</pre><p>Now, let’s use this knowledge to write a function for cropping the eye image. 
First, we’ll need a helper function for normalization:</p><pre>def normalize(x):<br>  minn, maxx = x.min(), x.max()<br>  return (x - minn) / (maxx - minn)</pre><p>Here’s our eye cropping function. It returns the image if the eyes were found. Otherwise, it returns None:</p><pre>def scan(image_size=(32, 32)):<br>  _, frame = video_capture.read()</pre><pre>  gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)<br>  boxes = cascade.detectMultiScale(gray, 1.3, 10)</pre><pre>  if len(boxes) == 2:<br>    eyes = []<br>    for box in boxes:<br>      x, y, w, h = box<br>      eye = frame[y:y + h, x:x + w]<br>      eye = cv2.resize(eye, image_size)<br>      eye = normalize(eye)<br>      eye = eye[10:-10, 5:-5]<br>      eyes.append(eye)</pre><pre>    return (np.hstack(eyes) * 255).astype(np.uint8)<br>  else:<br>    return None</pre><p>Now, let’s write our automation, which will run each time we press the mouse button (assuming we have already defined the variable root as the directory where we would like to store the images):</p><pre>def on_click(x, y, button, pressed):<br>  # If the action was a mouse PRESS (not a RELEASE)<br>  if pressed:<br>    # Crop the eyes<br>    eyes = scan()<br>    # If scan() did not return None (i.e. both eyes were found)<br>    if eyes is not None:<br>      # Save the image<br>      filename = root + &quot;{} {} {}.jpeg&quot;.format(x, y, button)<br>      cv2.imwrite(filename, eyes)</pre><p>Now, we can combine this with our earlier pynput Listener to get the full implementation:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/637e8ebfdf882fc583c652d7deaedde3/href">https://medium.com/media/637e8ebfdf882fc583c652d7deaedde3/href</a></iframe><p>When we run this, each time we click the mouse (if both of our eyes are in view), it will automatically crop the webcam frame and save the image to the appropriate directory. 
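</p><p>The filename scheme can be sketched as a pair of tiny helpers (the names encode_filename and decode_filename are illustrative, not from the original code):</p>

```python
def encode_filename(x, y, button):
    # Mirrors the "{} {} {}.jpeg" format string used in on_click
    return "{} {} {}.jpeg".format(x, y, button)

def decode_filename(filename):
    # The training script later recovers the click position the same way,
    # splitting on spaces and discarding the button token
    x, y, _ = filename.split(" ")
    return float(x), float(y)

encode_filename(385, 686, "Button.left")     # "385 686 Button.left.jpeg"
decode_filename("385 686 Button.left.jpeg")  # (385.0, 686.0)
```

<p>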
The filename of the image will contain the mouse coordinate information, as well as whether it was a right or left click.</p><p>Here’s an example image. In this image, I am performing a left-click at coordinate (385, 686) on a monitor with resolution 2560x1440:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/440/1*J_9_Xt04Y1bQFlvPCo3Bfg.jpeg" /><figcaption>An example (Image by author)</figcaption></figure><p>The cascade classifier is highly accurate, and I have not seen any mistakes in my own data directory so far.</p><p>Now, let’s write the code for training a neural network to predict the mouse position, given the image of your eyes.</p><p>Let’s import some libraries</p><pre>import numpy as np<br>import os<br>import cv2<br>import pyautogui</pre><pre>from tensorflow.keras.models import *<br>from tensorflow.keras.layers import *<br>from tensorflow.keras.optimizers import *</pre><p>Now, let’s add our cascade classifier:</p><pre>cascade = cv2.CascadeClassifier(&quot;haarcascade_eye.xml&quot;)<br>video_capture = cv2.VideoCapture(0)</pre><p>Let’s add our helper functions.</p><p>Normalization:</p><pre>def normalize(x):<br>  minn, maxx = x.min(), x.max()<br>  return (x - minn) / (maxx - minn)</pre><p>Capturing the eyes:</p><pre>def scan(image_size=(32, 32)):<br>  _, frame = video_capture.read()</pre><pre>  gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)<br>  boxes = cascade.detectMultiScale(gray, 1.3, 10)</pre><pre>  if len(boxes) == 2:<br>    eyes = []<br>    for box in boxes:<br>      x, y, w, h = box<br>      eye = frame[y:y + h, x:x + w]<br>      eye = cv2.resize(eye, image_size)<br>      eye = normalize(eye)<br>      eye = eye[10:-10, 5:-5]<br>      eyes.append(eye)</pre><pre>    return (np.hstack(eyes) * 255).astype(np.uint8)<br>  else:<br>    return None</pre><p>Let’s define the dimensions of our monitor. 
You’ll have to change these parameters according to the resolution of your own computer screen:</p><pre># Note that there are actually 2560x1440 pixels on my screen<br># I am simply recording one less, so that when we divide by these<br># numbers, we will normalize between 0 and 1. Note that mouse<br># coordinates are reported starting at (0, 0), not (1, 1)<br>width, height = 2559, 1439</pre><p>Now, let’s load in our data (again, assuming you already defined root). We don’t really care whether it was a right or left click, because our goal is just to predict the mouse position:</p><pre>filepaths = os.listdir(root)<br>X, Y = [], []</pre><pre>for filepath in filepaths:<br>  x, y, _ = filepath.split(&#39; &#39;)<br>  x = float(x) / width<br>  y = float(y) / height<br>  X.append(cv2.imread(root + filepath))<br>  Y.append([x, y])</pre><pre>X = np.array(X) / 255.0<br>Y = np.array(Y)<br>print (X.shape, Y.shape)</pre><p>Let’s define our model architecture:</p><pre>model = Sequential()<br>model.add(Conv2D(32, 3, 2, activation = &#39;relu&#39;, input_shape = (12, 44, 3)))<br>model.add(Conv2D(64, 2, 2, activation = &#39;relu&#39;))<br>model.add(Flatten())<br>model.add(Dense(32, activation = &#39;relu&#39;))<br>model.add(Dense(2, activation = &#39;sigmoid&#39;))<br>model.compile(optimizer = &quot;adam&quot;, loss = &quot;mean_squared_error&quot;)<br>model.summary()</pre><p>Here’s our summary:</p><pre>_________________________________________________________________<br>Layer (type)                 Output Shape              Param #<br>=================================================================<br>conv2d (Conv2D)              (None, 5, 21, 32)         896<br>_________________________________________________________________<br>conv2d_1 (Conv2D)            (None, 2, 10, 64)         8256<br>_________________________________________________________________<br>flatten (Flatten)            (None, 1280)              
0<br>_________________________________________________________________<br>dense (Dense)                (None, 32)                40992<br>_________________________________________________________________<br>dense_1 (Dense)              (None, 2)                 66<br>=================================================================<br>Total params: 50,210<br>Trainable params: 50,210<br>Non-trainable params: 0<br>_________________________________________________________________</pre><p>Let’s train our model. We’ll add some noise to the image data on each pass:</p><pre>epochs = 200<br>for epoch in range(epochs):<br>  # Add a little Gaussian noise each epoch as a simple form of augmentation<br>  noise = np.random.normal(0, 0.02, X.shape)<br>  model.fit(X + noise, Y, batch_size = 32)</pre><p>Now, let’s use our model to move the mouse with our eyes live. Note that this requires a lot of data to work well. However, as a proof of concept, you’ll notice that with just around 200 images, it does, in fact, move the mouse to the general region you are looking at. It’s certainly not controllable until you have much more data though.</p><pre>while True:<br>  eyes = scan()</pre><pre>  if eyes is not None:<br>      eyes = np.expand_dims(eyes / 255.0, axis = 0)<br>      x, y = model.predict(eyes)[0]<br>      pyautogui.moveTo(x * width, y * height)</pre><p>Here’s a proof-of-concept example. Note that I trained with very little data before taking this screen recording. This is a video of my mouse automatically moving to the Terminal application window according to my eyes. As I said, it’s jumpy because there’s very little data. With much more data, it will hopefully be stable enough to control with higher specificity. With just a few hundred images, you’ll only be able to move it to within the general region of your gaze. Also, if throughout your data collection, no images were taken of you looking at a particular region of the screen (say, the edges), the model is unlikely to ever predict within this region. 
This is one of the many reasons we need more data.</p><p><a href="https://drive.google.com/file/d/1TkU3rRS68U9vk4t7AB--NEQgyYH77Irm/view?usp=sharing">eye_mouse_movement.mp4</a></p><p>If you are testing the code yourself, remember to change the values of width and height to your monitor’s resolution in the code file prediction.py.</p><p>You can view the code from this tutorial here:</p><p><a href="https://github.com/Ryan-Rudes/eye_mouse_movement">Ryan-Rudes/eye_mouse_movement</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f1097e7cf2e9" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/controlling-a-mouse-with-your-eyes-f1097e7cf2e9">Controlling a Mouse With Your Eyes</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introductory Reinforcement Learning Project: Learning Tic-Tac-Toe via Self-Play Tabular…]]></title>
            <link>https://medium.com/data-science/an-introductory-reinforcement-learning-project-learning-tic-tac-toe-via-self-play-tabular-b8b845e18fe?source=rss-7b60671f7b73------2</link>
            <guid isPermaLink="false">https://medium.com/p/b8b845e18fe</guid>
            <category><![CDATA[tic-tac-toe]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <dc:creator><![CDATA[Ryan Rudes]]></dc:creator>
            <pubDate>Tue, 08 Sep 2020 17:30:03 GMT</pubDate>
            <atom:updated>2020-11-21T19:34:36.214Z</atom:updated>
<content:encoded><![CDATA[<h3>An Introductory Reinforcement Learning Project: Learning Tic-Tac-Toe via Self-Play Tabular Q-learning</h3><p>In this post, I’ll walk through an introductory project on tabular Q-learning. We’ll train a simple RL agent to evaluate tic-tac-toe positions and return the best move by playing against itself for many games.</p><p>First, let’s import the required libraries:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a784d1b98d80a7b51fd99be3c06f8865/href">https://medium.com/media/a784d1b98d80a7b51fd99be3c06f8865/href</a></iframe><p>Note that tabular q-learning only works for environments which can be represented by a reasonable number of actions and states. Tic-tac-toe has 9 squares, each of which can be either an X, an O, or empty. Therefore, there are at most 3⁹ = 19683 states (and 9 actions, of course), giving a table with 19683 x 9 = 177147 cells. This is not small, but it is certainly feasible for tabular q-learning. In fact, we could exploit the fact that the game of tic-tac-toe is unchanged by rotations of the board: there are actually far fewer “unique states” if you consider rotations and reflections of a particular board configuration the same. 
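</p><p>As a quick sanity check of those counts:</p>

```python
n_states = 3 ** 9   # each of the 9 squares is X, O, or empty
n_actions = 9       # one action per square

print(n_states)              # 19683
print(n_states * n_actions)  # 177147
```

<p>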
I won’t get into deep Q-learning, because this is intended to be an introductory project.</p><p>First, we initialize our q-table with the aforementioned shape:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/360a2fac0956f451ef3d695fdfa2b59e/href">https://medium.com/media/360a2fac0956f451ef3d695fdfa2b59e/href</a></iframe><p>Now, let’s set some hyperparameters for training:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/504e132db1ac67a0e58921498d080765/href">https://medium.com/media/504e132db1ac67a0e58921498d080765/href</a></iframe><p>Now, we need to set up an exploration strategy. Assuming you understand exploration-exploitation in RL, the <em>exploration strategy</em> is the way we will gradually decrease epsilon (the probability of taking random actions). We need to initially play at least semi-randomly in order to properly explore the environment (the possible tic-tac-toe board configurations). But we cannot take random actions forever, because RL is an iterative process that relies on the assumption that the evaluation of future reward gets better over time. If we simply played random games forever, we would be trying to associate a random list of actions with some final game result that has no actual dependency upon any particular action we took.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fdd47ee186d096d38291435f9ccfb9f3/href">https://medium.com/media/fdd47ee186d096d38291435f9ccfb9f3/href</a></iframe><p>Now, let’s create a graph of epsilon vs. 
episodes (number of games simulated) with matplotlib, saving the figure to an image file:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f6773c30ca19130ba7af0a3cd0e7b823/href">https://medium.com/media/f6773c30ca19130ba7af0a3cd0e7b823/href</a></iframe><p>When we start to simulate games, we need to set some restrictions so that the agents can’t make nonsensical moves. In tic-tac-toe, occupied squares are no longer available, so we need a function to return the legal moves, given a board configuration. We will be representing our board by a 3x3 NumPy array, where unoccupied squares are 0, X’s are 1, and O’s are -1. We can use NumPy’s np.argwhere to retrieve the indices of the 0 elements.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/74631fe9ceee9de5558e89fb617a23db/href">https://medium.com/media/74631fe9ceee9de5558e89fb617a23db/href</a></iframe><p>We also need a helper function to convert between a 3x3 board representation and an integer state. We’re storing the future reward estimations in a q-table, so we need to be able to index any particular board configuration with ease. My algorithm for converting a board in this format works by partitioning the total number of possible states into a number of sections corresponding to the number of actions. For each cell in the board:</p><ul><li>If the cell is -1, you don’t change state</li><li>If the cell is 0, you change state by one-third of the window size</li><li>If the cell is 1, you change state by two-thirds of the window size.</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4061d69b5821b802f6e1d1702becd9d1/href">https://medium.com/media/4061d69b5821b802f6e1d1702becd9d1/href</a></iframe><p>Finally, we need one last helper function to determine when the game has reached a terminal state. 
This function also needs to return the result of the game if it is indeed over. My implementation checks the rows, columns, and diagonals for a series of either 3 consecutive 1’s or 3 consecutive -1’s by taking the sum of the board array along each axis. Summing along one axis produces 3 sums, one per row or column. If -3 is among these sums, that line must be all -1’s, indicating that the player corresponding to -1 won, and vice versa. The diagonals work the same way, except there are only 2 diagonals, while there are 3 rows and 3 columns. My original implementation was a bit naive; I found a much better one online that is shorter and slightly faster.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/953c3a10b08814ac13c09f3bfac10daf/href">https://medium.com/media/953c3a10b08814ac13c09f3bfac10daf/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e3bf8a1054d70d3f12ffed024f6f562f/href">https://medium.com/media/e3bf8a1054d70d3f12ffed024f6f562f/href</a></iframe><p>Now, let’s initialize some lists to record training metrics.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/69e65d949fece60edb49eae783db43cf/href">https://medium.com/media/69e65d949fece60edb49eae783db43cf/href</a></iframe><p>past_results will store the results of each simulated game, with 0 representing a tie, 1 indicating that the player corresponding to the positive integer won, and vice versa with -1.</p><p>win_probs will store a list of percentages, updated after each episode. Each value is the fraction of games up to the current episode in which either player has won. 
draw_probs also records percentages, but corresponding to the fraction of games in which a draw occurred.</p><p>After training, if we were to graph win_probs and draw_probs, they should demonstrate the following behavior.</p><ol><li>Early in training, the win probability will be high, while the draw probability will be low. This is because when both opponents are taking random actions in a game like tic-tac-toe, there will more often be wins than draws simply due to the existence of a larger number of win states than draw states.</li><li>Mid-way through training, when the agent begins to play according to its table’s policy, the win and draw probabilities will fluctuate with symmetry across the 50% line. Once the agent starts playing competitively against itself, it will encounter more draws, as both sides are playing according to the same strategic policy. Each time the agent discovers a new offensive strategy, there will be a fluctuation in the graph, for the agent is able to trick its opponent (itself) for a short period of time.</li><li>After fluctuating for a while, draw probabilities should approach 100%. If the agent was truly playing optimally against itself, it would always encounter a draw, for it is attempting to maximize reward according to a table of expected future rewards… the same table being used by the opponent (itself).</li></ol><p>Let’s write the training script. For each episode, we begin at a non-terminal state: an empty 3x3 board filled with 0’s. At each move, with some probability epsilon, the agent takes a random action from the list of available squares. Otherwise, it looks up the row of the q-table corresponding to the current state and selects the action which maximizes the expected future reward. The integer representation of the new board state is computed, and we record the pair (s, a, s&#39;). 
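</p><p>The helpers described earlier can be sketched as follows (a minimal reconstruction, since the original snippets are embedded as gists; the windowing conversion is written here as its base-3 equivalent, and the function names are assumptions):</p>

```python
import numpy as np

def available_moves(board):
    # Indices (0-8) of the empty squares in the flattened 3x3 board
    return np.argwhere(board.flatten() == 0).flatten()

def board_to_state(board):
    # Each cell (-1, 0, or 1) selects one third of the remaining window,
    # which amounts to a base-3 encoding of the digits (cell + 1)
    state = 0
    for cell in board.flatten():
        state = state * 3 + (cell + 1)
    return state

def game_result(board):
    # 1 or -1 if that player has three in a row, 0 for a draw,
    # None if the game is still in progress
    lines = list(board.sum(axis=0)) + list(board.sum(axis=1))
    lines += [np.trace(board), np.trace(np.fliplr(board))]
    if 3 in lines:
        return 1
    if -3 in lines:
        return -1
    if not (board == 0).any():
        return 0
    return None
```

<p>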
Once this game ends, we will need to correlate each state-action pair we just observed with the final game result (which is yet to be determined). When it does end, we refer back to each recorded state-action pair, and update the corresponding cell of the q-table according to the following:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/428/1*C0x4jq6QtELoCixMQhQDOQ.png" /><figcaption>Q-learning update rule</figcaption></figure><p>In the above update formula, s is the integer representation of the state, a is the integer representation of the action the agent took at state s, alpha is the learning rate, R(s, a) is the reward (in our case, the end result of the corresponding game in which this pair (s, a) was observed), Q is the q-table, and the statement involving max represents the maximum expected reward for the resulting state. Say the board configuration was:</p><pre>[[0, 0, 0],<br> [0, 0, 0],<br> [0, 0, 1]]</pre><p>and we took the action 3, corresponding to the cell at coordinate (1, 0), the resulting state would be:</p><pre>[[0, 0, 0],<br> [-1, 0, 0],<br> [0, 0, 1]]</pre><p>This part of the update formula refers to the maximum expected reward for any of the actions we could take from here, according to the policy defined by our current q-table. Therefore, s&#39; is the second state I just described, and a&#39; ranges over all of the actions we could theoretically take from this state (0–8), although in reality, some are illegal (but this is irrelevant).</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b401b4ef92eeb053fcd8ebbb0c3b1146/href">https://medium.com/media/b401b4ef92eeb053fcd8ebbb0c3b1146/href</a></iframe><p>At the end of every 1000 episodes, I just save the list of training metrics, and a plot of these metrics. 
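</p><p>Applied to every recorded (s, a, s&#39;) triple once the game’s result is known, the update rule above can be sketched like this (the values of alpha and gamma and the function name are assumptions for illustration):</p>

```python
import numpy as np

alpha, gamma = 0.1, 0.9    # learning rate and discount factor (assumed values)
Q = np.zeros((3 ** 9, 9))  # one row per state, one column per action

def update_from_game(Q, history, reward):
    """Q-learning update for each (state, action, next_state) triple
    recorded during one game, using the final result as the reward."""
    for s, a, s_next in history:
        Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])
```

<p>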
At the end, I save the q-table and the lists storing these training metrics.</p><h3>Results</h3><p>I trained mine with Google Colab’s online GPU, but you can train yours locally if you’d like; you don’t necessarily have to train all the way to convergence to see great results.</p><p>Just as I previously mentioned, the relationship between games terminating in a win/loss and those terminating in a draw should work as follows:</p><ul><li>Earlier in training, an unskilled, randomly-playing agent will frequently encounter win-loss scenarios.</li><li>Each time the agent discovers a new strategy, there will be fluctuations.</li><li>Towards the end of training, near convergence, the agent will almost always encounter a draw, as it is playing optimally against itself.</li></ul><p>Therefore, the larger fluctuations in the graph indicate moments when the agent learned to evaluate a particular board configuration very well, which temporarily allowed it to prevent draws.</p><p>This is clearly demonstrated in the resulting graph:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0v8SaynBIwfEpmbMHefJsQ.png" /><figcaption>Win/Loss-Draw Ratio Throughout Training</figcaption></figure><p>Throughout the middle of training, it frequently appears as if the q-table will converge, only to quickly change entirely. These are the aforementioned moments when a significant strategy was exploited for the first time.</p><p>Also, as you can see, the fluctuations occur more rarely as you progress through training. This is because there are fewer yet-to-be-discovered tactics as you progress. Theoretically, if the agent converged, there would never be any more great fluctuations like this. 
Draws would occur 100% of the time, and after the rapid rise in the draw percentage, it would not fall back down again.</p><p>I decided it would be a good idea to visualize the change in the q-values over time, so I retrained it while recording the sum of the absolute values of the Q table at each episode. Regardless of whether a particular q-value is positive or negative, recording the sum of all absolute q-values shows us when convergence is occurring (the rate of change of the q-values decreases as we approach convergence).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JBYJaMxu_iCo-PruSxxJNQ.png" /><figcaption>Win/Loss-Draw Ratio Throughout Training + Sum of Absolute Value of Q-table throughout Training</figcaption></figure><p>You can view the full code on Google Colab here:</p><p><a href="https://colab.research.google.com/drive/1w3RYXZ_tg80qNDQZf1I2KyZcigwz8Not?usp=sharing">Google Colaboratory</a></p><p>Or on GitHub here:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/cd2ea755c58aa567b82e3c799593dca4/href">https://medium.com/media/cd2ea755c58aa567b82e3c799593dca4/href</a></iframe><p>Experimenting with the exploration strategy will influence training; you can change the parameters relating to epsilon, as well as its decay schedule, to get different results.</p><p>One final thing to note is that Tic-Tac-Toe can be approached much more easily with simpler value iteration methods, because both the transition matrix and the reward matrix are known. 
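</p><p>As a rough illustration (my own sketch, not the article’s code; the transition tensor P and reward matrix R are hypothetical inputs), value iteration with a known model looks like this:</p>

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a fully known model.

    P: transition tensor, shape (n_states, n_actions, n_states)
    R: reward matrix, shape (n_states, n_actions)
    Returns the optimal state-value function V.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup over all state-action pairs at once
        Q = R + gamma * (P @ V)        # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

<p>With the model known, this converges after only a handful of sweeps on a state space as small as Tic-Tac-Toe’s. 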
This sort of epsilon-greedy optimization is really unnecessary for an environment like Tic-Tac-Toe.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b8b845e18fe" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/an-introductory-reinforcement-learning-project-learning-tic-tac-toe-via-self-play-tabular-b8b845e18fe">An Introductory Reinforcement Learning Project: Learning Tic-Tac-Toe via Self-Play Tabular…</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Style Transfer for Line Drawings]]></title>
            <link>https://medium.com/data-science/style-transfer-for-line-drawings-3c994492b609?source=rss-7b60671f7b73------2</link>
            <guid isPermaLink="false">https://medium.com/p/3c994492b609</guid>
            <category><![CDATA[style-transfer]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Ryan Rudes]]></dc:creator>
            <pubDate>Sun, 06 Sep 2020 22:58:51 GMT</pubDate>
            <atom:updated>2020-09-09T20:06:34.385Z</atom:updated>
            <content:encoded><![CDATA[<h4>Generating Images From Line Drawings With ML</h4><p>Here, I’ll walk through a machine learning project I recently did in a tutorial-like manner. It is an approach to generating full images in an artistic style from line drawings.</p><h3>Dataset</h3><p>I trained on 10% of the <a href="http://www.image-net.org/">Imagenet</a> dataset. This is a dataset commonly used for benchmarks in computer vision tasks. The Imagenet dataset is not openly available; it is restricted to researchers who need it to compute performance benchmarks for comparison with other approaches. Therefore, it is typically required that you submit a request form. But if you are just using it casually, it is available <a href="https://academictorrents.com/collection/imagenet-2012">here</a>. I just wouldn’t use this for anything beyond personal projects. Note that the dataset is very large, which is why I only used 1/10th of it to train my model. It consists of 1000 classes, so I used 100 of these image classes for training.</p><p>I used Imagenet for a different personal project a few weeks ago, so I already had a large collection of files in Google Drive. Unfortunately, it took approximately 20 hours to upload these 140,000 or so images to Google Drive. It is necessary to train the model on Google Colab’s online GPU, but this requires you to upload the images to Google Drive, as you aren’t hosting your coding environment locally.</p><h3>Data Input Pipeline</h3><p>I have a Colab Pro account, but even with the additional RAM, I certainly can’t hold 140,000 line drawings, each 256x256 pixels in size, along with their 256x256 pixel colored counterparts. 
Hence, I have to load in the data on-the-go using a TensorFlow data input pipeline.</p><p>Before we start to set up the pipeline, let’s import the required libraries (these are all of the import statements in my code):</p><pre>import matplotlib.pyplot as plt<br>import numpy as np<br>import cv2<br>from tqdm.notebook import tqdm<br>import glob<br>import random<br>import threading, queue</pre><pre>from tensorflow.keras.models import *<br>from tensorflow.keras.layers import *<br>from tensorflow.keras.optimizers import *<br>from tensorflow.keras.regularizers import *<br>from tensorflow.keras.utils import to_categorical<br>import tensorflow as tf</pre><p>Now, let’s load the filepaths which refer to each image in our subset of Imagenet, assuming you have uploaded them to Drive under the appropriate directory structure and connected your Google Colab instance to your Google Drive.</p><pre>filepaths = glob.glob(&quot;drive/My Drive/1-100/**/*.JPEG&quot;)</pre><pre># Shuffle the filepaths<br>random.shuffle(filepaths)</pre><p>If you don’t want to use the glob module, you can use functions from the os library, which are often more efficient.</p><p>Here are a few helper functions I need:</p><ul><li>Normalizing data</li><li><em>Posterizing</em> image data</li></ul><pre>def normalize(x):<br>  return (x - x.min()) / (x.max() - x.min())</pre><h4>Posterization</h4><p>The aforementioned process of posterization takes an image as input and transforms smooth gradients into more clearly-separated color sections by rounding color values to some nearest value. Here’s an example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*iE5DY2GY2iEp_CptkhMJ5Q.png" /><figcaption>Posterization</figcaption></figure><p>As you can see, the smooth gradients in the resulting image are replaced with clearly separated color sections. 
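</p><p>Conceptually, posterization just rounds each channel value to the nearest allowed level. A minimal sketch of that idea (my own illustration, independent of the implementation used in this project):</p>

```python
import numpy as np

def posterize(image, levels=4):
    """Round each 8-bit channel value to the nearest of `levels` evenly spaced levels."""
    image = np.asarray(image, dtype=float)
    step = 255.0 / (levels - 1)            # spacing between allowed values
    return (np.round(image / step) * step).astype(np.uint8)
```

<p>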
The reason I am implementing this is that it limits the output images to a fixed set of colors, allowing me to frame the learning problem as a classification problem over each pixel in an image. For each available color, I assign a label. The model outputs an image of shape (height, width, num_colors) activated by a softmax function over the last channel, num_colors. Given a variable num_values, I allow all combinations of RGB where the color values are limited to np.arange(0, 255, 255 / num_values). This means that num_colors = num_values ** 3. Here’s an example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6vGtr9cRNhykPm6q2Fsqvw.png" /><figcaption>Posterization</figcaption></figure><p>Here’s how I implemented this:</p><pre>def get_nearest_color(color, colors):<br>  &quot;&quot;&quot;<br>  Args:<br>   - color: A vector of size 3 representing an RGB color<br>   - colors: NumPy array of shape (num_colors, 3)<br>  Returns:<br>   - The index of the color in the provided set of colors that is<br>     closest in value to the provided color<br>  &quot;&quot;&quot;</pre><pre>  return np.argmin([np.linalg.norm(color - c) for c in colors])</pre><pre>def posterize_with_limited_colors(image, colors):<br>  &quot;&quot;&quot;<br>  Args:<br>   - image: NumPy array of shape (height, width, 3)<br>   - colors: NumPy array of shape (num_colors, 3)<br>  Returns:<br>   - Posterized image of shape (height, width, 1), where each value <br>     is an integer label associated with a particular index of the<br>     provided colors array<br>  &quot;&quot;&quot;</pre><pre>  image = normalize(image)<br>  posterized = np.array([[get_nearest_color(x, colors) for x in y] for y in image])<br>  return posterized</pre><h4>Edge Extraction</h4><p>In order to create the input data from our colored images, we need a method of extracting edges from an image which are akin to a trace or line drawing.</p><p>We’ll be using the <a href="https://en.wikipedia.org/wiki/Canny_edge_detector">Canny edge 
detection algorithm</a>. Let’s write our helper function, which takes the path to an image and outputs the associated (X, Y) training pair, comprising a posterization of the colored image alongside the black-and-white edge extraction:</p><pre>def preprocess(path):<br>  color = cv2.imread(path)</pre><pre>  # Skip images of lower resolution than input_image_size<br>  # (assuming your pipeline&#39;s generator function ignores None)<br>  if color.shape[0] &lt; input_image_size[0] or color.shape[1] &lt; input_image_size[1]:<br>    return None, None</pre><pre>  color = cv2.resize(color, input_image_size)<br>  color = (normalize(color) * 255).astype(np.uint8)</pre><pre>  gray = cv2.cvtColor(color, cv2.COLOR_RGB2GRAY)<br>  # Removes noise while preserving edges<br>  filtered = cv2.bilateralFilter(gray, 3, 60, 120)</pre><pre>  # Automatically determine threshold for edge detection algorithm<br>  # based upon the median color value of the image<br>  m = np.median(filtered)<br>  preservation_factor = 0.33<br>  low = max(0, int(m - 255 * preservation_factor))<br>  high = int(min(255, m + 255 * preservation_factor))<br>  filtered_edges = cv2.Canny(filtered, low, high)<br>  filtered_edges = normalize(filtered_edges)<br>  filtered_edges = np.expand_dims(filtered_edges, axis = -1)</pre><pre>  color = cv2.resize(color, output_image_size)<br>  color = color / 255.<br>  color = posterize_with_limited_colors(color, colors)</pre><pre>  return filtered_edges, color</pre><p>The automatic Canny edge detection is just my modification to the small function used in <a href="https://www.pyimagesearch.com/2015/04/06/zero-parameter-automatic-canny-edge-detection-with-python-and-opencv/">this</a> article.</p><h4>The Pipeline</h4><p>As I said, I’m loading in data on-the-spot using an input pipeline. Therefore, I need to define a generator object to load in this data when needed. My generator function is simple because we basically just defined it. 
All it adds is filtering out the None outputs of the preprocess function (images of lower resolution than input_image_size), and filtering out any results containing nan or inf values.</p><pre>def generate_data(paths):<br>  for path in paths:<br>    edges, color = preprocess(path.decode())<br>    if edges is not None:<br>      if not (np.any(np.isnan(edges)) or np.any(np.isnan(color))):<br>        if not (np.any(np.isinf(edges)) or np.any(np.isinf(color))):<br>          # Yield the clean data<br>          yield edges, color</pre><p>I use (128, 128) for both input_image_size and output_image_size. A 128x128 pixel image isn’t <strong>that </strong>low-resolution, so there’s no significant disadvantage for our purposes. Also, Imagenet images are typically much higher resolution, so we can go higher if desired.</p><p>Now let’s build the pipeline. I’m using multithreading for improved speeds. TensorFlow’s .interleave() allows us to do this:</p><pre>thread_names = np.arange(0, 8).astype(str)<br>dataset = tf.data.Dataset.from_tensor_slices(thread_names)</pre><pre>dataset = dataset.interleave(<br>  lambda x: tf.data.Dataset.from_generator(<br>    generate_data,<br>    output_types = (tf.float32, tf.float32),<br>    output_shapes = ((*input_image_size, 1),<br>                     (*output_image_size, 1)),<br>    args = (train_paths,)),<br>  cycle_length = 8,<br>  block_length = 8,<br>  num_parallel_calls = 8)</pre><pre>dataset = dataset.batch(batch_size).repeat()</pre><h4>Testing The Pipeline</h4><p>Let’s load in a training example through our pipeline:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/572/1*-M_BHRyGO6RLS_StW6nMdg.png" /><figcaption>One training example with input line drawing/edges (right) and output colorization (left)</figcaption></figure><p>It’s exactly as desired. Note that the image depicted on the left is not exactly what was outputted by the pipeline. Recall that the pipeline is returning the index referring to the color of each pixel. 
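</p><p>Mapping those indices back to RGB for display is straightforward. A sketch (my own helper names; the colors array mirrors the np.arange construction described earlier):</p>

```python
import numpy as np

num_values = 4
values = np.arange(0, 255, 255 / num_values)
# All RGB combinations of the allowed channel values, shape (num_values**3, 3)
colors = np.stack(np.meshgrid(values, values, values), axis=-1).reshape(-1, 3)

def labels_to_rgb(labels, colors):
    """Map an array of color indices (height, width) back to an RGB image in [0, 1]."""
    return colors[labels.astype(int)] / 255.0
```

<p>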
I simply referred to each associated color to create the visualization. Here’s an example of one that came out much simpler.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/572/1*6bVWk6OcwKShF4nvqRZGig.png" /><figcaption>Simpler training example</figcaption></figure><p>You’ll see that on the left we have the output: a posterized color image, which partially resembles a painting. On the right, you see the input edge extraction, which resembles a sketch.</p><p>Of course, not all training examples will have as good an edge extraction as others. When the colors are more difficult to separate, the resulting outline might be a little noisy and/or scattered. However, this was the most accurate method for extracting edges I could think of.</p><h3>Model Architecture</h3><p>Let’s move on to the model architecture.</p><p>I begin at input_image_size = (128, 128), thus making the input of shape (128, 128, 1) after expanding the last axis. I halve the first two dimensions at each layer until they equal 1. Then, I apply two more convolutional layers with stride = 1, because we can’t decrease the shape of the first two axes any further. Then, I perform the reverse with transposed layers. Each convolutional layer has padding = &#39;valid&#39;, and there is a batch normalization layer between consecutive convolutional layers. 
All convolution layers have ReLU activation, except the last, which of course has softmax activation over the final one-hot-encoded color-label channel.</p><pre>Layer (type)                 Output Shape              Param #<br>=================================================================<br>input_35 (InputLayer)        [(None, 128, 128, 1)]     0<br>conv2d_464 (Conv2D)          (None, 64, 64, 3)         30<br>batch_normalization_388 (Bat (None, 64, 64, 3)         12<br>conv2d_465 (Conv2D)          (None, 32, 32, 9)         252<br>batch_normalization_389 (Bat (None, 32, 32, 9)         36<br>conv2d_466 (Conv2D)          (None, 16, 16, 27)        2214<br>batch_normalization_390 (Bat (None, 16, 16, 27)        108<br>conv2d_467 (Conv2D)          (None, 8, 8, 81)          19764<br>batch_normalization_391 (Bat (None, 8, 8, 81)          324<br>conv2d_468 (Conv2D)          (None, 4, 4, 243)         177390<br>batch_normalization_392 (Bat (None, 4, 4, 243)         972<br>conv2d_469 (Conv2D)          (None, 2, 2, 729)         1595052<br>batch_normalization_393 (Bat (None, 2, 2, 729)         2916<br>conv2d_470 (Conv2D)          (None, 1, 1, 2187)        14351094<br>batch_normalization_394 (Bat (None, 1, 1, 2187)        8748<br>conv2d_471 (Conv2D)          (None, 1, 1, 2187)        43048908<br>batch_normalization_395 (Bat (None, 1, 1, 2187)        8748<br>conv2d_472 (Conv2D)          (None, 1, 1, 2187)        43048908<br>batch_normalization_396 (Bat (None, 1, 1, 2187)        8748<br>conv2d_transpose_229 (Conv2D (None, 1, 1, 2187)        43048908<br>batch_normalization_397 (Bat (None, 1, 1, 2187)        8748<br>conv2d_transpose_230 (Conv2D (None, 1, 1, 2187)        43048908<br>batch_normalization_398 (Bat (None, 1, 1, 2187)        8748<br>conv2d_transpose_231 (Conv2D (None, 2, 2, 2187)        43048908<br>batch_normalization_399 (Bat (None, 2, 2, 2187)        8748<br>conv2d_transpose_232 (Conv2D (None, 4, 4, 2187)        43048908<br>batch_normalization_400 (Bat (None, 4, 4, 2187)        8748<br>conv2d_transpose_233 (Conv2D (None, 8, 8, 729)         14349636<br>batch_normalization_401 (Bat (None, 8, 8, 729)         2916<br>conv2d_transpose_234 (Conv2D (None, 16, 16, 243)       1594566<br>batch_normalization_402 (Bat (None, 16, 16, 243)       972<br>conv2d_transpose_235 (Conv2D (None, 32, 32, 81)        177228<br>batch_normalization_403 (Bat (None, 32, 32, 81)        324<br>conv2d_transpose_236 (Conv2D (None, 64, 64, 27)        19710<br>up_sampling2d_1 (UpSampling2 (None, 128, 128, 27)      0<br>batch_normalization_404 (Bat (None, 128, 128, 27)      108<br>=================================================================<br>Total params: 290,650,308<br>Trainable params: 290,615,346<br>Non-trainable params: 34,962</pre><h3>Training</h3><p>Let’s create some lists to store our metrics throughout training:</p><pre>train_losses, train_accs = [], []</pre><p>Also, a variable for the number of training epochs:</p><pre>epochs = 100</pre><p>And here’s our training script:</p><pre>for epoch in range(epochs):<br>  random.shuffle(filepaths)<br>  history = model.fit(dataset,<br>                      steps_per_epoch = steps_per_epoch,<br>                      use_multiprocessing = True,<br>                      workers = 8,<br>                      max_queue_size = 10)</pre><pre>  train_loss = np.mean(history.history[&quot;loss&quot;])<br>  train_acc = np.mean(history.history[&quot;accuracy&quot;])</pre><pre>  train_losses = train_losses + 
history.history[&quot;loss&quot;]<br>  train_accs = train_accs + history.history[&quot;accuracy&quot;]</pre><pre>  print (&quot;Epoch: {}/{}, Train Loss: {:.6f}, Train Accuracy: {:.6f}&quot;.format(epoch + 1, epochs, train_loss, train_acc))</pre><pre>  if epoch &gt; 0:<br>    fig = plt.figure(figsize = (10, 5))<br>    plt.subplot(1, 2, 1)<br>    plt.plot(train_losses)<br>    plt.xlim(0, len(train_losses) - 1)<br>    plt.xlabel(&quot;Epoch&quot;)<br>    plt.ylabel(&quot;Loss&quot;)<br>    plt.title(&quot;Loss&quot;)<br>    plt.subplot(1, 2, 2)<br>    plt.plot(train_accs)<br>    plt.xlim(0, len(train_accs) - 1)<br>    plt.ylim(0, 1)<br>    plt.xlabel(&quot;Epoch&quot;)<br>    plt.ylabel(&quot;Accuracy&quot;)<br>    plt.title(&quot;Accuracy&quot;)<br>    plt.show()</pre><pre>  model.save(&quot;model_{}.h5&quot;.format(epoch))<br>  np.save(&quot;train_losses.npy&quot;, train_losses)<br>  np.save(&quot;train_accs.npy&quot;, train_accs)</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3c994492b609" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/style-transfer-for-line-drawings-3c994492b609">Style Transfer for Line Drawings</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Wav2Lip: A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild]]></title>
            <link>https://medium.com/data-science/wav2lip-a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild-b1cb48787190?source=rss-7b60671f7b73------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1cb48787190</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[wav2lip]]></category>
            <dc:creator><![CDATA[Ryan Rudes]]></dc:creator>
            <pubDate>Fri, 04 Sep 2020 15:11:33 GMT</pubDate>
            <atom:updated>2020-09-09T03:27:02.827Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bWkXdSzk-wLOAlI8ouH6FQ.png" /><figcaption>Wav2Lip Model Architecture (<a href="https://arxiv.org/pdf/2008.10010v1.pdf">https://arxiv.org/pdf/2008.10010v1.pdf</a>)</figcaption></figure><p>This paper proposes Wav2Lip, an adaptation of the SyncNet model, which outperforms all prior speaker-independent approaches towards the task of video-audio lip-syncing.</p><p>The authors note that, while prior approaches typically fail to generalize when presented with video of speakers not present in the training set, Wav2Lip is capable of producing accurate lip movements with a variety of speakers.</p><p>They continue to summarize the primary intentions of the paper:</p><ol><li>Identifying the cause of prior approaches failing to generalize to a variety of speakers.</li><li>Resolving said issues by incorporating a powerful lip-sync discriminator.</li><li>Proposing new benchmarks for evaluating the performance of approaches towards the task of lip-syncing.</li></ol><h3>Introduction</h3><p>The authors’ first point regards the recent boom in the consumption of video and audio content. Alongside this, there is a growing need for audio-video translation across a variety of languages in order to promote accessibility to a greater portion of the public. 
Thus, there is a significant motivation for applying machine learning to a task such as automated lip-syncing of unconstrained video-audio content.</p><p>Unfortunately, earlier approaches commonly failed to generalize to a variety of speaker identities, only performing well when evaluated on the small subset of potential speakers which comprised their training set.</p><p>Such approaches fail to meet the rigorous requirements of the aforementioned practical application, where a suitable model would have to be capable of accurately syncing a variety of speakers.</p><p>Due to the demanding requirements of using such an approach in practice, one requires a model which can generalize to a variety of speaker identities. As a result, speaker-independent approaches have arisen. These models are trained on thousands of speaker identities. However, even the speaker-independent approaches employed in prior publications fail to meet the expectations of the authors of this work. They acknowledge that prior speaker-independent models, while able to generate accurate lip-syncing on individual static images, are inapplicable to dynamic content. 
For applications such as the translation of television series and films, an approach must generalize to the varying lip shapes of the speakers in different unconstrained videos.</p><p>The authors cite that a video segment approximately 0.05–0.1 seconds out-of-sync is detectable by a human, implying a demanding challenge with a fine margin for error.</p><p>The section concludes with a brief summary of the authors’ contributions:</p><ol><li>They propose <em>Wav2Lip</em>, which significantly outperforms prior approaches.</li><li>They introduce a new set of benchmarks/metrics for evaluating the performance of models in this task.</li><li>They release their own dataset to evaluate the performance of their approach when presented with unseen video-audio content sampled from the wild.</li><li>Wav2Lip is the first speaker-independent approach which frequently matches the accuracy of real synced videos; according to human evaluation, their approach is preferred to existing methods approximately 90% of the time.</li><li>They push the FID score on generating synchronous video frames for dubbed videos from 12.87 (LipGAN) to 11.84 (Wav2Lip + GAN), improving the average user-preference from 2.35% (LipGAN) to 60.2% (Wav2Lip + GAN).</li></ol><h3>Review of Existing Literature</h3><p>(I’ll limit this section to the authors’ reasons for mentioning these papers, and leave out the in-depth information regarding the specific approaches of these works.)</p><p>The authors acknowledge several discrepancies between prior approaches and the requirements for an approach to work fully in the real world:</p><ol><li>Requiring a large amount of training data, for some methods.</li><li>Limitations in terms of the extent of vocabulary learned by the model.</li><li>Training on datasets with a limited set of vocabulary impedes the ability of prior approaches to learn the wide variety of phoneme-viseme mappings.</li></ol><p>They continue to argue why prior approaches 
commonly fail to generate accurate lip-syncing when presented with unseen video content from the wild:</p><ol><li><em>Pixel-level Reconstruction loss is a Weak Judge of Lip-sync</em>: Loss functions incorporated in prior works inadequately penalize inaccurate lip-sync generation.</li><li><em>A Weak Lip-sync Discriminator</em>: The discriminator in the LipGAN model architecture is only 56% accurate at detecting off-sync video-audio content, while the discriminator of Wav2Lip is 91% accurate at distinguishing in-sync content from off-sync content on the same test set.</li></ol><h4>A Lip-sync Expert Is All You Need</h4><p>Finally, the authors propose their approach, taking into consideration both of the above issues in prior works.</p><ul><li>Use a pre-trained lip-sync discriminator that is already accurate in detecting out-of-sync video-audio content in raw, unconstrained samples.</li><li>Adapt the previously existing SyncNet model for this task. (I won’t go into depth about this; rather, I’ll only emphasize the Wav2Lip architecture.)</li></ul><h3>Overview of the Wav2Lip Model Architecture</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bWkXdSzk-wLOAlI8ouH6FQ.png" /><figcaption>Wav2Lip Model Architecture (<a href="https://arxiv.org/pdf/2008.10010v1.pdf">https://arxiv.org/pdf/2008.10010v1.pdf</a>)</figcaption></figure><h4>Terminology</h4><p>The authors use the following terms to refer to the various sections of their network, which I will continue to use following this section:</p><ul><li><em>Random reference segment</em>: A random sample of a segment of consecutive frames used to identify a particular speaker, providing the network context of the identity specific to the aforementioned speaker.</li><li><em>Identity encoder</em>: Encodes the concatenation of the ground truth frames and a random reference segment, providing visual context for the network to adapt appropriately to any particular speaker.</li><li><em>Speech encoder</em>: 
Encodes the audio data (self-explanatory).</li><li><em>Face decoder</em>: Decodes the concatenated feature vectors into a series of reconstructed frames.</li></ul><h4>Methodology</h4><p>At a high level, Wav2Lip inputs a Mel-spectrogram representation of a particular audio segment alongside a concatenation of the corresponding ground truth frames (with the bottom half masked) and a random reference segment whose speaker conforms to that of the ground truth segment. It reduces this input via convolutional layers to form a feature vector for both the audio and frames input. It then concatenates these feature representations, projecting the resulting matrix onto a segment of reconstructed frames through a series of transposed convolutional layers. There are residual skip connections between layers of the identity encoder and face decoder.</p><p>Wav2Lip attempts to fully reconstruct the ground truth frames from their masked copies. We compute L1 reconstruction loss between the reconstructed frames and the ground truth frames. Then, the reconstructed frames are fed through a pretrained “expert” lip-sync detector, while both the reconstructed frames and ground truth frames are fed through the Visual Quality Discriminator. The Visual Quality Discriminator attempts to distinguish between reconstructed frames and ground truth frames to promote the visual generation quality of the frame generator.</p><h3>Loss Functions</h3><h4>The Generator</h4><p>The generator aims to minimize the L1 loss between the reconstructed frames $L_g$ and the ground truth frames $L_G$:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/223/1*nhEQc-HmicENfE-cdCDoXQ.png" /></figure><p>where 𝑵 denotes the <em>batch size</em>.</p><h4>The Lip-Sync Discriminator</h4><p>For lip-syncing, they implement cosine similarity with binary cross-entropy loss, thus computing the probability that a given pair of video and audio segments is in sync. 
More specifically, loss is computed between the ReLU-activated video and speech embeddings 𝑣 and 𝑠. This results in a list of probabilities, one for each sample, indicating the probability that the corresponding sample is in sync.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/171/1*1jjqMdOLyovT4-_EC1l6tw.png" /></figure><p>where the ReLU activation applied may be described as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/212/1*rH5TA8GrdwbATvS2DuPYXw.png" /></figure><p>The full expert discriminator loss is computed by taking the cross-entropy of the distribution $P_{sync}$ as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/222/1*sn7nCeegMjt58x6TTzB7Yg.png" /></figure><h4>The Visual-Quality Discriminator</h4><p>The Visual-Quality Discriminator is trained to maximize the following loss:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/251/1*ndDxFZOw2lCHV0c1sX0_Kg.png" /></figure><p>where the generator loss $L_{gen}$ is formulated as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/220/1*S59sjbEh_4vxIe4zXgkKGg.png" /></figure><p>Accordingly, the generator attempts to minimize the weighted sum of the reconstruction loss, the synchronization loss, and the adversarial loss (recall that we are dealing with two discriminators):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/413/1*tC1AYqZNp6E7qWFqzDk3wg.png" /></figure><p>where $s_w$ is a weighting value indicating the penalty attributed to synchronization, and $s_g$ is the corresponding weight on the adversarial loss.</p><p>These two disjoint discriminators allow the network to achieve superior synchronization-accuracy and visual generation quality.</p><h3>Conclusion and Further Reading</h3><p>That’s it for <em>”A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild”</em>. 
If you’d like to read more in-depth, or about the few things I didn’t cover here:</p><ul><li>The proposed metric/evaluation system</li><li>Benchmark comparisons between Wav2Lip and prior models</li><li>Detailed training procedure and hyperparameters used by the authors</li><li>Real-world evaluation of Wav2Lip</li></ul><p>you can investigate further by reading the paper: <a href="https://arxiv.org/pdf/2008.10010v1.pdf">https://arxiv.org/pdf/2008.10010v1.pdf</a></p><p>Model architecture image credit to the authors of <em>“A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild”.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b1cb48787190" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/wav2lip-a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild-b1cb48787190">Wav2Lip: A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>