Stories by Jacob Bumgarner, Ph.D. on Medium

Breaking it Down: K-Means Clustering

Jacob Bumgarner, Ph.D. — Sun, 06 Nov 2022 19:19:04 GMT

Exploring and visualizing the fundamentals of K-means clustering with NumPy and scikit-learn.

Outline:
1. What is K-Means Clustering?
2. Implementing K-means from Scratch with NumPy
   1. K-means++ Cluster Initialization
   2. Data Labeling and Centroid Updates
   3. K-Means Function Differentiation
   4. Fitting it Together
3. K-Means for Video Keyframe Extraction: Bee Pose Estimation
4. Implementing K-means with scikit-learn
5. Summary
6. Resources

Article Overview

See my GitHub learning-repo for all of the code behind this post.

1. What is K-Means Clustering?

K-means clustering is an algorithm used to classify data into a user-defined number of groups, k. K-means is a form of unsupervised machine learning, meaning that the input data do not have labels prior to running the algorithm.

Clustering data with algorithms such as k-means is valuable for a variety of reasons. Primarily, clustering serves to identify unique groups in unlabeled datasets when building data analytics pipelines. These labels are useful for data inspection, data interpretation, and training AI models. K-means and its variants are used in a variety of contexts, including:

Research. E.g., Categorizing single-cell RNA sequencing results¹
Computer Science. E.g., Clustering emails for spam detection and filtering²
Marketing. E.g., Customer group segmentation for credit card ad targeting³

2. Implementing K-Means from Scratch with NumPy

To gain a fundamental understanding of how k-means works, we will examine each step of the algorithm. We’ll do this with visual explanations and by building a model from scratch with NumPy.

The algorithm and mathematical function behind k-means are beautiful yet relatively simple. Let’s start with an overview:

https://medium.com/media/4886b38b13350dab674ff8a90858ff8c/href

In summary, the k-means algorithm has three steps:

Assign initial cluster center (centroid) positions
Label the data based on the nearest centroid
Move the centroids to the mean position of the newly labeled data points. Go back to step 2 until the cluster centers converge.

Let’s move on to building the model. These are the functions that we’ll need to write in order to use the algorithm:

https://medium.com/media/47a705a2a72170d24a494cd33e8533e9/href

2.1. Cluster Initialization

The first step of the k-means algorithm is for the user to select the number of groups that the data should be clustered into, k.

In the original implementation of the algorithm, once k was selected, the initial positions of the cluster centers (or centroids) would be initialized by randomly selecting k of the input data points as the centroid starting positions.

This approach turned out to be quite inefficient, as the starting centroid positions could end up being randomly close to one another. In 2006, a new and more efficient approach to the centroid initialization process was developed by Arthur and Vassilvitskii⁴. They published their approach in 2007, calling it k-means++.

Rather than randomly selecting the initial centroids, k-means++ efficiently selects the positions based on distance distributions. Let’s visualize how it works:

https://medium.com/media/a738d68a5b57662ba31312d7ab0401ad/href

Now that the intuition behind k-means++ has been exposed, let’s implement the function for it:

https://medium.com/media/b20204a4e63aca8d0cfbea0bf88f4187/href
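As a rough illustration of the seeding idea, a minimal NumPy sketch might look like the following. This is not necessarily the implementation in k_means.py; the function name `init_centroids_pp` and its parameters are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import numpy as np

def init_centroids_pp(X, k, seed=None):
    """k-means++ seeding sketch: returns a (k, n_features) array of centroids."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # First centroid: one data point chosen uniformly at random.
    centroids = [X[rng.integers(n_samples)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are more likely to be picked.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])
    return np.asarray(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0.0, 4.0, 8.0)])
print(init_centroids_pp(X, k=3, seed=0).shape)  # (3, 2)
```

Because already-chosen points have a distance of zero to themselves, they get zero probability of being picked again.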

Of note, rather than having to choose k manually, several unbiased techniques can be used to identify an optimal number. Khyati Mahendru explains two of these approaches, the elbow and silhouette methods, in her article. It’s worth a read!

2.2. Data Labeling and Centroid Updates

Following centroid initialization, the algorithm enters an iterative process of data labeling and centroid position updates.

In each iteration, the input data will first be labeled based on their proximity to the centroids. After this, each centroid’s position will be updated to the average position of the data in its cluster.

These two steps will be repeated until the label assignments/centroid positions no longer change (or converge). Let’s visualize this process:

https://medium.com/media/2e7c6ebd446fb25017aae247bbfa1520/href

Now, let’s implement the data labeling code:

https://medium.com/media/92f09baef4344e104234f7df37257c08/href
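For readers who want something concrete to run, here is a minimal sketch of the labeling step, assuming Euclidean distance. The function name `label_data` is hypothetical and may differ from the code in k_means.py.

```python
import numpy as np

def label_data(X, centroids):
    """Assign each sample to the index of its nearest centroid."""
    # Pairwise Euclidean distances, shape (n_samples, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = label_data(X, centroids)
print(labels)  # [0 0 1 1]
```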

And lastly, we’ll implement the centroid position update function:

https://medium.com/media/46e36b1b7ecbea176c0dd2557eb94cfd/href
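A minimal sketch of the update step could look like this. The name `update_centroids` is hypothetical, and leaving an empty cluster's centroid where it was is an assumption of this sketch rather than something the post specifies.

```python
import numpy as np

def update_centroids(X, labels, centroids):
    """Move each centroid to the mean position of its labeled points."""
    new_centroids = centroids.astype(float)  # astype returns a copy
    for j in range(len(centroids)):
        members = X[labels == j]
        # Assumption: an empty cluster keeps its previous position.
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
start = np.zeros((2, 2))
print(update_centroids(X, labels, start))
```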

2.3. K-Means Function Differentiation

The third step of the k-means algorithm is to update the position of the centroids. We saw that these centroids are updated to the average position of all of the cluster’s labeled points.

Updating the centroid to the average cluster position might seem intuitive, but what is the mathematical rationale behind this step? The rationale lies in the differentiation of the k-means equation.

Let’s expose this intuition by exploring an animated proof of the k-means function differentiation. This proof demonstrates that the positional updates are a result of the k-means equation aiming to minimize the within-group variance.

https://medium.com/media/baa4bbfc957bc0fc09ba068df761ade7/href
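In written form, the argument can be sketched as follows. The notation is an assumption of this sketch: C_j denotes the set of points labeled with cluster j, and \mu_j is that cluster's centroid. The k-means objective is the within-cluster sum of squared distances, and setting its derivative with respect to a centroid to zero recovers the mean.

```latex
\begin{align}
J &= \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \\
\frac{\partial J}{\partial \mu_j} &= -2 \sum_{x_i \in C_j} (x_i - \mu_j) = 0 \\
\mu_j &= \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\end{align}
```

With the labels held fixed, the mean is therefore the centroid position that minimizes the within-cluster variance.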

2.4. Fitting it Together

Now that we’ve constructed the backbone functions for our k-means model, let’s tie it together in a single fit function that will fit our model to the input data. We will also define the __init__ function here:

https://medium.com/media/27e4784240842d33c77061490127def4/href https://medium.com/media/c28d9bd5476b210f690fdad0f3f0328a/href
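To show how the pieces connect, here is a compact, self-contained sketch of a fit loop. The class name `KMeansSketch`, its parameters, and the convergence check on total centroid movement are all assumptions of this sketch, not the contents of k_means.py. For brevity it uses plain random initialization where k-means++ could be swapped in.

```python
import numpy as np

class KMeansSketch:
    def __init__(self, k, max_iter=100, tol=1e-6, seed=None):
        self.k = k
        self.max_iter = max_iter
        self.tol = tol
        self.rng = np.random.default_rng(seed)
        self.centroids = None

    def _label(self, X):
        # Index of the nearest centroid for every sample.
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return d.argmin(axis=1)

    def fit(self, X):
        # Assumption: random initialization from k distinct data points.
        idx = self.rng.choice(len(X), size=self.k, replace=False)
        self.centroids = X[idx].astype(float)
        for _ in range(self.max_iter):
            labels = self._label(X)
            new = self.centroids.copy()
            for j in range(self.k):
                members = X[labels == j]
                if len(members) > 0:
                    new[j] = members.mean(axis=0)
            # Stop once the centroids have effectively stopped moving.
            shift = np.linalg.norm(new - self.centroids)
            self.centroids = new
            if shift < self.tol:
                break
        return self._label(X)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = KMeansSketch(k=2, seed=0).fit(X)
print(np.bincount(labels))  # [20 20]
```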

Now we can put the model to use with my walkthrough notebook found here. This notebook uses synthetically generated data (shown in the videos above) to demonstrate the functionality of our newly written k_means.py code.

3. K-Means for Video Keyframe Extraction: Bee Pose Estimation

Wonderful — we’ve worked our way through the construction of a k-means model entirely from scratch. Rather than just tossing that code aside, let’s use it in an example scenario.

Over the past few years, there have been impressive advancements in the neuroscience & DL research communities that have enabled highly accurate and automated animal behavioral tracking and analysis*. The frameworks used in this research domain implement a variety of convolutional neural network architectures. The models also lean heavily on transfer learning to reduce the amount of training data that researchers need to generate. Two popular examples of these frameworks include DeepLabCut and SLEAP.

* Side note: this subdomain is commonly dubbed computational neuroethology

To train models for automated tracking of specific points on animals, researchers typically have to manually label 100–150 unique frames from their behavioral videos. All things considered, this is a pretty small number that enables automated tracking of indefinitely long behavioral videos!

However, an important aspect that researchers must consider when labeling these training frames is that they should be as unique as possible from one another. It would be ineffective to label only the first 5 seconds of a single video if hours and hours of recordings exist. This is because the behavior and body states of the animals in the first 5 seconds will likely not accurately represent the features of the entire video dataset. As such, the model would not be trained to effectively recognize a variety of features.

So what does this have to do with k-means? Rather than having to manually identify unique keyframes from the videos, algorithms such as k-means can be implemented to automatically cluster the video frames into unique groups. Let’s visualize how this works:

https://medium.com/media/da6990d02ad7b15f66f7369e6ccc6581/href

To get a hands-on understanding of this process, you can follow along with the code used to isolate these frames with my walkthrough notebook.

4. Implementing K-means with scikit-learn

In the real world, one should generally avoid implementing self-constructed algorithms unless necessary. Instead, we should rely on carefully and efficiently designed frameworks that are maintained by expert paid and volunteer contributors.

In this instance, let’s see how easy it is to implement k-means with scikit-learn. The documentation for this class can be found here.

https://medium.com/media/c565e8e679c6c028aea4f6a6ab11f5e6/href
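For reference, a minimal usage sketch of scikit-learn's `KMeans` class on synthetic data could look like this. The synthetic blobs and the parameter choices (`n_clusters=3`, `n_init=10`, `random_state=0`) are assumptions for illustration, not the settings used in the post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic 2D blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.3, (50, 2)),
    rng.normal([3, 3], 0.3, (50, 2)),
    rng.normal([0, 3], 0.3, (50, 2)),
])

# k-means++ initialization, 10 restarts, fixed seed for reproducibility.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = model.fit_predict(X)

print(model.cluster_centers_.shape)  # (3, 2)
print(model.inertia_)  # within-cluster sum of squared distances
```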

The scikit-learn implementation of the model initialization and the fitting is very similar to ours (not a coincidence!), but we got to skip writing ~250 lines of the k_means.py code. Moreover, the scikit-learn framework implements optimized BLAS routines for k-means that make its implementation much faster than ours.

Long story short — learning from scratch is invaluable, but working from scratch isn’t.

5. Summary

In this post, we explored the fundamentals of the math and intuition behind the k-means algorithm. We built a k-means model from scratch using NumPy, used it to extract unique keyframes from an animal behavior video, and learned how to implement k-means with scikit-learn.

I hope this article was valuable for you! Feel free to reach out to me with any comments, ideas, or questions.

6. Resources

References:
1. Hicks SC, Liu R, Ni Y, Purdom E, Risso D (2021). mbkmeans: Fast clustering for single cell data using mini-batch k-means. PLoS Comput Biol 17(1): e1008625.
2. Sharma A, Rastogi V (2014). Spam Filtering using K mean Clustering with Local Feature Selection Classifier. Int J Comput Appl 108: 35-39.
3. Muhammad Shahzad, Bank Customer Segmentation (PCA-KMeans)
4. Arthur D, Vassilvitskii S (2006). k-means++: The Advantages of Careful Seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics Philadelphia, PA, USA. pp. 1027–1035

Educational Resources:
- Google Machine Learning: Clustering
- Andrew Ng, CS229 Lecture Notes, K-Means
- Chris Piech, K-Means

Breaking it Down: K-Means Clustering was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Breaking it Down: Logistic Regression

Jacob Bumgarner, Ph.D. — Fri, 19 Aug 2022 08:54:07 GMT

Exploring the fundamentals of logistic regression with NumPy, TensorFlow, and the UCI Heart Disease Dataset

Logistic Regression Overview. Image by Author.

Outline:
1. What is Logistic Regression?
2. Breaking Down Logistic Regression
   1. Linear Transformation
   2. Sigmoid Activation
   3. Cross-Entropy Loss Function
   4. Gradient Descent
   5. Fitting the Model
3. Learning by Example with the UCI Heart Disease Dataset
4. Training and Testing Our Classifier
5. Implementing Logistic Regression with TensorFlow
6. Summary
7. Notes and Resources

1. What is Logistic Regression?

Logistic regression is a supervised machine learning algorithm that creates classification labels for sets of input data (1, 2). Logistic regression (logit) models are used in a variety of contexts, including healthcare, research, and business analytics.

Understanding the logic behind logistic regression can provide strong foundational insight into the basics of deep learning.

In this article, we’ll break down logistic regression to gain a fundamental understanding of the concept. To do this, we will:

Explore the fundamental components of logistic regression and build a model from scratch with NumPy
Train our model on the UCI Heart Disease Dataset to predict whether adults have heart disease based on their input health data
Build a ‘formal’ logit model with TensorFlow

You can follow the code in this post with my walkthrough Jupyter Notebook and Python script files in my GitHub learning-repo.

2. Breaking Down Logistic Regression

Logistic regression models create probabilistic labels for sets of input data. These labels are often binary (yes/no).

Let’s work through an example to highlight the major aspects of logistic regression, and then we’ll start our deep dive:

Imagine that we have a logit model that’s been trained to predict if someone has diabetes. The input data to the model are a person’s age, height, weight, and blood glucose. To make its prediction, the model will transform these input data using the logistic function. The output of this function will be a probabilistic label between 0 and 1. The closer this label is to 1, the greater the model’s confidence that the person has diabetes, and vice versa.

Importantly: to create classification labels, our diabetes logit model first had to learn how to weigh the importance of each piece of input data. It’s probable that someone’s blood glucose should be weighted higher than their height for predicting diabetes. This learning occurred using a set of labeled training data and gradient descent. The learned information is stored in the model in the form of Weights and bias parameter values used in the logistic function.

This example provided a satellite-view outline of what logistic regression models do and how they work. We’re now ready for our deep dive.

To start our deep dive, let’s break down the core component of logistic regression: the logistic function.

https://medium.com/media/4365eb8af5099983c338aef43187d6fc/href

Rather than just learning from reading alone, we’ll build our own logit model from scratch with NumPy. This will be the model’s outline:

https://medium.com/media/02e1d2b4b2005e20c3bf027cef4f3f37/href

In sections 2.1 and 2.2, we’ll implement the linear and sigmoid transformation functions.

In 2.3 we’ll define the cross-entropy cost function to tell the model when its predictions are ‘good’ and ‘bad’. In section 2.4 we’ll help the model learn its parameters via gradient descent.

Finally, in section 2.5, we’ll tie all of these functions together.

2.1 Linear Transformation

As we saw above, the logistic function first applies a linear transformation to the input data using its learned parameters: the Weights and bias.

The Weights (W) parameters indicate how important each piece of input data is to the classification. The closer an individual weight is to 0, the less important the corresponding piece of data is to the classification. The dot product of the Weights vector and input data X flattens the data into a single scalar that we can place onto a number line.

For example, if we’re trying to predict whether someone is tired based on their height and the hours they’ve spent awake, the weight for that person’s height would be very close to zero.

The bias (b) parameter is used to shift this scalar along the number line relative to the decision boundary (0).

Let’s visualize how the linear component of the logistic function uses its learned weights and bias to transform input data from the UCI Heart Disease Dataset.

https://medium.com/media/ca23c8527f2cc6451f6b9ce55944d4dc/href

We’re now ready to start populating our model’s functions. To start, we need to initialize our model with its Weights and bias parameters. The Weights parameter will be an (n, 1) shaped array, where n is equal to the number of features in the input data. The bias parameter is a scalar. Both parameters will be initialized to 0.

https://medium.com/media/6e421c2bc93f6cf513a52d7940400e53/href

Next, we can populate the function to compute the linear portion of the logistic function.

https://medium.com/media/debec00e634955f0ae75d2466abbd3f4/href
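As a minimal sketch of these two steps, assuming the shapes described above (an (n, 1) Weights array and a scalar bias, both initialized to 0), the linear transformation reduces to a matrix product plus the bias. The function name `linear` is hypothetical.

```python
import numpy as np

def linear(X, W, b):
    """Z = XW + b, with X: (m, n), W: (n, 1), b: scalar -> Z: (m, 1)."""
    return X @ W + b

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # m = 2 samples, n = 2 features
W = np.zeros((X.shape[1], 1))           # Weights initialized to 0
b = 0.0                                 # bias initialized to 0

print(linear(X, W, b).ravel())  # [0. 0.]
```

With zero-initialized parameters every sample maps to 0, which sits exactly on the decision boundary until training moves the Weights and bias.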

2.2 Sigmoid Activation

Logistic models create probabilistic labels (ŷ) by applying the sigmoid function to the output data from the logistic function’s linear transformation. The sigmoid function is useful to create probabilities from input data because it squishes input data to produce values between 0 and 1.

The sigmoid function is the inverse of the logit function, hence the name, logistic regression.

To create binary labels from the output of the sigmoid function, we define our decision boundary to be 0.5. This means that if ŷ ≥ 0.5, we say the label is positive, and when ŷ < 0.5, we say the label is negative.

Let’s visualize how the sigmoid function transforms the input data from the linear component of the logistic function.

https://medium.com/media/47ad3516133187481f9bda682b79e915/href

Now, let’s implement this function into our model.

https://medium.com/media/11021212882c0c33a2cf5cf039bd788e/href
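A minimal sketch of the sigmoid and the 0.5 decision boundary might look like the following. The function names `sigmoid` and `to_labels` are hypothetical. Note that this simple form can emit overflow warnings for very large negative inputs.

```python
import numpy as np

def sigmoid(Z):
    """Squash any real-valued input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-Z))

def to_labels(A, threshold=0.5):
    """Positive label when the probability is at or above the threshold."""
    return (A >= threshold).astype(int)

Z = np.array([-2.0, 0.0, 2.0])
A = sigmoid(Z)
print(to_labels(A))  # [0 1 1]
```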

2.3 Cross-Entropy Cost Function

To teach our model how to optimize its Weights and bias parameters, we will feed in training data. However, for the model to learn optimal parameters, it must know how to tell if its parameters did a ‘good’ or ‘bad’ job at producing probabilistic labels.

This ‘goodness’ factor, or the difference between the probability label and the ground-truth label, is called the loss for individual samples. We operationally say that losses should be high if the parameters did a bad job at predicting the label and low if they did a good job.

The losses across the training data are then averaged to create a cost.

The function that has been adopted for logistic regression is the Cross-Entropy Cost Function. In the function below, Y is the ground-truth label, and A is our probabilistic label.

Cross-Entropy Cost Function
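For reference, the standard binary cross-entropy cost and its two per-sample cases can be written as below. The sample count m and the superscript (i) for the i-th sample are notation assumed for this sketch.

```latex
\begin{align}
J(W, b) &= -\frac{1}{m} \sum_{i=1}^{m} \left[ Y^{(i)} \log A^{(i)} + \left(1 - Y^{(i)}\right) \log\left(1 - A^{(i)}\right) \right] \\
\text{loss}^{(i)} &=
\begin{cases}
-\log A^{(i)} & \text{if } Y^{(i)} = 1 \\
-\log\left(1 - A^{(i)}\right) & \text{if } Y^{(i)} = 0
\end{cases}
\end{align}
```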

Notice that the function changes based on whether y is 1 or 0.

When y = 1, the function computes the log of the label. If the prediction is correct, the loss will be 0 (i.e., log(1) = 0). If it’s incorrect, the loss will get larger and larger as the prediction approaches 0.
When y = 0, the function subtracts the probabilistic label from 1 and then computes the log of the result. This subtraction keeps the loss low for correct predictions and high for incorrect predictions.

Cross-Entropy Cases for 1 and 0 Ground-Truth Labels

Let’s now populate our function to compute the cross-entropy cost for an input data array.

https://medium.com/media/554742ecd29657ed1972d97a68d71b66/href
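A minimal NumPy sketch of this cost is shown below. The function name `cross_entropy_cost` is hypothetical, and clipping the probabilities with a small `eps` to avoid log(0) is an assumption of this sketch, not something the post describes.

```python
import numpy as np

def cross_entropy_cost(A, Y, eps=1e-12):
    """Average binary cross-entropy between probabilities A and labels Y."""
    A = np.clip(A, eps, 1 - eps)  # assumption: guard against log(0)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return losses.mean()

Y = np.array([1.0, 0.0])
good = cross_entropy_cost(np.array([0.9, 0.1]), Y)
bad = cross_entropy_cost(np.array([0.1, 0.9]), Y)
print(good < bad)  # True
```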

2.4 Gradient Descent

Now that we can compute the cost of the model, we must use the cost to ‘tune’ the model’s parameters via gradient descent. If you need a refresher on gradient descent, check out my Breaking it Down: Gradient Descent post.

Let’s create a fake scenario: imagine that we are training a model to predict if an adult is tired. Our fake model only gets two input features: height and hours spent awake. To accurately predict if an adult is tired, the model should probably develop a very small weight for the height feature, and a much larger weight for the hours spent awake feature.

Gradient descent will step these parameters down their gradient such that their new values will produce smaller costs. Remember, gradient descent minimizes the output of a function. We can visualize our imaginary example below.

Example Gradient Descent

To compute the gradient of the cost function w.r.t. the Weights and the bias, we’ll have to implement the chain rule. To find the gradients of our parameters, we’ll differentiate the cost function and the sigmoid function to find their product. We’ll then differentiate the linear function w.r.t. the Weights and the bias separately.

Let’s explore a visual proof of partial differentiation for logistic regression:

https://medium.com/media/bcfd636b5010a13249fb74374b32aea9/href

Let’s implement these simplified equations to compute the average gradients for each parameter across the training examples.

https://medium.com/media/f1276e7c7dc99893dd4056b3a0636440/href
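For the standard sigmoid and cross-entropy pairing, the chain rule collapses to the well-known form dZ = A - Y, which gives a very short implementation. The sketch below assumes X has shape (m, n) and that A and Y have shape (m, 1); the function name `gradients` is hypothetical.

```python
import numpy as np

def gradients(X, A, Y):
    """Average gradients of the cross-entropy cost w.r.t. W and b."""
    m = X.shape[0]
    dZ = A - Y          # (m, 1): derivative of the cost w.r.t. the linear output
    dW = X.T @ dZ / m   # (n, 1): one gradient per weight
    db = dZ.mean()      # scalar gradient for the bias
    return dW, db

X = np.array([[1.0, 2.0], [3.0, 4.0]])
A = np.array([[0.5], [0.5]])
Y = np.array([[1.0], [0.0]])
dW, db = gradients(X, A, Y)
print(dW.ravel(), db)  # [0.5 0.5] 0.0
```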

2.5 Fitting the Model

Finally, we’ve constructed all of the necessary components for our model, so now we need to integrate them. We’ll create a function that is compatible with both batch and mini-batch gradient descent.

In batch gradient descent, every training sample is used to update the model’s parameters.
In mini-batch gradient descent, a random portion of the training samples is selected to update the parameters. Mini-batch selection isn’t that important here, but it’s extremely useful when training data are too large to fit into the GPU/RAM.

As a reminder, fitting the model is a three-step iterative process:

Apply a linear transformation to the input data with the Weights and bias.
Apply a non-linear sigmoid transformation to acquire a probabilistic label.
Compute the gradients of the cost function w.r.t. W and b and step these parameters down their gradients.

Let’s build the function!

https://medium.com/media/a79540be0ee75dd2904b945cb00a2b55/href
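The three steps above can be sketched as a single self-contained function. This is an illustration under stated assumptions rather than the code in logit_model.py: the names `fit`, `epochs`, `learning_rate`, and `batch_size` are hypothetical, and in mini-batch mode the sketch draws one random batch per step.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def fit(X, Y, epochs=1000, learning_rate=0.1, batch_size=None, seed=0):
    """Train Weights and bias; batch_size=None means full-batch descent."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, b = np.zeros((n, 1)), 0.0
    for _ in range(epochs):
        if batch_size is None:
            Xb, Yb = X, Y
        else:
            # Assumption: one random mini-batch per parameter update.
            idx = rng.choice(m, size=min(batch_size, m), replace=False)
            Xb, Yb = X[idx], Y[idx]
        A = sigmoid(Xb @ W + b)                     # steps 1 and 2
        dZ = A - Yb
        W -= learning_rate * (Xb.T @ dZ) / len(Xb)  # step 3
        b -= learning_rate * dZ.mean()
    return W, b

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [1.0]])
W, b = fit(X, Y)
preds = (sigmoid(X @ W + b) >= 0.5).astype(int)
print(preds.ravel())  # [0 0 1 1]
```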

3. Learning by Example with the UCI Heart Disease Dataset

To make sure we’re not just creating a model in isolation, let’s train the model with an example human dataset. In the context of clinical health, the model we’ll train could improve physician awareness of patient health risks.

Let’s learn by example with the UCI Heart Disease Dataset.

The dataset contains 13 features about the cardiac and physical health of adult patients. Each sample is also labeled to indicate whether the subject does or does not have heart disease.

To start, we’ll load the dataset, inspect it for missing data, and examine our feature columns. Importantly, the labels are reversed in this dataset (i.e., 1=no disease, 0=disease) so we’ll have to fix that.

https://medium.com/media/4c79bef9ebd6bc87cef41da49f2d5765/href

Number of subjects: 303
Percentage of subjects diagnosed with heart disease:  45.54%
Number of NaN values in the dataset: 0

Let’s also visualize the features. I’ve created custom figures, but see my gist here to create your own with Seaborn.

From our inspection, we can conclude that there are no obvious missing features. We can also see that there are some stark group separations in several of the features, including age (age), exercise-induced angina (exang), chest pain (cp), and ECG shapes during exercise (oldpeak & slope). These data will be good to train a logit model!

To conclude this section, we’ll finish preparing the dataset. First, we’ll do a 75/25 split on the data to create test and train sets. Then we’ll standardize* the continuous features listed below.

to_standardize = ["age", "trestbps", "chol", "thalach", "oldpeak"]

https://medium.com/media/19fb8d5a7bf729bf027a034c7bac46d7/href
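A NumPy-only sketch of the split and standardization is given below, using random stand-in data with the same number of subjects (303) and five continuous columns. The helper names are hypothetical, and computing the mean and standard deviation from the training set only is an assumption of this sketch, not a detail confirmed by the post.

```python
import numpy as np

def train_test_split_np(X, y, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def standardize(train, test):
    """Scale to zero mean and unit variance using training statistics."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Stand-in data: 303 subjects, 5 continuous features (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(50, 10, (303, 5))
y = rng.integers(0, 2, (303, 1))

x_train, x_test, y_train, y_test = train_test_split_np(X, y)
x_train, x_test = standardize(x_train, x_test)
print(x_train.shape, x_test.shape)  # (227, 5) (76, 5)
```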

*You don’t have to standardize data for logit models unless you’re running some form of regularization. I do it here just as a best practice.

4. Training and Testing Our Classifier

Now that we’ve built the model and prepared our dataset, let’s train our model to predict health labels.

We’ll instantiate the model, train it with our x_train and y_train data, and we'll test it with the x_test and y_test data.

https://medium.com/media/229cbafe2e882a6e9719ff68aaac256b/href

Final model cost: 0.36
Model test prediction accuracy: 86.84%

And there we have it: a test set accuracy of 86.8%. This is much better than a 50% random chance, and for such a simple model, the accuracy is quite high.

To inspect things a bit more closely, let’s visualize the model’s features during its training. On the top row, we can see the model’s cost and accuracy during its training. Then on the bottom row, we can see how the Weights and bias parameters change during training (my favorite part!).

https://medium.com/media/73490b39b3589ae3cff01b700a04f303/href

5. Implementing Logistic Regression with TensorFlow

In the real world, it’s not best practice to build your own model when you need to use one. Instead, we can rely on powerful and well-designed open-source packages like TensorFlow, PyTorch, or scikit-learn for our ML/DL needs.

Below, let’s see how simple it is to build a logit model with TensorFlow and compare its training/test results to our own. We’ll prepare the data, create a single-layer and single-unit model with a sigmoid activation, and we’ll compile it with a binary cross-entropy loss function. Lastly, we’ll fit and evaluate the model.

https://medium.com/media/4f5fb50fd570deb89543f84b55ab19f7/href

Epoch 5000/5000
1/1 [==============================] - 0s 3ms/step - loss: 0.3464 - accuracy: 0.8634

Test Set Accuracy:
1/1 [==============================] - 0s 191ms/step - loss: 0.3788 - accuracy: 0.8553
[0.3788422644138336, 0.8552631735801697]

From this, we can see that the model’s final training cost was 0.35 (compared to our 0.36), and the test set accuracy was 85.5%, very similar to our result above. There are a few minor differences under the hood, but the model performances are very similar.

Importantly, the TensorFlow model was built, trained, and tested in less than 25 lines of code, as opposed to our 200+ lines of code in the logit_model.py script.

6. Summary

In this post, we’ve explored all of the individual aspects of logistic regression. We started the post by building a model from scratch with NumPy. We first implemented the linear and sigmoid transformations, implemented the binary cross-entropy loss function, and created a fitting function to train our model with input data.

To understand the purpose of logistic regression, we then trained our NumPy model on the UCI Heart Disease Dataset to predict heart disease in patients. We saw that the simple model had an 86.8% prediction accuracy, which is pretty impressive.

Finally, after taking the time to learn and understand these fundamentals, we then saw how easy it was to build a logit model with TensorFlow.

In sum, logistic regression is a useful algorithm for predictive analysis. Understanding this model is a powerful first step on the road to studying deep learning.

Well, that’s a wrap! If you’ve made it this far, thanks for reading. I hope that this post was useful for you to gain some valuable insight into the fundamentals of logistic regression.

7. Notes and Resources

Below are a few questions that I had when initially learning about logistic regression. Maybe they’ll be interesting to you too!

Q1: Isn’t a logistic regression model basically just a single unit of a neural network?

A1: Effectively, yes. We can think of logistic regression models as single-layer, single-unit neural networks. Sebastian Raschka provides some nice insight into why this is so. Many neural networks use sigmoid activation functions to generate unit outputs, just as logistic regression does.

Q2: What do we mean by logistic?

A2: The ‘logistic’ of logistic regression comes from the fact that the model uses the inverse of the logit function, aka the sigmoid function.

Resources
- UCI Heart Disease Dataset
- Speech and Language Processing. Daniel Jurafsky & James H. Martin.
- CS229 Lecture notes, Andrew Ng
- Manim, 3Blue1Brown

All images unless otherwise noted are by the author.

Breaking it Down: Logistic Regression was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Breaking it Down: Gradient Descent

Jacob Bumgarner, Ph.D. — Mon, 25 Jul 2022 20:26:23 GMT

Exploring and visualizing the mathematical fundamentals of gradient descent from scratch with Grad-Descent-Visualizer.

https://medium.com/media/669c8fa61ea6412bb136741b54b0ac27/href

Outline
1. What is Gradient Descent?
2. Breaking Down Gradient Descent
  2.1 Computing the Gradient
  2.2 Descending the Gradient
3. Visualizing Multivariate Descents with Grad-Descent-Visualizer
  3.1 Descent Montage
4. Conclusion: Contextualizing Gradient Descent
5. Resources

1. What is Gradient Descent?

Gradient descent is an optimization algorithm that is used to improve the performance of deep/machine learning models. Over a repeated series of training steps, gradient descent identifies optimal parameter values that minimize the output of a cost function.

In the next two sections of this post, we’ll step down from this satellite-view description and break down gradient descent into something a bit easier to understand. We will also visualize the gradient descent of various test functions with my Python package, grad-descent-visualizer.

https://medium.com/media/901068de98325afb8fc30c0346c9f0d4/href

2. Breaking Down Gradient Descent

To gain an intuitive understanding of gradient descent, let’s first ignore machine learning and deep learning. Let’s instead start with a simple function:

A simple univariate function

The goal in gradient descent is to find the minimum of a function, i.e., the lowest possible output value of that function. This means that given our above function f(x), the goal of gradient descent will be to find the value of x that leads the output of f(x) to approach 0. By visualizing this function (below), it’s easy to see that x = 0 produces the minimum of f(x).

https://medium.com/media/1869e2f95cc200da0e1a148a51563efb/href

The important part of gradient descent is: if we initialize x to some random number, say x = 1.8, is there some way to automatically update x so that it eventually produces the minimal output of the function? Indeed, we can automatically find this minimal output with a two-step process:

Find the slope of the function at the point where our input parameter x sits.
Update our input parameter x by stepping it down the gradient.

In our simple gradient descent algorithm, this two-step process is repeated over and over until the output of our function stabilizes at a minimum, or reaches a defined gradient tolerance level. Of note, other more efficient descent algorithms take different approaches (e.g., RMSProp, AdaGrad, Adam).

2.1. Computing the gradient

To find the slope (or gradient, hence gradient descent) of the function f(x) at any value of x, we can differentiate* the function. Differentiating the example function is straightforward with the power rule (below), providing us with: f’(x) = 2x.

Power Rule

Using our starting point x = 1.8, we find our starting gradient of x (dx) to be dx = 3.6.

Let’s write a simple function in Python to automatically compute the derivative of any input variable for f(x) = x².

*I’d strongly recommend checking out 3Blue1Brown’s video to intuitively understand differentiation. The differentiation of this sample function from first principles can be seen here.

https://medium.com/media/36d794a90d21f550724cac98fd94bb2a/href
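A minimal sketch of such a function, following f(x) = x² and f’(x) = 2x from the text (the names `f` and `df` are hypothetical):

```python
def f(x):
    """The example function f(x) = x^2."""
    return x ** 2

def df(x):
    """Its derivative from the power rule: f'(x) = 2x."""
    return 2 * x

print(f"Gradient at x = 1.8: dx = {df(1.8):.1f}")
```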

Gradient at x = 1.8: dx = 3.6

2.2. Descending the gradient

Once we find the gradient of the starting point, we want to update our input parameter so that it steps down this gradient. Doing this will minimize the output of the function.

To move a variable down its gradient, we can simply subtract the gradient from the input parameter. However, if you’ve looked closely, you may have noticed that subtracting the entire gradient from the input parameter x = 1.8 would cause it to bounce back and forth indefinitely between 1.8 and -1.8, preventing it from ever coming close to 0.

Instead, we can define a Learning Rate = 0.1. We’ll scale the dx with this learning rate before subtracting it from x. By tuning the learning rate, we can create ‘smoother’ descents. Large learning rates produce large jumps along the function, and small learning rates lead to small steps along the function.

Lastly, we’ll eventually have to stop the gradient descent. Otherwise, the algorithm would continue endlessly as the gradient approaches 0. For this example, we’ll simply stop the descent once dx is less than 0.01. In your own IDE, you can alter the learning_rate and tolerance parameters to see how the iterations and the output of x vary.

https://medium.com/media/40a69e5304d24f1c052528096c16dc37/href
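The loop described above can be sketched as follows, using the learning rate of 0.1 and the tolerance of 0.01 from the text. The function name `gradient_descent` and the `max_iter` safety cap are assumptions of this sketch.

```python
def df(x):
    """Derivative of f(x) = x^2."""
    return 2 * x

def gradient_descent(x, learning_rate=0.1, tolerance=0.01, max_iter=10_000):
    """Step x down its gradient until |dx| drops below the tolerance."""
    iterations = 0
    dx = df(x)
    while abs(dx) >= tolerance and iterations < max_iter:
        x -= learning_rate * dx  # scaled step down the gradient
        dx = df(x)
        iterations += 1
    return x, iterations

x_final, n_iter = gradient_descent(1.8)
print(f"Function minimum found in {n_iter} iterations. X = {x_final:.2f}")
```

Each step multiplies x by 0.8 here, so the gradient shrinks geometrically until it falls under the tolerance.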

Function minimum found in 27 iterations. X = 0.00

As seen in the video above, our starting value of x = 1.8 was able to automatically be updated to x = 0.0 through the iterative process of gradient descent.

3. Visualizing Multivariate Descents with Grad-Descent-Visualizer

Hopefully, this univariate example provided some foundational insight into what gradient descent actually does. Now let’s expand to the context of multivariate functions.

We’ll first visualize a gradient descent of Himmelblau’s function.

Himmelblau’s Function

There are a few key differences in the descent of multivariate functions.

First, we need to compute partial derivatives to update each parameter. In Himmelblau’s function, the partial derivative with respect to x depends on y (x and y are summed inside squared terms, so differentiating them requires the chain rule). This means that the formula used to differentiate with respect to x will contain y and vice versa.
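As an illustration, the partial derivatives can be written out and checked numerically. This sketch uses the standard definition of Himmelblau’s function, f(x, y) = (x² + y - 11)² + (x + y² - 7)²; the helper names and the finite-difference check are assumptions of this sketch, not part of the original code.

```python
def himmelblau(x, y):
    """Standard definition of Himmelblau's function."""
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def himmelblau_gradient(x, y):
    """Analytic partial derivatives, obtained with the chain rule."""
    a = x**2 + y - 11  # inner term of the first square
    b = x + y**2 - 7   # inner term of the second square
    df_dx = 4 * x * a + 2 * b  # contains y through both a and b
    df_dy = 2 * a + 4 * y * b  # contains x through both a and b
    return df_dx, df_dy


def numerical_gradient(f, x, y, h=1e-6):
    """Central finite differences, used only to sanity-check the analytic form."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy


print(himmelblau_gradient(-0.4, -0.65))
print(numerical_gradient(himmelblau, -0.4, -0.65))
```

The two printed pairs should agree to several decimal places, which is a quick way to catch a mistake in a hand-derived gradient.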

Second, you may have noticed that there was only one minimum in the simple function from Section 2. In reality, there may be many unknown local minima in our cost functions. This means that the local minimum that our parameters find will depend on their starting positions and the behavior of the gradient descent algorithm.

To visualize the descent of this landscape, we’re going to initialize our starting parameters as x = -0.4 and y = -0.65. We can then watch the descent of each parameter in its own dimension and a 2D descent, sliced by the position of the opposite parameter.
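To see how the starting position decides which minimum is reached, here is a small sketch that descends Himmelblau’s function from several points. The learning rate of 0.01, the stopping rule, and the two extra starting points are assumptions chosen for this illustration; only the start at x = -0.4, y = -0.65 comes from the text.

```python
def himmelblau(x, y):
    """Standard definition of Himmelblau's function."""
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def himmelblau_gradient(x, y):
    """Analytic partial derivatives of Himmelblau's function."""
    a = x**2 + y - 11
    b = x + y**2 - 7
    return 4 * x * a + 2 * b, 2 * a + 4 * y * b


def descend(x, y, learning_rate=0.01, tolerance=1e-6, max_iterations=10_000):
    """Update both parameters together until the gradient norm is tiny."""
    for _ in range(max_iterations):
        df_dx, df_dy = himmelblau_gradient(x, y)
        if (df_dx**2 + df_dy**2) ** 0.5 < tolerance:
            break
        x -= learning_rate * df_dx
        y -= learning_rate * df_dy
    return x, y


# The first start is the one used in the text; the others are hypothetical.
starts = [(-0.4, -0.65), (3.5, 2.5), (-3.0, -3.0)]
for start in starts:
    x, y = descend(*start)
    print(f"start={start} -> x={x:.3f}, y={y:.3f}, f={himmelblau(x, y):.2e}")
```

Every run should end at a point where the function is close to 0, but not every run ends at the same point.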

https://medium.com/media/69524201d50ad8104f51a6486474f450/href

For greater context, let’s visualize the descent of the same point in 3D using my grad-descent-visualizer package created with the help of PyVista.

https://medium.com/media/46d226db6ba00b312525ac68805e5a0a/href

3.1 Descent Montage

Now let’s visualize the descent of some more test functions! We’ll place a grid of points across each of these functions and watch how the points move as they descend whatever gradient they are sitting on.

The Sphere Function.

https://medium.com/media/ca19aa708038e76712ae6dafd2815038/href

The Griewank Function.

https://medium.com/media/25bcc4f3f59cb6b8c9508af32a39b3cc/href

The Six-Hump Camel Function. Notice the many local minima of the function.

https://medium.com/media/669c8fa61ea6412bb136741b54b0ac27/href

Let’s re-visualize a gridded descent of the Himmelblau Function. Notice how different parameter initializations lead to different minima.

https://medium.com/media/b46d8cfd4759e0e25cadcd428c3212e8/href

And lastly, the Easom Function. Notice how many points sit still because they are initialized on a flat gradient.

https://medium.com/media/d5482a3729bd59bfba8eeb9dfc6b9b27/href

4. Conclusion: Contextualizing Gradient Descent

So far, we’ve worked through gradient descent with a univariate function and have visualized the descent of several multivariate functions. In reality, modern deep learning models have vastly more parameters than the functions that we’ve examined.

For example, Bloom, a recent natural language processing model from the Hugging Face-led BigScience project, has 176 billion parameters. The chained functions used in this model are also more complicated than our test functions.

However, it’s important to realize that the foundations of what we’ve learned about gradient descent still apply. During each iteration of training of any deep learning model, the gradient of the cost function with respect to every parameter is calculated. These gradients are averaged across the training examples, scaled by a learning rate, and then subtracted from the parameters so that they ‘step down’ their gradients, pushing them to produce a minimal output from the model’s cost function.
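The same update can be sketched on a toy model with just two parameters. Everything in this example is made up for illustration (the linear model, the mean squared error cost, the data, and the learning rate); real deep learning frameworks compute the gradients automatically, but the averaging and subtraction follow the same pattern.

```python
def train_linear_model(xs, ys, learning_rate=0.05, n_iterations=2000):
    """Fit y = w * x + b by averaging per-example gradients of the squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Average the gradient of each parameter across all training examples.
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Step each parameter down its averaged gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # toy data generated from w = 2, b = 1
w, b = train_linear_model(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The fitted parameters should land very close to the values used to generate the toy data.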

Thanks for reading!

5. Resources

- Grad-Descent-Visualizer
- 3Blue1Brown
  - Gradient Descent
  - Derivatives
- Simon Fraser University: Test Functions for Optimization
- PyVista
- Michael Nielsen's Neural Networks and Deep Learning

Breaking it Down: Gradient Descent was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

NYT Mini Crossword — Group Competition Analysis

Jacob Bumgarner, Ph.D. — Sun, 03 Oct 2021 23:31:36 GMT

NYT Mini Crossword — Automated Competition Analysis

Typical group chat banter

Almost every day over the past two years, a group of friends and I have raced each other to get the fastest solve time on the daily NYT Mini crossword puzzles. These mini puzzles are found in 5x5 to 6x6 grid formats with 10 to 12 clues. Each day, we send our solve times and associated trash talk to a group chat.

It’s pretty obvious who always wins in our group (hint… it’s not me), but I’ve always wanted to formally visualize our solve times across extended periods of play.

Before I started learning how to program, we used to manually type all of our times into Excel each month to visualize scores. However, after learning to use Python, I knew that I had to figure out how to automate extracting these times from our group chat for easy analysis. Below I detail the methods that I used to filter solve times from our group chat, analyze solves, and export data for easy plotting.

Placements and solve times from September, 2021

Extracting Texts

To start, I use iOS/macOS, so all of my texts are sent using iMessage. Apple locally stores all iMessage texts in an SQLite database called ‘chat.db’ in the hidden user Library folder. Prior to working on this project, I was unfamiliar with SQLite, so I had to learn some of the syntax!

To scrape all of the texts from our group chat from a specific period, I wound up using this lovely command:

SELECT text, handle_id, date FROM message MSG 
     INNER JOIN chat_message_join CMJ 
     ON CMJ.message_id = MSG.ROWID 
     INNER JOIN chat 
     ON chat.ROWID = CMJ.chat_id 
     WHERE (chat.display_name = "Double Dash 🥊" AND date > {start_time} AND date < {end_time}) 
     ORDER BY MSG.date ASC;

Basically, from every text in our group chat, I want to get the text string, the text sender (handle_id), and the send date. The message table in chat.db stores all messages without explicitly stating what group chat they’re from. To link each message with the chat it was sent in, we have to use the “chat_message_join” table. Then, because I want to find texts using our group chat’s display name (rather than its “chat_id”), I have to join chat_message_join with the chat table itself, where I can explicitly filter texts to only come from our crossword chat. I also limit the texts to specific date windows and sort them chronologically. To facilitate automation, I use Python’s sqlite3 module to make this request.
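A rough sketch of how this request might be issued from Python follows. The function name is hypothetical and this is not the code from the repository; it uses sqlite3 placeholders instead of formatting the values into the SQL string, and it assumes start_time and end_time are already expressed in the same units as the date column.

```python
import sqlite3

# Same joins and filters as the query above, with ? placeholders for the values.
QUERY = """
SELECT MSG.text, MSG.handle_id, MSG.date
FROM message MSG
INNER JOIN chat_message_join CMJ ON CMJ.message_id = MSG.ROWID
INNER JOIN chat ON chat.ROWID = CMJ.chat_id
WHERE chat.display_name = ? AND MSG.date > ? AND MSG.date < ?
ORDER BY MSG.date ASC;
"""


def fetch_messages(connection, chat_name, start_time, end_time):
    """Return (text, handle_id, date) rows for one chat within a date window."""
    cursor = connection.execute(QUERY, (chat_name, start_time, end_time))
    return cursor.fetchall()
```

On a Mac, the connection would typically come from sqlite3.connect() pointed at ~/Library/Messages/chat.db. Passing the chat name as a parameter also avoids quoting problems with names that contain emoji or quotation marks.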

Filtering Times

After dealing with this mess of a database query, I’m left with a ton of texts. These texts contain solve times, trash talk, and oftentimes solve time typo corrections.

In brief, I end up using the Python re module (regex) to keep only the texts that contain solve times (i.e., #:## format). If multiple times are sent from the same person within a single day, I assume the most recent one is the correct time and use that one.

results = re.findall(r"\d:\d\d", text)

After I separate the solve times from the banter texts, I store each solve in a Score class, with text, solver, and date information.
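The filtering and "most recent time wins" rule could look something like the sketch below. The Score dataclass here is a hypothetical stand-in for the class in the repository, the word boundaries in the pattern are an added assumption so that longer times such as 12:34 are not partially matched, and messages are assumed to arrive in chronological order.

```python
import re
from dataclasses import dataclass
from datetime import date

# One digit, a colon, two digits, not embedded in a longer number.
SOLVE_TIME = re.compile(r"\b\d:\d\d\b")


@dataclass
class Score:
    """Hypothetical container for one solve."""
    solver: int
    day: date
    seconds: int


def latest_scores(messages):
    """Keep the most recent solve time per solver per day.

    messages: iterable of (text, solver, day) tuples in chronological order.
    """
    latest = {}
    for text, solver, day in messages:
        matches = SOLVE_TIME.findall(text)
        if not matches:
            continue  # banter, not a solve time
        minutes, seconds = matches[-1].split(":")
        # A later message from the same solver on the same day overwrites the earlier one.
        latest[(solver, day)] = Score(solver, day, int(minutes) * 60 + int(seconds))
    return list(latest.values())
```

For example, if a solver sends "0:45" and later "oops, meant 0:52" on the same day, only the 52-second solve is kept.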

Analyzing & Exporting Data

Analysis portion of the code

After a bit more organization and time filtering, I’m left with raw data that I can analyze. I’m particularly interested in the effect of time of day on solves, so I keep close track of that. At the moment, the program finds placements, average solve wins/losses, average win/loss time of day, and averages of solves by weekdays.

Lastly, I export the data using the csv module. So far, I’ve just been visualizing these data with GraphPad, but one day I may try to create an R script that will generate these graphs for me automatically.
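As a small illustration of the weekday averaging and the export step, here is a sketch with hypothetical function names and column headers. It assumes each solve has already been reduced to a (day, seconds) pair and is not the code from the repository.

```python
import csv
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def average_by_weekday(solves):
    """Group (day, seconds) pairs by weekday name and average each group."""
    by_weekday = defaultdict(list)
    for day, seconds in solves:
        by_weekday[WEEKDAYS[day.weekday()]].append(seconds)
    return {weekday: mean(times) for weekday, times in by_weekday.items()}


def write_averages(averages, handle):
    """Write one row per weekday to an open text file or file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["weekday", "mean_seconds"])
    for weekday, seconds in averages.items():
        writer.writerow([weekday, seconds])


solves = [(date(2021, 9, 4), 90), (date(2021, 9, 11), 110), (date(2021, 9, 6), 40)]
print(average_by_weekday(solves))
```

To save the result, open the output file with newline="" (as the csv documentation recommends) and pass the handle to write_averages.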

Terminal output

For now, the code I wrote to extract and analyze times can be found at my GitHub page below:

GitHub - JacobBumgarner/Daily_Mini_Analysis: A short python script written to scrape iOS text messages for crossword solve times.

See some of the other results below!

Saturdays are clearly our slowest days — the puzzles are much harder then.

Solves across the month and averaged by weekday.

It doesn’t seem like there’s a clear effect of time-of-day on the solves, but a trend may appear once I add more data.

Solve speed/time of solve for all players

Disclaimer: I am in no way affiliated with the NYT. Thanks for reading!