<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by University of Toronto Machine Intelligence Team on Medium]]></title>
        <description><![CDATA[Stories by University of Toronto Machine Intelligence Team on Medium]]></description>
        <link>https://medium.com/@utorontomist?source=rss-fac2a6cb587d------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*nJydOiVB9XRUABtBixpzqg.png</url>
            <title>Stories by University of Toronto Machine Intelligence Team on Medium</title>
            <link>https://medium.com/@utorontomist?source=rss-fac2a6cb587d------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 16:31:12 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@utorontomist/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[PhotoML: Photo culling with Machine Learning]]></title>
            <link>https://utorontomist.medium.com/photoml-photo-culling-with-machine-learning-908743e9d0cb?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/908743e9d0cb</guid>
            <category><![CDATA[photo-culling-services]]></category>
            <category><![CDATA[photography]]></category>
            <category><![CDATA[university-of-toronto]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Tue, 06 Jun 2023 23:01:29 GMT</pubDate>
            <atom:updated>2023-06-06T23:01:29.689Z</atom:updated>
            <content:encoded><![CDATA[<p><em>A UTMIST Project by: Gursimar Singh, Fengyuan Liu, Yue Fung Lee, Jingmin Wang, Dev Shah, Chris Oh, Muhammad Ahsan Kaleem.</em></p><p>Photo Culling is a tedious, time-consuming process through which photographers select the best few images from a photo shoot for editing and finally delivering to clients. In this project, we develop the software stack, user-facing product, and backed ML solution for PhotoML, a startup aiming to solve the aforementioned problem. We use computer vision techniques to automate the process of photo culling while also considering the individual artistic preferences of each photographer.</p><p>We use a variety of pretrained sub-models to do this, extracting features from each one and ensembling these features to make a decision as to which category an image belongs to.</p><h3>Background Information</h3><p>Photographers typically take thousands of images in any given shoot, but less than ten percent of those images are selected to be in the final product that is delivered to the client. This process of filtering out the best few images from thousands of images is known as culling and takes a significant amount of time. For this project, we aim to sort images into three buckets; The first consists of images the photographer will <strong>definitely keep</strong> in the final product, the second is images that would <strong>definitely be removed</strong> and the third is a <strong>maybe-bucket</strong>, from which it is up to the photographer which bucket to place the images in.</p><p>For this project we limit our scope to wedding photographers, therefore the dataset of images consists primarily of images with multiple people in the frame. As such, some of the features that a photographer may look out for while culling would be;</p><p>1. 
Ensuring that all subjects visible in the frame have their eyes open and are focused on the camera.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/435/1*D4KatRb8F0m34qcZWjLDYw.png" /></figure><p>2. Ensuring that there is no motion blur due to the movement of subjects in the frame when the picture was taken.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*bHAjD5ohBnA0-g7QZyg_PA.png" /></figure><p>3. Not selecting more than 2–3 images of the same scene, since 10 or 20 images are sometimes taken of the same scene at once.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/722/1*Hk7xrHkIYW2bFw92NlbEMw.png" /></figure><p>4. Ensuring that the subjects in the frame show a particular emotion. This may be an emotion like happiness or sadness, depending on the scene and photographer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/682/1*Z3sm_o_gblzISnU_eegUyQ.png" /></figure><p>5. Selecting images with adequate lighting and good color composition.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/1*WNNdcPwpXLcPauRcNQMYvQ.png" /></figure><p>6. Ensuring that the selected images make good use of the space available in the frame. Typically, photographers would use the <strong>rule of thirds</strong> [1], in which they would place their subjects in the left or right third of an image and allow the background scenery to occupy the rest of the image. 
However, for wedding photos, photographers may want their subjects to occupy the center of the image and take up most of the frame since the focus of the photoshoot is the people involved.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/446/1*phGl2asutfUrJqzUON4Hdw.png" /></figure><p>We use a different submodel to account for each of the aforementioned criteria when performing image culling.</p><h3>Why manual feature selection and separate submodels?</h3><p>There are three main reasons we decided to go for an ensembling approach with fixed, predefined features to extract rather than training a conventional CNN to automatically extract arbitrary features for classification.</p><ol><li>The difference between culled and selected images is minute, since all it takes is one subject with closed eyes, a slight blur, or a small change in camera angle. For this reason, simply using a CNN for classification would not be effective, since minute details like these would not be extracted as features.</li><li>Most photographers would generally agree on which images are good, which means that individual artistic style only accounts for a small percentage of selected images. Moreover, the features that photographers look out for are finite and well-defined; therefore, it is possible to have a feature extraction submodel for each.</li><li>It is easy to add more submodels over time due to this modular approach.</li></ol><h3>Our approach</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/1*HSd_CLq1VB9PepYF_hoshw.png" /><figcaption>A diagram of our approach. The green boxes represent pretrained models while the blue boxes represent models that we trained.</figcaption></figure><p>The diagram above shows our approach to image culling, wherein each submodel is used to extract features from the images. 
The ensembling model takes as input the combined outputs of all submodels, essentially serving as the network that learns how much importance a given photographer assigns to each of the extracted features. Based on the output of the ensembling model, we sort images into the three buckets as shown in the diagram above.</p><p>It is important to note that, because we use multiple submodels, computational efficiency matters more than small increases in accuracy. Slight improvements in accuracy through state-of-the-art methods are not very useful, since output ensembling means that the output of a single model does not generally have a large impact on the overall prediction score. However, running many submodels is itself compute-intensive, so we pick submodels that are relatively simple and cheap to run.</p><h3>Duplicate detection</h3><p>Duplicate detection is the process of selecting and grouping near-duplicate images. In any given shoot, the photographer takes multiple images of the same scene to ensure that there is at least one image without motion blur or closed eyes. The duplicate detection model identifies these duplicate groups given a dataset consisting of all the photos in a shoot.</p><p>This is essentially an unsupervised learning problem akin to image clustering, except we cannot use traditional approaches like K-Means clustering because we do not know how many clusters exist prior to duplicate detection; there could be any number of duplicate groups in the dataset.</p><p>This model is an essential preprocessing step required before classifying images in a photoshoot. 
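Since the number of duplicate groups is unknown ahead of time, one simple alternative to K-Means is a greedy distance-threshold grouping over the extracted feature vectors. The following is a hypothetical minimal sketch (the `group_duplicates` helper, cosine distance, and the 0.1 threshold are our own illustrative choices, not PhotoML's actual implementation):

```python
import numpy as np

def group_duplicates(features, threshold=0.1):
    """Greedy duplicate grouping: assign each image to the first existing
    group whose representative is within `threshold` cosine distance;
    otherwise start a new group. No cluster count is fixed in advance."""
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    reps, labels = [], []
    for f in feats:
        for gid, rep in enumerate(reps):
            if 1.0 - float(f @ rep) < threshold:  # cosine distance
                labels.append(gid)
                break
        else:
            reps.append(f)   # f becomes the new group's representative
            labels.append(len(reps) - 1)
    return labels
```

Near-identical feature vectors land in the same group, so only the best-scoring image per group needs to survive culling.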
Without duplicate detection, multiple images from the same duplicate group would be assigned similar prediction scores, so all of them would be either culled or selected together. This is not ideal, since photographers select only the best few images from each duplicate group. The duplicate detection step therefore ensures that the output is diverse, with selected images drawn from different groups.</p><p>The duplicate detection approach is relatively simple and takes the following steps:</p><ol><li>A pretrained MobileNetV3 [2] model is used to extract features from images. Each extracted feature can be flattened to a vector in a high-dimensional latent space.</li><li>The extracted features are clustered using an approach that allows for an arbitrary number of clusters.</li></ol><p>This approach is simpler than keypoint-based methods like SIFT or ORB and uses significantly less computational power.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/679/1*NYPvInsJryllyYuwHdUfmA.png" /><figcaption>A visualization of the process of duplicate detection</figcaption></figure><h3>Closed Eye Detection</h3><p>The closed eye detection submodel allows us to identify how many subjects in the frame have their eyes closed. Typically, images with closed eyes are not desirable, but in a wedding shoot, subjects may have their eyes closed for multiple reasons (they may be crying, for example), and the images may still be desirable. These nuances are expected to be learned by the ensembling model.</p><p>For this approach, we began by detecting facial landmarks using dlib [3], a Python library that can detect 68 landmarks on human faces. Each eye is marked with 6 landmarks when using dlib. 
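Concretely, the eye aspect ratio (EAR) [4] built from those six landmarks can be sketched as follows (the `ear` helper and the sample coordinates are our own illustration):

```python
import math

def ear(p1, p2, p3, p4, p5, p6):
    """Eye aspect ratio: (|p2-p6| + |p3-p5|) / (2 * |p1-p4|),
    where p1..p6 are (x, y) eye landmarks in dlib's ordering."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

# a wide-open toy eye: large vertical gaps relative to the eye's width
open_eye = ear((0, 0), (1, 2), (2, 2), (3, 0), (2, -2), (1, -2))
```

A closed eye collapses the vertical distances, driving the ratio toward zero.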
The ratio of the distances between these landmarks can be used to identify whether the eye is closed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/185/1*sRKpu63NgZw5Lbtam9OOkA.png" /><figcaption>Eye landmarks as detected by dlib [3]</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/476/1*bY3kMUyOtqFZcPog8xYLqA.png" /><figcaption>Formula for EAR [4]</figcaption></figure><p>Intuitively, we can see that when the distances between p2 and p6 and between p3 and p5 are large, the eye is open. We simply add these distances and normalize by dividing by twice the distance between p1 and p4, which makes the ratio invariant to the eye's distance from the camera. This EAR score is then output by this model and combined with the other activations.</p><h3>Emotion Detection</h3><p>The emotion detection submodel identifies faces and their associated emotions given an image, outputting the average emotion of the image. A photographer may want to look out for this particular feature, since the emotions of subjects in a wedding are important when taking photos.</p><p>To do this, we used the facenet-pytorch library [5], which uses the Multi-Task Cascaded Convolutional Neural Network (MTCNN) [6] model to detect bounding boxes around faces. We then created a custom convolutional neural network and trained it on a dataset of faces annotated with emotions. With these two models, we were able to first identify and crop faces in an image and then pass the cropped faces through the emotion recognition neural network, which classified each face's emotion as angry, disgusted, fearful, happy, sad, surprised, or neutral. 
The extracted probabilities are then stored and combined with activations from the other models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/479/1*pp6W4OzzAcnY1vrcsQSwQQ.png" /><figcaption>Outputs of the emotion detection model</figcaption></figure><h3>Neural Style Extraction</h3><p>The neural style extraction submodel is used to account for features like color composition and lighting, which dictate the general theme of an image; a photographer may want to look out for these features when performing image culling.</p><p>This model differs from the rest in that it has no learned parameters. Instead, we adopt the same approach to obtaining the stylistic representation of an image as in the original neural style transfer paper by Gatys et al. [7].</p><p>We use the Gramian matrix of an image as the representation of the style of the image. This matrix essentially removes all information about the structure of the image and only accounts for how the color channels of an image relate to each other. The steps for computing a Gramian matrix are as follows:</p><ol><li>Take the original image of shape (Color Channels (C) x Height (H) x Width (W)) and flatten the height and width dimensions to obtain a tensor of shape (C x (H*W)).</li><li>Multiply the matrix obtained in the previous step by its transpose. The result is a C x C matrix: the Gramian matrix.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/679/1*cVcWWj_PU4Kc9mzoSN1btg.png" /><figcaption>A visual demonstration of the process of obtaining a Gramian matrix from an image</figcaption></figure><p>So why is the Gramian matrix considered to be a good representation of the style of an image? 
In the first step, when we collapse the height and width dimensions, we essentially discard any information about the structure of the objects in the image, since we just flattened each channel's 2-dimensional grid into a vector. When we multiply this matrix by its transpose, every element of the output is the dot product of one color channel with another color channel of the original image. The dot product serves as a measure of the similarity of two vectors; therefore, every element in the output matrix represents how similar one color channel is to another. On a higher, more abstract level, the relation of colors to each other is exactly what we call the style of an image, which is why the Gramian matrix is a good representation of it.</p><p>This can be easily verified with an example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/593/1*4PXYCX1vo85JzpzSgZjrhw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/581/1*7l_Lf7N_xJ0QOlp7sXdZKg.png" /><figcaption>Even though img1 and img2 are structurally similar, their Gram matrix loss is higher than that of img2 and the style image, which are stylistically similar. This shows that a Gram matrix essentially discards most information about the structure of an image and represents the style of the image.</figcaption></figure><p>Although the example only details Gramian matrices on raw images, the same process can be performed on the features obtained by passing images through a pretrained CNN. This becomes a more advanced representation of style, accounting not only for color composition but also for texture and other advanced stylistic features.</p><p>The Gramian matrices obtained from this step are also combined with activations from the other submodels. 
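The two-step computation described above fits in a few lines of NumPy (a minimal illustration with a toy-sized image):

```python
import numpy as np

def gram_matrix(image):
    """Gramian matrix of a (C, H, W) image: flatten the spatial dimensions
    to get a (C, H*W) matrix, then multiply it by its own transpose."""
    c, h, w = image.shape
    flat = image.reshape(c, h * w)
    return flat @ flat.T   # (C, C); entry [i, j] = channel_i . channel_j

img = np.arange(24, dtype=float).reshape(3, 2, 4)  # toy 3-channel "image"
g = gram_matrix(img)                               # 3 x 3 style representation
```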
</p><h3>Semantic Segmentation</h3><p>The semantic segmentation model is meant to account for the adequate use of the space in the frame, segmenting the image into the subjects and the background. This allows us to detect whether most of the space in the frame is occupied by the subjects, which is usually desired for a wedding photograph.</p><p>For this submodel, we used the mmsegmentation library [8] with the MobileNetV3 backbone trained on the ADE20k dataset [9]. We found that this model offered a good trade-off between efficiency and performance. This particular dataset was chosen since it has image classes similar to the objects that would commonly be seen in wedding photos, including person, window, building, wall, etc.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*QH6EGGge81UZWcgqtUOT1g.png" /><figcaption>Example segmentation map from our segmentation model</figcaption></figure><p>The segmentation map output from this model is also combined with the activations of the previously mentioned submodels.</p><h3>Training</h3><p>During training, we first passed our images through the pretrained submodels, collecting the features output by each one.</p><p>We trained the ensembling model in two steps:</p><ol><li>We combined the outputs of the pretrained submodels to ensure uniform structure before passing these to the ensembling model.</li><li>We passed the combined outputs from the submodels into the ensembling model, optimizing it to classify images based on the features from the submodels.</li></ol><p>By not doing direct classification with a single model, we ensure that the ensembling model learns useful features that are relevant to the task of image culling and does not overfit on the small amount of training data that each photographer can provide to the model. 
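The ensembling model itself can be sketched as a small head over the concatenated, frozen submodel features. This is a hypothetical minimal PyTorch version (the `EnsemblingHead` name, layer sizes, and 128-dimensional combined feature vector are our own assumptions):

```python
import torch
import torch.nn as nn

class EnsemblingHead(nn.Module):
    """Maps concatenated submodel features to keep / maybe / cull scores.
    Only this head is trained; the submodels stay frozen."""
    def __init__(self, feature_dim=128, hidden=64, n_buckets=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_buckets),
        )

    def forward(self, features):        # features: (batch, feature_dim)
        return self.net(features)

head = EnsemblingHead()
scores = head(torch.randn(4, 128))      # one score triple per image
```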
This also speeds up model convergence, since features are predetermined and no feature extraction layers need to be optimized.</p><h3>Conclusion</h3><p>In summary, we chose to take a modular approach to image culling, using multiple submodels to simulate searching for the features that a photographer would look for. This added flexibility to our approach, since it was easy to add and remove submodels as needed, and it also helped with classification, since we were able to specify precisely which features should be considered.</p><p>We then stacked an ensembling model on top of the submodels. This stacked structure allowed us to split training into multiple steps, using only pretrained features to perform classification.</p><p>This approach to training allowed for quicker convergence since features were predetermined.</p><p>As a future improvement, transformer-based architectures are likely worth exploring due to their ability to attend to the most relevant parts of the input images. Adding more submodels that account for more features could also help with performance on this task.</p><p>Photo culling is a task that even the most experienced photographers find difficult. 
It was an ambitious attempt to use deep learning techniques for a task that is so intrinsically nuanced, and while we did achieve fairly good results, development remains underway and the future likely holds many improvements for this project.</p><h3>Citations</h3><p>[1] <a href="https://www.adobe.com/ca/creativecloud/photography/discover/rule-of-thirds.html">https://www.adobe.com/ca/creativecloud/photography/discover/rule-of-thirds.html</a></p><p>[2] <a href="https://arxiv.org/abs/1905.02244">https://arxiv.org/abs/1905.02244</a></p><p>[3] <a href="http://dlib.net/">http://dlib.net/</a></p><p>[4] <a href="https://medium.com/analytics-vidhya/eye-aspect-ratio-ear-and-drowsiness-detector-using-dlib-a0b2c292d706">https://medium.com/analytics-vidhya/eye-aspect-ratio-ear-and-drowsiness-detector-using-dlib-a0b2c292d706</a></p><p>[5] <a href="https://github.com/timesler/facenet-pytorch">https://github.com/timesler/facenet-pytorch</a></p><p>[6] <a href="https://arxiv.org/abs/1604.02878">https://arxiv.org/abs/1604.02878</a></p><p>[7] <a href="https://arxiv.org/abs/1508.06576">https://arxiv.org/abs/1508.06576</a></p><p>[8] <a href="https://github.com/open-mmlab/mmsegmentation">https://github.com/open-mmlab/mmsegmentation</a></p><p>[9] <a href="https://groups.csail.mit.edu/vision/datasets/ADE20K/">https://groups.csail.mit.edu/vision/datasets/ADE20K/</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Application Of Deep Learning Models To Reconstruct ECG Signals]]></title>
            <link>https://utorontomist.medium.com/an-application-of-deep-learning-models-to-reconstruct-ecg-signals-49cf3b45e9bf?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/49cf3b45e9bf</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[university-of-toronto]]></category>
            <category><![CDATA[ecg]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[medical-ai]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Mon, 05 Jun 2023 23:02:17 GMT</pubDate>
            <atom:updated>2023-06-05T23:02:17.799Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/624/1*2lthVp9PpKcGMQrwuh9S6Q.png" /><figcaption>* Reference for image: A. J. Huber, A. H. Leggett, and A. K. Miller, “Electrocardiogram: Blog chronicles value of old, but still vital cardiac test,” <em>Scope</em>, 19-Dec-2017. [Online]. Available: <a href="https://scopeblog.stanford.edu/2016/09/21/electrocardiogram-blog-chronicles-value-of-old-but-still-vital-cardiac-test/.">https://scopeblog.stanford.edu/2016/09/21/electrocardiogram-blog-chronicles-value-of-old-but-still-vital-cardiac-test/.</a> [Accessed: 24-May-2022].</figcaption></figure><p>By: Yan Zhu, Chunsheng Zuo, Yunhao Qian, Guanghan Wang</p><h3>Abstract</h3><p>An electrocardiogram (ECG) measures the electrical activity of the heart. A complete set of ECG consists of 12 signals, yet measuring 12 leads of an ECG can be time-consuming and resulting in higher misdiagnosis rates. This motivates us to design a machine learning solution to reconstruct the 12 leads ECGs with 2–3 leads of input to improve the data collection efficiency and reduce the chance of lead misplacement.</p><p>Our team examined 5 different models, including linear regression, CNN, and LSTM, which have been applied in prior works, and U-Net and transformer, which are more advanced models that have not been used for ECG reconstruction before. The result has shown that the transformer beats other benchmarks and achieves the best overall performance in terms of RMSE, Pearson correlation coefficient, and the qualities of the reconstructed ECG diagrams.</p><h3>Background</h3><p><strong>What Is ECG?</strong></p><p>The heartbeats are primarily controlled by a group of autorhythmic cells lying on the right atrium of the heart, namely the sinoatrial node (or SA node). The electrical signals generated by the SA node spread through the entire heart and result in regular muscle contraction/relaxation cycles. 
An ECG measures and records the electrical potential changes of these signals. A healthy heart should have a normal rate (between 60 and 100 cycles per minute) and a constant pattern that contains the P wave, QRS complex, and T wave. These waves correspond to the contraction and relaxation of the heart's atria and ventricles. Many cardiac diseases can be identified from ECGs; for instance, Figure 2 shows the ECGs of some typical cardiovascular diseases (CVDs).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/341/1*aS5X-mWlp4k5C8HDMadmrA.png" /><figcaption>Figure 1: Segments in an ECG signal (L. Sherwood, Human physiology: From cells to systems. Boston, MA, USA: Cengage Learning, 2016.)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VG-AQh8XN-OxwNmVwGFIBg.png" /><figcaption>Figure 2: Normal ECG vs. ECGs for CVDs (L. Sherwood, Human physiology: From cells to systems. Boston, MA, USA: Cengage Learning, 2016.)</figcaption></figure><p>ECGs are an essential first-line tool in the diagnosis of many different types of heart conditions. According to the Centers for Disease Control, more than 69 million ECGs are performed annually in the United States during doctor’s office and emergency department visits.</p><p>Taking an ECG test requires a health professional to attach 10 electrodes. These electrodes are used to generate what is called a 12-lead ECG. The accuracy of limb electrode placement does not have much impact on the signal; however, the locations of chest electrodes V1 to V6 are critical. Studies have shown that even trained health professionals can misplace the chest electrodes. One study found that only 10% of participants (doctors, nurses, and cardiac technicians) correctly applied all the electrodes [3]. Misplaced chest electrodes can change the expected signal and result in missing certain diagnoses. 
This motivates us to seek a technical solution that can reduce the time and complexity of ECG collection and the risk of misdiagnosis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/541/1*X9HzWCCqm1GbJLa99RVpOQ.png" /><figcaption>Figure 3: 12-lead ECG electrode placement locations. The position of the limb electrodes can be anywhere along the arms or legs without much effect on the recorded ECG signal.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/565/1*PNQMalfRuZB2drDkn8U5Uw.png" /><figcaption>Figure 4: The three resulting electrode placement locations. These 3 electrode locations will be used to measure ECG leads I and II.</figcaption></figure><p><strong>Research Goal Overview</strong></p><p>The project will evaluate how well machine learning reduces the need for ten electrodes and whether it can reduce the need for error-prone chest placement, thus speeding up the ECG measuring process and eliminating the error associated with chest electrode placement. More specifically, we aim to design a deep learning model that uses signals from only three or two electrodes and can synthesize the remaining missing leads. We will use Lead I and Lead II as shown in Figure 3 and use the signals collected at LA, RA, and LL as the input for the neural net.</p><p><strong>Linearity of the First Six Leads</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BpzanHrbJpm0MhzCCgplkA.png" /><figcaption>Figure 5: Linearity between the first 6 leads</figcaption></figure><p>The first 6 leads of the ECG signal (leads I, II, III, aVL, aVR, and aVF) have linear relationships: given any 2 of the 6 leads, we can compute the remaining 4 signals with high accuracy. Thus, instead of training a machine learning model to reconstruct these leads, we decided to compute the remaining leads given Lead II and Lead III. 
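These linear relations can be written out explicitly. A minimal sketch using leads I and II as the known pair (any two of the six suffice; the helper name is our own):

```python
def derive_limb_leads(lead_i, lead_ii):
    """Standard linear relations among the first six ECG leads:
    III = II - I, aVR = -(I + II)/2, aVL = I - II/2, aVF = II - I/2,
    applied sample by sample to two known lead recordings."""
    lead_iii = [b - a for a, b in zip(lead_i, lead_ii)]
    avr = [-(a + b) / 2 for a, b in zip(lead_i, lead_ii)]
    avl = [a - b / 2 for a, b in zip(lead_i, lead_ii)]
    avf = [b - a / 2 for a, b in zip(lead_i, lead_ii)]
    return lead_iii, avr, avl, avf
```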
The machine learning models discussed in the following paragraphs are used to reconstruct the last six leads of the ECG, i.e., V1–V6.</p><h3>Dataset</h3><p>Two ECG datasets have been used to train our ECG reconstructors, <a href="https://physionet.org/content/ptb-xl/1.0.3/">PTB-XL</a> and <a href="https://zenodo.org/record/4916206">CODE-15%</a>. Both datasets can be easily downloaded from the Internet.</p><h4>Preprocessing</h4><p>The raw signals cannot be directly used as model inputs/references because they are noisy and, more importantly, because they suffer from the baseline wander problem. If the baseline wander is not corrected, it becomes the major source of training loss. To alleviate the problem, both input and reference ECG signals are passed through a bandpass Butterworth filter with cutoff frequencies at 0.5 and 60 Hz. Such a filter removes the baseline wander and high-frequency noise without losing too much information about the ECG signals.</p><p>In each dataset, the ECG signals are annotated with many labels. In particular, many are boolean labels indicating whether the corresponding patient is diagnosed with certain diseases like infarction, hypertrophy, and ischemia. Although these labels have not been taken into account by our reconstructor models, they will become important, e.g., when we measure the accuracy of diagnostics based on the reconstructed signals. Therefore, when we split a dataset into training, validation, and test sets, we must take into account the fair distribution of labels. To ensure this, stratified sampling is applied to the joint distribution of all the labels when a dataset is split. 
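The bandpass filtering step described above can be sketched with SciPy (a minimal illustration; the 500 Hz sampling rate and the filter order are our own assumptions, not values stated in the text):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def remove_baseline_wander(signal, fs=500.0, low=0.5, high=60.0, order=4):
    """Zero-phase Butterworth bandpass (0.5-60 Hz by default) that removes
    baseline wander and high-frequency noise from one ECG lead."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)   # filtfilt avoids phase distortion

# a constant offset is pure baseline, so it is filtered out almost entirely
filtered = remove_baseline_wander(np.ones(2000))
```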
This way, the distribution is consistent across the splits for an arbitrary combination of labels.</p><p>Another problem that we ran into when preparing the dataset is that the CODE-15% dataset, which was used mainly for classification tasks, has many low-quality signals with wide zero paddings and extreme values at signal endings (e.g., the measured voltage can approach near infinity when the electrodes are being taken off the patient’s body). These corner cases were not fatal for classifiers, but they turned out to cause a lot of trouble when training reconstructors, in particular large MSE losses. It is hard to overcome these problems at runtime, so we decided to remove the bad examples when preparing the dataset. The following criteria have been used:</p><ul><li>Because too many signals are zero-padded, and because we do not want to drop too many training examples, we clip all the ECG signals from 4096 to 2934 samples (roughly the 95th percentile of the unpadded lengths) and remove the paddings.</li><li>To remove ECG signals that are too noisy for reconstruction, we pass the signals into the R-peak detector of the <a href="https://github.com/neuropsychology/NeuroKit">neurokit2</a> Python package. A signal is discarded if the number of detected R-peaks is smaller than 4.</li><li>To remove ECG signals that are flat (i.e., do not contain any ECG signal) or have extremely large values, we measure the voltage range of the clipped signals. A signal is discarded if the voltage range is smaller than 0.6 or larger than 9.0 (roughly the 5th and 95th percentiles, respectively).</li></ul><p>For each raw ECG signal, a sliding window is run, and the first sliding window that satisfies all three criteria is adopted. 
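The three criteria and the sliding-window scan can be sketched as follows (a hypothetical minimal version: `count_r_peaks` stands in for neurokit2's R-peak detector, and the step size is our own choice):

```python
import numpy as np

def first_valid_window(signal, count_r_peaks, win=2934,
                       min_range=0.6, max_range=9.0, min_peaks=4, step=256):
    """Return the first length-`win` window of `signal` that passes all
    three quality criteria, or None if no such window exists."""
    signal = np.asarray(signal, dtype=float)
    for start in range(0, max(1, len(signal) - win + 1), step):
        window = signal[start:start + win]
        if len(window) < win:
            break
        vrange = window.max() - window.min()
        if not (min_range <= vrange <= max_range):  # flat or extreme values
            continue
        if count_r_peaks(window) < min_peaks:       # too noisy to find beats
            continue
        return window
    return None
```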
If no sliding window satisfying the three criteria can be found, the ECG signal is discarded.</p><h4>Format Conversion</h4><p>To abstract away the differences between individual ECG datasets, and to accelerate the loading of huge ECG data (~3 GB for PTB-XL and ~30 GB for CODE-15%), the datasets are converted into a common format. The following three formats have been attempted.</p><ul><li>TFRecord (loaded with the <a href="https://github.com/tensorflow/datasets">tensorflow_datasets</a> library). This format provided fair performance at runtime. However, we eventually abandoned it because:<ul><li>The multi-threaded data loading provided by tf.data.Dataset is not fast enough, while using the multiprocessing provided by torch.utils.data.DataLoader causes even more trouble (TensorFlow gets initialized on every worker process).</li><li>Its elements have to be accessed sequentially, so it is really painful to access an element by index in linear time (which means waiting for minutes) when we want to debug our models on a specific ECG signal of interest.</li></ul></li><li>NumPy arrays opened with mmap. This format allows accessing elements by index and provides great performance on our personal computers. However, it turned out to be very slow on network drives, in particular the Google Drive folders mounted on Colab machines. Note that it is impractical to always copy data to the local storage of Colab because the local storage is insufficient for the larger datasets that we are going to use. Therefore, we did not use it in the end.</li><li>HDF5 (loaded with the <a href="https://github.com/h5py/h5py">h5py</a> library). This is the format that we finally adopted. 
It allows accessing elements by index, works happily with the multiprocessing of PyTorch data loaders, and gives good performance on both local devices and Google Colab.</li></ul><h3>Evaluation Metrics</h3><h4>Loss Function</h4><p>Up to now, only the mean squared error (MSE) has been used as the loss function for optimizing the reconstructor models.</p><h4>Metrics</h4><p>Two metrics have been used to report the performance of the reconstructor models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/374/1*QfbOuDT_qCikANluSE1ZKg.png" /><figcaption>Figure 6: Root Mean Square Error</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/374/1*yfs2BwZ-0ZnEMcc2F-6cpg.png" /><figcaption>Figure 7: Pearson Correlation Coefficient</figcaption></figure><ol><li>Root mean square error (RMSE) measures the absolute difference between reconstructed and reference signals. Lower numbers indicate better reconstruction. This metric is simply the square root of MSE, but RMSE is reported instead of MSE to give people more intuition about the magnitude of the differences.</li><li>Pearson correlation coefficient (PearsonR) measures the relative difference between reconstructed and reference signals. Values closer to 1 indicate better reconstruction. Note that the significance test usually attached to PearsonR assumes that both the output and reference distributions are normal. However, the voltage of ECG signals does not follow a normal distribution; the voltage distribution has a spike around the zero point, as shown in the figure below. Therefore, the computed PearsonR values do not carry any probabilistic interpretation. 
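In numpy terms, the two metrics reduce to the following sketch, which treats PearsonR purely as a similarity score:

```python
import numpy as np

def rmse(y_hat, y):
    """Root mean square error between reconstruction and reference."""
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

def pearson_r(y_hat, y):
    """Pearson correlation coefficient, used here only as a closeness score."""
    a = y_hat - y_hat.mean()
    b = y - y.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))
```

A perfect reconstruction gives rmse of 0 and pearson_r of 1.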
Instead, they are just a rough indicator of the relative closeness between reconstructed and reference signals.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/383/1*q-qzCaZLxSssK3llb2ObMg.png" /><figcaption>Figure 8: Voltage distribution of ECG signals, plotted as a histogram with 200 bins.</figcaption></figure><h3><strong>Models</strong></h3><p>All our models take in ECG signals with reduced lead sets and output reconstructed ECG signals.</p><h4><strong>1. Linear Regression</strong></h4><p>We first implement a simple linear regression model that maps the reduced-lead ECG signals to the missing-lead ECG signals. The model is implemented as a single linear layer with no activation function. The number of input channels equals the number of input leads, and the number of output channels equals the number of output leads. The model learns a linear mapping between the reduced-lead ECG signals and the missing-lead ECG signals.</p><p>We chose to implement the linear regression model first for several reasons. The model is simple and fast to train, and it is a suitable baseline for evaluating the performance of more complex models. It is also highly interpretable, as the coefficients of the linear transformation can be directly inspected to determine the relationship between the input features and the output. This explainability also helped us identify and verify, to some extent, the linear relationship among the first 6 leads.</p><h4>2. Convolutional Neural Network (CNN)</h4><p>Next, we implemented a Convolutional Neural Network (CNN) consisting of a stack of 1D convolutional (Conv1d) layers, each followed by a LeakyReLU activation function. The number of Conv1d layers in the stack is determined by the parameter n_layers. Each successive layer has an increasing number of output channels. 
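The linear baseline described above amounts to an ordinary least-squares fit; a self-contained sketch on synthetic data (the lead counts follow the text, but all the data and names here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N time samples of 3 input leads and 6 target leads,
# where the targets are, by construction, a linear mix of the inputs.
N = 5000
X = rng.standard_normal((N, 3))   # reduced lead set
W_true = rng.standard_normal((3, 6))
Y = X @ W_true                    # missing leads

# A single linear layer with no activation is equivalent to a
# least-squares fit of one coefficient column per output lead.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ W
```

The fitted matrix W is directly inspectable, which is the interpretability advantage mentioned above.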
The number of output channels in the final layer, though, equals the number of output leads.</p><p>Each Conv1d layer has a kernel size of 3 and uses ‘same’ padding so that the output size matches the input size, which ensures the length of the reconstructed signals is the same as that of the input signals. Every Conv1d layer extracts local features (such as the QRS complexes) from the reduced ECG lead signals, which are then combined and refined by subsequent layers to form higher-level features that can represent the information from multiple leads to reconstruct the missing channels.</p><p>The LeakyReLU activation function is used after each Conv1d layer to introduce non-linearity into the model, thus enabling the model to learn complex, non-linear relationships between the input ECG leads and the missing ECG leads.</p><h4>3. Long Short-Term Memory (LSTM)</h4><p>Another natural fit for processing temporal sequences is the Long Short-Term Memory (LSTM) model. It is a type of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies and patterns in temporal data, which is useful for our task of reconstructing missing ECG leads.</p><p>The model has num_layers LSTM layers with n_hidden hidden units in each layer. The input_size of the LSTM layer is the number of input leads. The output of the LSTM is passed through a LeakyReLU activation function to introduce non-linearity. Finally, the output is passed through a linear layer to produce the reconstructed ECG signal.</p><h4>4. U-Net</h4><p>As our CNN showed good performance in reconstructing ECG leads, we wanted to try a more complex CNN-based model to see if it could further improve the performance. We chose to implement the U-Net model, a popular CNN architecture that excels at many tasks. The U-Net model consists of an encoder and a decoder. 
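The ‘same’-padding property that keeps the output length equal to the input length can be illustrated with a hand-rolled kernel-size-3 convolution. This is a numpy sketch for illustration, not the actual model code:

```python
import numpy as np

def conv1d_same(x, kernel):
    """1D convolution with zero 'same' padding: output length == input length."""
    k = kernel.size
    pad = k // 2                 # for k=3, pad one sample on each side
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(x.size)])

x = np.sin(np.linspace(0, 4 * np.pi, 256))        # toy single-lead signal
smoothed = conv1d_same(x, np.array([1/3, 1/3, 1/3]))
```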
The encoder network downsamples the input while the decoder network upsamples it back and, in our case, reconstructs the missing leads in the process. The architecture is designed to capture both local and global context information, which is useful for reconstructing ECG leads. Our implementation of U-Net uses 1D convolutional layers instead of the 2D convolutional layers traditionally used for image segmentation tasks.</p><h4>5. UnetFormer</h4><p>While U-Net, with its encoder and decoder networks, is good at capturing local and global context information, not all time instances of the ECG signal are equally important. Therefore, as a next step after U-Net, we propose a Transformer-based model to further improve the performance of ECG reconstruction, which we name UnetFormer. UnetFormer is a sequence-to-sequence model based on the attention mechanism. Like U-Net, it also has an encoder and a decoder network.</p><p>In our implementation, PositionalEncoding is responsible for encoding positional information into the input ECG sequence before feeding the input into the Transformer’s encoder. The Transformer’s encoder consists of a series of convolutional layers that transform the positionally encoded input sequence into a sequence of embeddings. Mirroring the encoder, the decoder uses a series of convolutional layers to transform the Transformer output back into a sequence of ECG signals, utilizing the information gathered by the encoder to reconstruct the ECG signal in the process.</p><p>UnetFormer is a solid choice for reconstructing ECG signals since it can effectively model the complex relationships between the different ECG leads. However, it cannot capture local context information as efficiently as U-Net. Therefore, we proposed a hybrid model that combines the strengths of both U-Net and the Transformer. The hybrid model wraps a Transformer inside a U-Net. 
The U-Net encodes the input, passes the result to the Transformer, and then reconstructs the signal based on the Transformer’s output. The U-Net is responsible for capturing the local context information, while the Transformer is responsible for capturing global dependencies.</p><h3>Results</h3><h4><strong>Quantitative analysis</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/587/1*W7GxwocZcIWs6WRU9jhGgg.png" /><figcaption>Table 1: Model Performance Comparison</figcaption></figure><p>The results for models trained and tested on the PTB-XL and Code15% datasets are summarized in Table 1. We also recorded their floating-point operation counts in units of 10⁹ (GFLOPs) as a reflection of their computational complexity (the lower the better). In terms of both PearsonR and RMSE, UnetFormer performs slightly better than U-net. On PTB-XL, UnetFormer outperforms U-net on RMSE by 0.28%, while U-net is 0.03% higher in PearsonR. On Code15%, UnetFormer outperforms U-net on both PearsonR and RMSE, by 1.26% and 0.007%. Though UnetFormer does not lead by a large margin, it has 38.9% less computational complexity than U-net, proving its efficiency in temporal signal understanding and making it a more favourable choice. In addition, looking at the overall model performance on each dataset, all models achieve better results on PTB-XL than on Code15%. This reflects the noisiness and difficulty of each dataset, which agrees with our observations while preparing them. The largest performance drop occurs for LR, which has a 39.4% decline in PearsonR and a 106.4% increase in RMSE. Though the other models also have a much worse RMSE, their PearsonR values still maintain a similar level, implying that the ability to capture non-local information of the signal is indispensable for reconstructing noisy signals. 
As U-net and UnetFormer are both competent at capturing the ECG waveform patterns of each output lead, which serve as a guide for reconstruction, they are able to obtain a more accurate mapping from the 3 input ECG leads to the last 6 leads.</p><h4><strong>Qualitative analysis</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dOfm65k5e7h10KTGOCt-xg.png" /><figcaption>Figure 8: Reconstruction visualization for samples from PTB-XL and Code15%. The first figure on the left is a reconstruction of lead V4 for a sample from PTB-XL. The second figure from the left is a reconstruction of lead V6 for a sample from PTB-XL. The blue curves are the ground truth, while the red curves are the reconstructions.</figcaption></figure><p>Looking at the model reconstruction results for some of the samples, it is obvious that LR often produces a much smaller amplitude in the reconstructed signal. CNN matches both the amplitude and the waveform much better than LR, but we can still clearly identify its gap with the rest of the models. For LSTM, U-net, and UnetFormer, the differences are minimal most of the time. On the same sample, one of the 3 models may handle some parts of the signal better than the others, but may also have issues in other parts that the other models handle well, making it hard to conclude which model provides the best reconstruction from visual inspection alone.</p><h3>Conclusion</h3><p>In this work, we applied machine learning techniques to reconstruct 12-lead ECG signals using only 3 input leads. We developed, trained, and evaluated five different machine learning models, and compared their performance using RMSE and the Pearson correlation coefficient. Among the five candidates, UnetFormer outperforms the rest of the models and achieves the highest PearsonR and the lowest RMSE. 
One interesting future research direction would be to reduce the number of input leads (for instance, using only 1 or 2 leads to reconstruct the complete 12-lead ECG signals). This may require other machine learning models that we have not explored in this work, such as GANs.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=49cf3b45e9bf" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Wall Street Bots 2: Crypto Price Prediction Using Machine Learning]]></title>
            <link>https://utorontomist.medium.com/wall-street-bots-2-crypto-price-prediction-using-machine-learning-9487a082880f?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/9487a082880f</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[quantitative-finance]]></category>
            <category><![CDATA[university-of-toronto]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Thu, 01 Jun 2023 23:01:45 GMT</pubDate>
            <atom:updated>2023-06-01T23:01:45.611Z</atom:updated>
<content:encoded><![CDATA[<p><em>A UTMIST Project by: Lisa Yu, Fernando Assad, Andrew Huang, Eliss Hui, Anand Karki, Sujit Magesh, Geting (Janice) Qin, Peter Shi, Dav Vrat, Nick Wood, Randolph Zhang.</em></p><p>WallStreetBots2 is a 6-month project that applies machine learning techniques to predict cryptocurrency prices from Twitter tweets and market data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/928/1*2uds7ifWFlBKJAiglwpI_g.png" /></figure><h4>1. Background</h4><p><strong><em>1.1. Motivation</em></strong></p><p>There is currently a high level of interest and investment in cryptocurrency markets, but the complex and ever-changing nature of these markets has made it challenging for investors to accurately predict price movements. In contrast to other financial instruments like stocks, bonds, and options, cryptocurrencies have high-quality market data that are freely accessible to the public.</p><p>As a result, there has been a growing interest in applying machine learning and natural language processing techniques to cryptocurrency price prediction. Machine learning algorithms are designed to learn patterns and relationships in data, which can be useful in predicting the direction and magnitude of future cryptocurrency price movements. On the other hand, natural language processing can be used to analyze social media sentiment and news articles pertaining to cryptocurrencies, which can help gain insight into market trends.</p><p>Our models are trained on data scraped from Twitter as well as high-frequency market data from CoinAPI, Alpaca, and other sources.</p><p>This article will highlight our research on forecasting short- and medium-term returns in cryptocurrency markets using machine learning and natural language processing techniques.</p><p><strong><em>1.2. 
Previous Iteration of WSB</em></strong></p><p>WallStreetBots-Crypto is a continuation of last year’s iteration of the <a href="https://utorontomist.medium.com/wall-street-bots-building-an-automatic-stock-trading-platform-based-on-artificial-intelligence-4f49df89ac41">WallStreetBots project</a>. Previously, WSB built a stock-trading AI that attempted to predict the next-minute price of common “meme stocks” like GME based on real-time Reddit sentiments and implemented several portfolio optimization techniques to periodically rebalance the portfolio for risk-return optimization. The entire pipeline was implemented on the WallStreetBots terminal (<a href="http://www.wallstreetbots.org/">http://www.wallstreetbots.org/</a>). See a screenshot of the terminal dashboard below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*9R3wgaYgiVIdynIn.png" /></figure><p>Although last year’s WSB project was able to achieve over 66% directional prediction accuracy, we found that the greatest limitation was the lack of publicly available high-quality market data for stocks. These data, such as the exchange trades data, are only accessible to professional trading firms. Hence, in this iteration of the project, we shift our focus to predicting cryptocurrency prices given their transparent nature.</p><p><strong><em>1.3. Previous Academic Work</em></strong></p><p><em>1.3.1 Twitter Sentiment Analysis</em></p><p>In recent years, researchers have started to notice the power of Natural Language Processing (NLP) and text mining for predicting the financial market. Using user sentiment on social media platforms to predict various financial assets has become an active area of research. One of the most widely used social media platforms in this research is Twitter. On this platform, users can post short texts called “tweets”, which contain sentiments and moods. Since Twitter was founded in 2006, it has become increasingly popular. 
In 2022, Twitter announced that it had 368 million monthly active users. Due to the popularity of Twitter, there has been a lot of research into detecting sentiment in tweets. For example, the Valence Aware Dictionary for Sentiment Reasoning (VADER) is a rule-based model for analyzing sentiment in social media text that achieves an F1 score of 0.96, outperforming individual human raters with an F1 score of 0.84. There is, however, research showing that around 14% of Twitter content for Bitcoin is sent from bots (Kraaijeveld and Smedt, 2020). Nevertheless, previous research from Antweiler and Frank (2004) and Bollen et al. (2011) demonstrates that using social media texts can help in predicting the market.</p><p><em>1.3.2 Bitcoin Price Prediction and Forecasting</em></p><p>In 2017, Stenqvist and Lönnö explored the use of sentiment analysis on Twitter data related to Bitcoin to predict its price fluctuations. In their work, a naive prediction model is presented, which shows that the most accurate aggregation time for predictions is 1 hour, predicting a Bitcoin price change 4 hours in the future. Moreover, Mallqui and Fernandes (2019) examined various machine learning models to predict price direction, with the best model in their work achieving a 62.91% directional accuracy. Multiple machine learning models are also used in the work of Chen et al. (2020); the results show that the Long Short-term Memory (LSTM) model achieves better performance than other methods using the previous exchange rate, such as autoregressive integrated moving average (ARIMA), support vector regression (SVR), etc. During the same year, Kraaijeveld and Smedt (2020) showed that Twitter sentiments can be used to predict the price of Bitcoin, Bitcoin Cash, and Litecoin through Granger causality testing.</p><p><strong><em>1.4. 
Trading Intuition</em></strong></p><p>In addition to relevant academic research, it is also important to consider the trading intuition that explains the fundamental crypto price movements driven by market supply and demand. Throughout the project, we ensure that our results are aligned with trading intuitions.</p><h4>2. Data Collection</h4><p><strong><em>2.1. Tweets for NLP</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/571/1*iBTLOl8khYjF_EIdKa2eIg.png" /></figure><p><em>2.1.1. Scraping and Cleaning</em></p><p>A tweet scraping script was developed and executed using Snscrape to collect data on 14 different cryptocurrencies (however, this project only explored price predictions for Bitcoin). The script required a specified date range and retrieved 100 tweets per hour within that range. The script incorporated a set of heuristics to mitigate spam accounts that post tweets on Twitter. The heuristics approved accounts that were verified, and rejected accounts with low follower-to-following ratios (specifically, those whose follower count is less than 10% of their following count), accounts that tweet over 200 times daily on average, and accounts that follow the maximum number of other accounts enforced by Twitter.</p><p><em>2.1.2. NLP Preprocessing — Sentiment Labeling of Tweets</em></p><p>The raw dataset of Tweets was processed using two pre-trained NLP sentiment analysis models in Python. 
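The account heuristics from 2.1.1 can be sketched as a simple predicate. This is a minimal sketch: the follow cap is a hypothetical placeholder, since the actual limit enforced by Twitter is not stated in the text.

```python
# Hypothetical follow cap, used only for illustration.
FOLLOW_CAP = 5000

def accept_account(verified, followers, following, tweets_per_day):
    """Spam heuristics from the text: verified accounts pass; otherwise
    reject low follower/following ratios, hyperactive tweeters, and
    accounts at the follow cap."""
    if verified:
        return True
    if following > 0 and followers / following < 0.10:
        return False
    if tweets_per_day > 200:
        return False
    if following >= FOLLOW_CAP:
        return False
    return True
```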
The models used were VADER (Valence Aware Dictionary for sEntiment Reasoning) and an implementation of Google’s T5 model fine-tuned for emotion recognition.</p><p>VADER provides a single feature (“sentiment_score”) that measures the overall positivity/negativity of a Tweet, represented as a floating-point number in the range of (-1, +1), where -1 indicates high negativity and +1 represents high positivity.</p><p>The fine-tuned Google T5 model provides features that measure the intensity of specific emotions: happy, sad, anger, fear, and surprise. These features are represented as floating-point numbers in the range of (0, 1), where 0 indicates no detection, and 1 indicates strong detection.</p><p>Multiple cloud computing instances from Google Cloud Platform were used to apply the two selected models to the full dataset of Tweets.</p><p><em>2.1.3. Processed Dataset Creation</em></p><p>Datasets with averaged NLP features were created for a variety of frequency intervals (1-hour, 4-hour, 12-hour, 1-day), merged with the corresponding frequency of Bitcoin price data on the Binance exchange provided by CoinAPI. These datasets provide measurements for the average intensity of features for each given time interval, allowing for NLP features to be correlated with the actual log return of Bitcoin across each period. Below is a table containing the features produced through manipulation of raw sentiment metrics across specific intervals.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/986/1*9Or1Mt3vHsN55A-T0Sb1Ug.png" /></figure><p><strong><em>2.2. Market Data</em></strong></p><p>The project collected various market data by first coming up with a hypothesis, based on trading intuition, for why certain data should be a driver of Bitcoin price. Then exploratory cross-correlation analysis was conducted on each feature (at time t) and Bitcoin’s shifted log returns (at time t+shift) to check for linear relationships. 
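A minimal sketch of that lagged-correlation check, on synthetic series (the real analysis used the collected features and returns):

```python
import numpy as np

def lagged_corr(feature, log_returns, shift):
    """Pearson correlation between feature at time t and log-returns at
    time t + shift."""
    x = feature[:-shift] if shift else feature
    y = log_returns[shift:] if shift else log_returns
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(1)
r = rng.standard_normal(1000) * 1e-3   # toy log-returns
feat = np.roll(r, -1)                  # a feature that "leads" returns by 1
```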
The team also conducted checks for non-linear correlation based on hypothesis tests. We do not report any non-linear correlation results since none were significant. However, we note that the implementation of the non-linear correlation test had a high false negative rate. See the table below for the features and correlation test results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tfMcsq-cCTnsVBMwXedUFw.jpeg" /><figcaption>Market Data Features and Correlation Results</figcaption></figure><p>In the market data collection process, we note in particular the Python scripts used to collect the data features from CoinAPI. Scripts were written to generate CoinAPI keys, collect, and pre-process data. The first script generates CoinAPI keys and stores them for later use. A second script was used to retrieve Bitcoin to USDT and Ethereum to USDT price data from the Binance and FTX exchanges, sourced from CoinAPI. Notice that we collected both Bitcoin and Ethereum data on both the Binance and FTX exchanges because we wanted to check if there are any lead-lag relationships between the two coins’ prices on the two different exchanges. Due to the limitation of each key being able to send 100 requests every 24 hours, the script utilized a sliding window to ensure that each key is utilized to its capacity without exceeding the request limit. The data was collected and preprocessed at a rate of 1 million entries per hour (equivalent to 1.9 years of data per hour). Subsequently, the data was cleaned, and logarithmic return values were computed. Features for data collected included<em> “Time_period_start”, “Time_period_end”, “Asset_ID”, “Count”, “Open”, “High”, “Low”, “Close”, “Volume”</em>, and <em>“Opening Day”</em>, with period = 1 minute. These features were used to calculate additional features, including “Log_returns”, the natural log of the closing price divided by the opening price. 
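The “Log_returns” computation is one line (the open/close values below are purely illustrative):

```python
import numpy as np

# Toy 1-minute bars (hypothetical prices).
close = np.array([27000.0, 27135.0, 26980.0, 27050.0])
open_ = np.array([26950.0, 27000.0, 27135.0, 26980.0])

log_returns = np.log(close / open_)   # ln(close / open), per bar
```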
Other preprocessing included transforming timestamps from UTC to UNIX format.</p><p>Similar Python scripts were used to collect additional second-by-second Bitcoin limit order book and historical trades data. Intuitively, this data shows how much Bitcoin market participants are willing to buy or sell at each price. It gives insight into the supply and demand of Bitcoin, which fundamentally drives any price changes. An example limit order book (also known as a trading ladder) is shown in the following figure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/617/0*M_Vss2KjRW5eqR5E.png" /><figcaption>Example Limit Order Book</figcaption></figure><p>For the price in each row, the left column shows the aggregated volume that people on the exchange are willing to buy at that price, and the right column shows the aggregated volume that people are willing to sell at that price. In the previous figure, we call 2168.00 the level 1 ask price and 290 the level 1 ask volume. Similarly, we call 2167.75 the level 1 bid price and 477 the level 1 bid volume. From CoinAPI, we collected the book data 2 levels deep for Bitcoin on the Binance exchange (features include<em> “asks_lvl1_price”, “asks_lvl1_size”, “asks_lvl2_price”, “asks_lvl2_size”, “bids_lvl1_price”, “bids_lvl1_size”, “bids_lvl2_price”, “bids_lvl2_size”</em>). 
From these, we further engineered the following features:</p><ul><li><em>“bid_ask_spread” = “asks_lvl1_price” - “bids_lvl1_price”</em></li><li><em>“bid_ask_strength” = “bids_lvl1_size” + “bids_lvl2_size” - “asks_lvl1_size” - “asks_lvl2_size”</em></li></ul><p>In addition to the limit book, which can be thought of as snapshots at the beginning of each second, we also collected the aggregated data for all the trades that were matched and executed over each second interval, getting the following features:</p><ul><li>“total_buy_size”</li><li>“total_sell_size”</li><li>“buy_sell_strength” = “total_buy_size” - “total_sell_size”</li><li>“mean_trade_price”</li><li>“min_trade_price”</li><li>“max_trade_price”</li></ul><p>Overall, we see that the Bitcoin block features and the book-and-trades features have significant correlations with lagged Bitcoin log-returns. Hence, we proceeded to train various ML models on these features.</p><h4>3. Fitting ML Models</h4><p>Based on our choice of input features at time t, we attempt to predict the log-return of Bitcoin at time t+1, where each time interval could be one second, one minute, one hour, or one day depending on the dataset. This is a regression problem. All model hyperparameters are tuned using grid search unless otherwise stated. At this stage, we only evaluated the models based on directional accuracy and MSE. We will proceed to evaluate the promising models more thoroughly using more metrics in the following section.</p><p><strong><em>3.1. Baseline Model — Simple Moving Average</em></strong></p><p>We considered the simple moving average model with a window size of 3 as a baseline model and achieved a directional accuracy of 50.6% and an MSE of 2.548e-06.</p><p><strong><em>3.2. ARMA Model with Historical Returns</em></strong></p><p>Univariate ARMA-class models were considered. Testing was done on the G-Research Kaggle Dataset on Bitcoin log-returns using a simple ARMA model with automatic selection of data and seasonal lags. 
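For concreteness, the Section 3.1 moving-average baseline and the directional-accuracy metric used throughout can be sketched as follows (synthetic inputs; the reported numbers above came from the real datasets):

```python
import numpy as np

def sma_forecast(returns, window=3):
    """Predict the next log-return as the mean of the previous `window`
    returns (the Section 3.1 baseline, re-sketched)."""
    r = np.asarray(returns, dtype=float)
    preds = np.convolve(r, np.ones(window) / window, mode="valid")[:-1]
    actual = r[window:]
    return preds, actual

def directional_accuracy(preds, actual):
    """Fraction of predictions whose sign matches the realized return."""
    return float(np.mean(np.sign(preds) == np.sign(actual)))
```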
The MSE on the testing dataset was 1.32e-06 and the out-of-sample directional accuracy was 51%. It should be noted that out-of-sample predictions only utilized the first data point of the test dataset for forecasting.</p><p><strong><em>3.3. Support Vector Machine (SVM) with NLP Sentiments Data</em></strong></p><p>One of the initial approaches for predicting the log return of Bitcoin prices is employing support vector machines (SVM) on the average sentiment and mood scores per hour and day in the tweets data we collected. In particular, there are five moods: happy, sad, angry, surprised, and fear. After splitting the dataset based on the follower counts of the corresponding Twitter user of each tweet, there are 18 matrices. In particular, we used multiples of the average follower counts to separate the data. The tweets are categorized into three categories: low (lower than 2/3 of the average follower count), middle (between 2/3 and 3/2 of the average follower count, inclusive), and high (higher than 3/2 of the average follower count). With a Support Vector Regression (SVR) model from the scikit-learn library in Python, tuning the regularization parameter and the tolerance parameter, the model achieves a directional accuracy of 56.3% with a mean squared error of 0.00665 and a mean absolute error of 0.0522 when using the daily data. However, SVM gave worse results when we shortened the time interval of the data to one hour: the model has a directional accuracy of 54.0% but an MSE of 1.61 and an MAE of 0.807. This increase in error signifies that the SVM with a linear kernel does not capture the data’s characteristics due to the data’s complexity.</p><p><strong><em>3.4. 
XGBoost Model with NLP Account, Tweet, Mood and Volume/Price Data</em></strong></p><p>An XGBoost (XGB) model was trained on hourly aggregate means of Account (follower, following), Tweet (likes, replies, retweets), and mood (happy, sad, fear, angry, surprise) metrics of the tweets scraped, in addition to the Bitcoin volume and price data. This gave an MSE of 3.104e-05 and an accuracy of 53.1%.</p><p><strong><em>3.5. LSTM with NLP Sentiments Data</em></strong></p><p>The processed NLP data described above were used to train Keras LSTM models for each of the specified intervals. An LSTM model with 100 units followed by a dropout and dense layer was used, and hyperparameters were tuned using the Bayesian hyperparameter search available from <a href="https://wandb.ai/site">Weights and Biases</a>. The input variables were the 15 features described above, as well as the log return values (for previous periods). A 2:1 training/testing split was used, allowing for the model’s performance to be tested on ~1 year of unseen historical cryptocurrency data.</p><p>The test period of the best performing model trained on 24-hour intervals of averaged sentiment gave an overall directional accuracy of 56% and an MSE of 0.0013.</p><p>One notable trend that we consistently observed across our models was a strong correlation between longer sentiment intervals and improved trading performance. The worst performing model was trained on sentiment intervals of 1 hour, and the best performing model was the one trained on the 24-hour interval. This has several possible implications: it is possible that the effects of Twitter sentiments on the cryptocurrency market are delayed, and tend to occur well after Tweets are published. Alternatively, it is possible that averaged sentiment metrics are more statistically significant when sampled across larger time intervals, and therefore provide a better indication of the direction that the market will take.</p><p><strong><em>3.6. 
Linear Regression with Price and Volume Data</em></strong></p><p>The linear regression model is used to predict the Bitcoin log returns using the price and volume data. A Gaussian linear model carries the following assumptions: linearity, equal variance, normality, and independence. In practice, the log-return of a stock price approximately follows a Gaussian distribution. In this case, it is assumed that the same distribution applies to cryptocurrency, satisfying normality and equal variance. However, no other information is available about linearity and independence. A regression model has to be created to examine the fit of the linear model. Before fitting the model, the dataset has to be cleaned. The main concern in the dataset is multicollinearity. Multicollinearity happens when there are variables that have high correlation with each other. In the training dataset, correlation can be found between count and volume, and count is eventually dropped. The final model has volume, high, low, open, and close as parameters that predict log-return. The parameters have statistical significance. However, the linear correlation between log-return and the covariates is very low (all below 0.01). This suggests that a linear model is not a good choice for predicting log-return. The MSE is 3.16044e-06 and the directional accuracy is 50.6%.</p><p><strong><em>3.7. Single Variable LSTM with Price Data</em></strong></p><p>A single-variable LSTM model with 50 units, a dropout layer, and a dense layer was trained to predict the next-minute Bitcoin prices based on the historical Bitcoin log returns with a window size of 10. The model was trained using stochastic gradient descent and MSE loss. The model was tuned using grid search. The final test directional accuracy was 49.8% and the MSE was 4.95e-07.</p><p><strong><em>3.8. 
Multivariable LSTM with Book and Trades Data</em></strong></p><p>A multivariable LSTM model with 100 units, a dropout layer, and a dense layer was trained on the book and trades dataset to predict the next second’s mean, min, and max trade price of Bitcoin by predicting the log returns of these values. Predicting not just the mean but also the min/max trade prices lets us estimate a range of probable trade prices over the next second. This information could be useful for market makers deciding to pull their quotes when the book is about to be eaten up, or to become an aggressor and front-run upcoming trends. However, for the purpose of this project, we evaluate only the mean-price predictions, for compatibility with the other datasets/models explored. The multivariable features were reframed from the time-series data using a window size of 10. The final model was trained using the Adam optimizer (SGD yielded suboptimal results) with a decayed learning rate and MSE loss, and was tuned using grid search. The final test directional accuracy was 66.6% and the MSE was 1.45e-09.</p><p><strong><em>3.9. HMM with Book and Trades Data</em></strong></p><p>A multivariable Gaussian Hidden Markov Model (HMM) with 25 hidden states was also fitted on the Bitcoin book and trades dataset to predict the next second’s mean, min, and max trade price of Bitcoin by predicting the log returns of these values. One notable pre-processing step prior to fitting is the normalization of all features to a sufficiently small range. For example, the level-1 ask price of the book is converted to its nominal difference from the mean trade price. This allows the HMM to fit the data better with a small finite number of states, since the model assumes the hidden states are discrete. 
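As an illustration, the two pre-processing steps described above — reframing the time-series features into fixed-size windows for the LSTM, and expressing book price levels relative to the mean trade price for the HMM — might look like the following NumPy sketch. The array names and shapes here are assumptions for the example, not the project’s actual code.

```python
import numpy as np

def price_relative(book_prices, mean_trade_price):
    """Convert absolute price levels (e.g. the level-1 ask) into nominal
    differences from the concurrent mean trade price (HMM pre-processing)."""
    return book_prices - mean_trade_price[:, None]

def make_windows(features, targets, window=10):
    """Reframe a (T, F) feature matrix into (T - window, window, F)
    sliding windows paired with the next-step target (LSTM pre-processing)."""
    X = np.stack([features[i : i + window] for i in range(len(features) - window)])
    y = targets[window:]
    return X, y

# Toy example: 100 time steps, 3 book features
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 3))
mean_px = rng.normal(loc=30000, scale=5, size=100)
rel = price_relative(feats + 30000, mean_px)          # (100, 3) relative prices
X, y = make_windows(feats, rng.normal(size=100))      # (90, 10, 3) and (90,)
```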
The final test directional accuracy is 57.4% and the MSE is 1.89e-09.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/774/1*fdMUVqPpXG2xGzmCw3PIDA.png" /><figcaption>Summary of All Models Fitted</figcaption></figure><h4>4. Further Evaluation of LSTM with NLP Sentiments Data and LSTM with Book and Trades Data</h4><p>Our best model fitted on NLP data was the LSTM, and our best model fitted on market data was the LSTM trained on book and trades data. Let’s evaluate these two models more thoroughly, using more metrics than just directional accuracy and MSE. It is important to note the difference in data frequency between the two models: the LSTM fitted on NLP sentiments predicts day-by-day log returns, whereas the LSTM fitted on the book-and-trades dataset predicts second-by-second log returns. Predicting over longer intervals is a more difficult task, which should be kept in mind when directly comparing the performance of the two models.</p><p><strong><em>4.1. Performance Evaluation Metrics</em></strong></p><p>Evaluation metrics are essential for properly assessing the performance of each of our models. We use two main types of metrics in this project: financial and statistical.</p><p><em>4.1.1. Financial Metrics</em></p><p>Financial metrics capture the performance of the models in the context of risk and profitability in the market. We graph the cumulative return of a hypothetical $1 portfolio over time, with reinvestment at each time step, where the entire portfolio takes either a long or a short position based on our model’s predictions. This can be compared to the returns of the benchmark portfolio of holding or shorting $1 of Bitcoin over the same investment period. To reduce the risk of wrong predictions, we could also consider a cutoff value, so that our hypothetical portfolio only takes a long/short position if the predicted return is larger than the cutoff. 
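A minimal NumPy sketch of this hypothetical $1 long/short backtest, with an optional cutoff, might look as follows; the function names, the zero risk-free rate, and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def backtest(pred_returns, actual_log_returns, cutoff=0.0):
    """Equity curve of a $1 portfolio: long when the predicted return exceeds
    the cutoff, short when it is below -cutoff, otherwise flat."""
    position = np.where(pred_returns > cutoff, 1.0,
                        np.where(pred_returns < -cutoff, -1.0, 0.0))
    strat_log_returns = position * actual_log_returns
    return np.exp(np.cumsum(strat_log_returns))  # reinvested at every step

def max_drawdown(equity):
    """Largest peak-to-trough decline of the equity curve."""
    peaks = np.maximum.accumulate(equity)
    return np.max((peaks - equity) / peaks)

# Toy data: 5 steps of predicted vs. realized log returns
pred = np.array([0.01, -0.02, 0.005, -0.001, 0.03])
real = np.array([0.012, -0.015, -0.004, 0.002, 0.025])
equity = backtest(pred, real)
```

The same curve can be compared against `np.exp(np.cumsum(real))`, the benchmark of simply holding the asset over the test period.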
This hypothetical portfolio return curve can also be inspected for large drawdowns or fluctuations, to ensure stable returns over time. We also report the portfolio’s maximum drawdown, its Sharpe ratio, and its Sortino ratio (the Sortino ratio penalizes only the risk associated with negative returns, not positive ones). In general, a model with a Sharpe ratio greater than 1 is considered good.</p><p><em>4.1.2. Statistical Metrics</em></p><p>In addition to financial metrics, we also considered statistical metrics, including the directional accuracy of the predictions, MSE, the confusion matrix, and the F1 score. We also visualized a scatterplot of the actual vs. predicted log returns, the distribution of wrong predictions, and some sample predictions.</p><p><strong><em>4.2. Performance of LSTM Fitted on NLP Sentiments Data</em></strong></p><p>For the LSTM trained on the NLP sentiments (24-hour interval) data, the final test directional accuracy was 56.5% and the MSE was 0.0013. The hypothetical $1 portfolio trading with reinvestment on our predictions over the test period of ~400 days made an approximately 7% return, outperforming holding Bitcoin by ~6 times. The portfolio has a Sharpe ratio of 0.154, a Sortino ratio of 0.163, and a max drawdown of 15.38%. See the graphs below for a visualization of some log return predictions, along with the confusion matrix if we view the task as a binary classification problem. The F1 score is 0.564.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uK3e0vHhbzPmnez3feOuWg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/896/1*JtCpy8XiH4oFlW4FEB0N_g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/893/1*VdCqy7qF8e7L4ryTd61zyg.png" /></figure><p><strong><em>4.3. Performance of LSTM Fitted on Book and Trades Data</em></strong></p><p>For the LSTM trained on the book-and-trades data, the final test directional accuracy was 66.6% and the MSE was 1.45e-09. 
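The statistical metrics reported throughout — directional accuracy, MSE, and the F1 score derived from the up/down confusion matrix — can be computed in a few lines of NumPy. This is an illustrative sketch, not the project’s actual evaluation code.

```python
import numpy as np

def directional_metrics(pred, actual):
    """Treat up/down prediction as binary classification ('up' = positive)."""
    up_pred, up_act = pred > 0, actual > 0
    accuracy = np.mean(up_pred == up_act)
    mse = np.mean((pred - actual) ** 2)
    # Confusion-matrix counts for the 'up' class
    tp = np.sum(up_pred & up_act)
    fp = np.sum(up_pred & ~up_act)
    fn = np.sum(~up_pred & up_act)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, mse, f1

# Toy example: 4 predicted vs. realized log returns
pred = np.array([0.01, -0.02, 0.005, -0.001])
actual = np.array([0.012, -0.015, -0.004, 0.002])
acc, mse, f1 = directional_metrics(pred, actual)
```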
The hypothetical $1 portfolio trading with reinvestment on our predictions over the test period of ~3500 seconds made an approximately 5% return, outperforming holding Bitcoin by 47 times. The portfolio has a Sharpe ratio of 0.32, a Sortino ratio of 0.48, and a max drawdown of 0.02%. If we trade only on signals whose predicted returns exceed a cutoff, we obtain a Sharpe ratio above 1.2 and a Sortino ratio above 2.8. See the graphs below for a visualization of some log return predictions, along with the confusion matrix if we view the task as a binary classification problem. The F1 score is 0.625. From the distribution of wrong predictions, we see that many of them occur when the actual movement of the Bitcoin price is close to 0; such small movements may simply be noise.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yD9331IOy2_GE4w_LoLEbQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*opZkI4gItlfMk9AgwjSY5Q.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D1qTxSw4oEDZvbbz9ZL2Bg.png" /></figure><p>Given the strong performance of both models, we wondered whether combining the two datasets would yield better results than either individual model. The datasets have different frequencies: the NLP data is hourly, while the book/trade data is per second. Due to the time constraints of the project, we combined them by repeating each NLP sentiment data point for every book/trade data point, so that every data point in the merged dataset carries the latest NLP data strictly earlier than it, avoiding look-ahead bias. The combined dataset was used to train a multivariable LSTM model. However, the result was worse than the LSTM trained on book-and-trades data alone. 
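This “latest strictly earlier” alignment is exactly an as-of join. With pandas it could be sketched as follows; the column names and timestamps are assumptions for the example, not the project’s actual data.

```python
import pandas as pd

# Hypothetical frames: per-second book/trade rows and hourly NLP sentiment rows
book = pd.DataFrame({
    "ts": pd.to_datetime(["2022-01-01 00:59:59", "2022-01-01 01:00:00",
                          "2022-01-01 01:00:01"]),
    "mean_price": [46000.0, 46010.0, 46005.0],
})
nlp = pd.DataFrame({
    "ts": pd.to_datetime(["2022-01-01 00:00:00", "2022-01-01 01:00:00"]),
    "sentiment": [0.2, -0.1],
})

# allow_exact_matches=False attaches the latest NLP row *strictly before* each
# tick, which avoids look-ahead bias at the interval boundary.
merged = pd.merge_asof(book, nlp, on="ts", allow_exact_matches=False)
print(merged["sentiment"].tolist())  # [0.2, 0.2, -0.1]
```

Note that `merge_asof` requires both frames to be sorted on the join key, which time-series data usually already is.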
There are possible explanations for this: the NLP data points are naively repeated, and a time-encoded positional embedding may be more appropriate. Given the difference in frequency between the two datasets, the information in an NLP data point may be misleading for the data points that are “further” from it.</p><h4>5. Concluding Thoughts</h4><p><strong><em>5.1. Limitations</em></strong></p><p>So far, we have shown that our model and trading strategy are powerful, able to generate huge returns in only a few hours. However, there are several limitations that traders must consider when implementing this model and strategy in real time.</p><p>One significant limitation of high-frequency trading models is the requirement for real-time data. This data is essential for making accurate predictions, but it may not always be readily available, and some data sources may be behind paywalls. Obtaining and analyzing real-time data automatically can be a challenging and time-consuming process, which limits the accessibility of this type of trading.</p><p>Another critical limitation is trading costs. High-frequency trading involves making many trades over a short period, each of which incurs costs. These costs can quickly add up and, if not accounted for, can significantly reduce the profits generated by the strategy. Moreover, retail traders often pay higher trading fees than institutional traders, which further increases the cost of trading.</p><p>Furthermore, recent changes in Twitter’s API policy have restricted the data accessible through the platform. Twitter has been a key source of data for natural language processing (NLP), which is an essential part of this strategy. The unavailability of Twitter data limits the sources available for analysis, which reduces the effectiveness of the strategy. 
This also raises questions about the speed at which the data will be available and its accuracy, as a key feature of the model is the immediate availability of the data.</p><p>In addition, high-frequency trading models require significant computational power, which may not be feasible for retail traders or smaller trading firms. The model’s predictive power comes from complex neural networks that demand a generous amount of computing power, which may not be available on demand. The need for low-latency connections further complicates the infrastructure requirements. Achieving low latency is essential to ensure that trades are executed quickly, at the right time and price. However, implementing the infrastructure required for low-latency connections can be expensive and challenging.</p><p>Overall, this strategy has a number of limitations, but none are insurmountable. With the right strategy and enough capital, we are confident that this model and strategy can be executed for a profit.</p><p><strong><em>5.2. Future Steps</em></strong></p><p>Future steps for the project could include merging the more promising models into a single model or an ensemble of models. More work could also be done on merging datasets of different frequencies. The number of data features could also be reduced using dimensionality-reduction techniques like PCA.</p><p>Check out more information about WSB2 and other ML projects on the <a href="https://utmist.gitlab.io/projects/wallstreetbots2/">UTMIST project page</a>. You can also find a presentation of WSB2 (and other UTMIST projects) <a href="https://www.youtube.com/watch?v=ijAtETerkuw&amp;list=PLXrQMUexJaHo81StPoyT2I9rs4dTpjZCS&amp;index=10">here</a>.</p><h4>6. 
Citation</h4><p><a href="https://ojs.aaai.org/index.php/icwsm/article/view/14550">https://ojs.aaai.org/index.php/icwsm/article/view/14550</a></p><p><a href="https://www.sciencedirect.com/science/article/pii/S104244312030072X">https://www.sciencedirect.com/science/article/pii/S104244312030072X</a></p><p><a href="https://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.2004.00662.x">https://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.2004.00662.x</a></p><p><a href="https://www.sciencedirect.com/science/article/pii/S187775031100007X">https://www.sciencedirect.com/science/article/pii/S187775031100007X</a></p><p><a href="http://www.diva-portal.org/smash/get/diva2:1110776/FULLTEXT01.pdf">http://www.diva-portal.org/smash/get/diva2:1110776/FULLTEXT01.pdf</a></p><p><a href="https://www.sciencedirect.com/science/article/pii/S1568494618306707">https://www.sciencedirect.com/science/article/pii/S1568494618306707</a></p><p><a href="https://doi.org/10.1016/j.ijforecast.2020.02.008">https://doi.org/10.1016/j.ijforecast.2020.02.008</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9487a082880f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How To Write and Submit to deMISTify]]></title>
            <link>https://utorontomist.medium.com/how-to-write-and-submit-to-demistify-d0e55375a89?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/d0e55375a89</guid>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Mon, 03 Oct 2022 03:00:04 GMT</pubDate>
            <atom:updated>2022-10-03T03:16:57.201Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gGyQVTXYUoJcaSM5GEzs4w.jpeg" /></figure><h3><strong>Introduction</strong></h3><p>For those who have been affiliated with UTMIST for a long time, you’re probably aware that we’ve previously published all articles from the technical writing department on our Medium account. Of course, the writer would always be given credit, but this really limited the scope of what the technical writing department could do. As such, we decided to expand our scope this year with the release of our <strong>new publication — deMISTify</strong>! Sharing the same name as our bi-monthly newsletter, <strong>this new format allows <em>anyone</em> to submit their own articles</strong> and have their writing associated with UTMIST!</p><h3>Background Information</h3><p>So what is a Medium publication? Well, Medium defines them as “shared spaces for stories written around a common theme or topic, usually by multiple authors.” Basically, think of it like a journal or collection that a group of writers submit articles to. The nice thing is, it’s a great way to reach a larger audience when you’re just starting out! Assuming people enjoy your writing, that is.</p><p>So, what is the point of this article? Well, in the past, only the technical writers officially hired by UTMIST could write articles. However, we have received feedback and inquiries regarding individuals who may not have time to write regularly, but would still like to contribute. By creating a publication, <strong>anyone, and I mean anyone, can submit their article for everyone to see!</strong> Of course, we will personally be reviewing each one, so please ensure that you <strong>adhere to the following guidelines when writing</strong>. 
We have standards, you know?</p><h3>Guidelines for Submission</h3><p>So if you want your article to be accepted and published, please keep in mind the following:</p><ol><li>Your article should be <strong>related to a topic in machine learning.</strong></li><li>Your article should preferably <strong>be centered around a recent paper</strong> <strong>or topic.</strong> You can give relevant background information from older sources, but no one wants to read about how amazing GPT-3 is for the hundredth time.</li><li>Your article should discuss the background of the topic, how the model or technology works, why it is significant, and what results researchers have found. This is just a guideline, but it’s the general template all our writers follow.</li><li><strong>Include pictures with each section!</strong> At the end, I will include some samples of previously published articles that do this well. Do not write your article like this one.</li><li><strong>Cite your sources!</strong> This is partially to guard against academic plagiarism and whatnot, but also because your readers may want to do some further reading into your discussed topic! Feel free to use any official citation format you want (MLA, APA, IEEE).</li><li>Make sure to <strong>include a nice feature image</strong> at the top of your article. A feature image should tell your reader what your article is going to be about, so make it unique and eye-catching!</li><li><strong>Do not make your article super long!</strong> Medium will give you an estimate of how long of a read your article is. Generally, I recommend keeping it under 6 minutes, unless your topic is particularly complicated. However, if you’re going to write about a complex topic, it’s your responsibility to dumb it down enough for the average reader to understand.</li></ol><h3>Instructions for Submission</h3><p>Here is a guideline on how to start writing!</p><ol><li>Create a Medium account. 
Don’t worry, this is free!</li><li>On the top right, click your profile picture, then “Write a story” to start writing!</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/283/1*Q35sS-1MzxFwzvymGu34_g.png" /></figure><p>*Once you are finished, and this is the important step, <strong>DO NOT CLICK “PUBLISH”! </strong>You have to submit your article to our publication first. Once it is approved, we will publish it for you.</p><p>3. Send the current VP Technical Writing a message, either on Discord or through email (Discord is preferred), with<strong> your Medium username</strong> (e.g. @utorontomist). They will add you as an author to the publication such that you can submit your article.</p><p>4. Once you are added, click the three dots beside “Publish”, then “Add to publication”, select “deMISTify”, and then “Select and continue.” Now, you can click “Publish”, and it will prompt you to “Submit to publication.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/1*Qvr-0nwq_n2ZEIu8oNTClg.png" /></figure><p>5. Now, all you have to do is wait for approval from the VP Technical Writing, and you should be able to see your article once it’s approved!</p><h3>Conclusion</h3><p>Apart from that, if you have any inquiries, feel free to reach out to the current VP Technical Writing. If you don’t know who that is, you can check the Discord server or <a href="https://utmist.gitlab.io/team/technical-writing/">on our website</a>. Happy writing, everyone!</p><h3>Sample Articles</h3><p>These articles were written by technical writers who have been with UTMIST for two years now. 
Feel free to message them for any advice on writing articles.</p><ul><li><a href="https://medium.com/demistify/controllerpose-a-new-solution-for-vr-full-body-tracking-30a13f77e93e">ControllerPose — A New Solution for VR Full-Body Tracking</a> by Charles Yuan</li><li><a href="https://medium.com/demistify/adding-noise-until-the-model-can-generate-my-head-exploding-png-7e31f672dd3c">A Brief Overview of Diffusion Models and DALL-E 2</a> by Shirley Wang</li><li><a href="https://medium.com/demistify/phrase-based-unsupervised-machine-translation-f5e591124274">Phrase-Based Unsupervised Machine Translation</a> by Ellina Zhang</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d0e55375a89" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Wall Street Bots: Building an Automatic Stock Trading Platform based on Artificial Intelligence…]]></title>
            <link>https://utorontomist.medium.com/wall-street-bots-building-an-automatic-stock-trading-platform-based-on-artificial-intelligence-4f49df89ac41?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/4f49df89ac41</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[fintech]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[stock-prediction]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Wed, 22 Jun 2022 02:36:43 GMT</pubDate>
            <atom:updated>2022-06-22T02:36:43.420Z</atom:updated>
<content:encoded><![CDATA[<h3>Wall Street Bots: Building an Automatic Stock Trading Platform based on Artificial Intelligence From Scratch</h3><p><em>A UTMIST Project by Jack Cai, Lisa Yu, Younwoo (Ethan) Choi, Dongfang Cui, Alaap Grandhi, Demetre Jouras, Kevin Mi, Yang Qu, Zhenhai (Joel) Quan, Thomas Verth.</em></p><p>The Wall Street Bots project is a 6-month challenge to build a stock trading platform utilizing data analysis and machine learning methods. The final product aims to enable user authentication, perform trading, build portfolios, and use automatic portfolio balancing with one of our tested strategies on our <a href="http://www.wallstreetbots.org/">web platform</a>. Readers and developers are welcome to try and contribute to this project on our <a href="http://www.wallstreetbots.org/">web platform</a> and <a href="https://github.com/UTMIST/WallStreetBots">GitHub page</a>.</p><p>Creating a stock trading platform based on artificial intelligence (AI) is far more sophisticated than predicting stock prices with numerical methods such as momentum, or with machine learning methods such as deep learning. While the latter is pure data analysis and machine learning, the former covers a much greater scope: not only price prediction, but also portfolio management strategies, market condition analysis, back-testing, data pipelines, and real-time deployment. The Wall Street Bots project sets out to do all of the above — what if we built an AI not from the perspective of a data science challenge, but from that of practicality (does it make money)? What if we did something similar to a hedge fund, but individualized and open source? What level of complexity would we achieve, and how would it perform relative to the market? The Wall Street Bots project aims to achieve all of that (at least partially in some areas). 
In addition, we studied the role of natural language processing in stock price/trend prediction and whether stock news and market sentiment can truly help predict prices. We built a platform that allows users to trade stocks manually via the Alpaca API, or with one of the AI strategies we built.</p><p>In this Medium article, we will walk you through a high-level overview of how we built Wall Street Bots. The article is organized as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wnxCThYxCuClrwP0l3S6rQ.png" /></figure><h4>1. Background</h4><p>Trading stocks using an automatic stock trading platform is nothing new. In fact, banks, hedge funds, and trading firms have been using similar algorithmic trading methods for ages. Ever since the advent of digitalized market order flow, the idea of algorithmic trading has become more and more relevant. Since these trading entities must constantly execute large quantities of orders accurately, it seems far more reasonable and efficient to leave the task to a machine algorithm rather than to humans. However, when algorithmic trading is used by these mega-corporations in practice, it creates many potential issues. For example, the 1987 stock market crash and the 2010 flash crash are widely speculated to have been caused by large-scale orders placed by algorithmic trading machines. In addition, these machine learning algorithms are often not accessible to the public. If consumers choose to put their money in one of these banks or hedge funds, a portion of their profit is often taken away from them. Moreover, the investor often has no control over what kind of trading strategy the algorithm employs. One might ask, “why would anyone choose their own stock trading strategy when they can just leave it to the Wall Street analysts?” The answer to this question is simple. 
As more and more regular people enter the stock market in the post-Covid-19 era, retail investors now have more power to influence the market than ever before. The influx of investors and the tension between global powers, compounded by the high growth of tech companies during the pandemic, have created one of the most volatile markets in the past 100 years. Because of this, we believe that an individualized, open-source, automatic trading platform like WallStreetBots could potentially be another tool for common investors and traders to exert their influence in the market and generate profit. In events like the GME short squeeze, these investors and traders have already proved they can punish bad trading decisions made by institutional trading firms. With a platform like WallStreetBots, these traders and investors will have the same kind of tool that institutional firms use to facilitate trading, while also making it more consistent and accurate.</p><h4>2. The Platform</h4><p><strong><em>2.1. Alpaca</em></strong></p><p>Before starting to build a stock trading platform, there are a few issues to consider. First, how can we get consistent real-time quote data? How do we take care of different order types, store stock information, and make the simulation as close to real life as possible? Luckily, Alpaca’s paper trading API takes care of all of those issues so that we can focus on building the strategy and pipeline. Alpaca offers free API keys for paper trading that allow users to place orders, access account information, and retrieve stock information via the Python Alpaca API library. The Wall Street Bots platform is built on the Alpaca API. Users can retrieve their free Alpaca API key and secret key from Alpaca’s <a href="https://alpaca.markets/">official website</a> and input them into the dashboard page. Whenever a user places an order at wallstreetbots.org, the same information is updated on Alpaca’s paper trading side, and vice versa.</p><p><strong><em>2.2. 
Tools and Library</em></strong></p><p>The web app was built on the core technologies of Docker, Python + Django, and PostgreSQL. Python was chosen because it is the de facto language for machine learning, and keeping the entire codebase in one language is a huge advantage. Docker helps with deployment through reproducible builds; exactly how it is used is elaborated on in the deployment section. PostgreSQL was chosen for its first-party support in Django, and it is now the default choice for many web applications unless it cannot meet a specific need. The project originally might have required lots of time-series data for ML training, but that part was sliced off, and CSVs were used instead. Keeping this data out of the database kept complexity down and decreased implementation time. The frontend of the web app is rendered with the Django templating engine — this keeps out most of the complexity that React/Angular/Vue would bring. As this project was mostly time-constrained, working with the smallest amount of setup possible was a priority when choosing all of these technologies.</p><p><strong><em>2.3. Web App Structure</em></strong></p><p>The web app is roughly split into the basic <a href="https://en.wikipedia.org/wiki/Create,_read,_update_and_delete">CRUD functionality</a> of the dashboard, for showing portfolios and entering orders, and the actual stock trading engine that makes the decisions. This keeps all the Django functionality separate from the machine learning parts. The CRUD part is further divided into different Django apps for convenience: the homepage, along with its required models, views, and routes, is kept apart from the user data, which in turn is kept apart from how portfolios and stocks are stored and represented and from the routes that modify them.</p><p>Inside the machine learning half of the codebase, there are strategies for each of the ways the portfolio can be rebalanced. 
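As a sketch of this plug-in design, each strategy can inherit from a generic base class and override a single rebalancing method; the class and method names below are illustrative assumptions, not the actual Wall Street Bots code.

```python
from abc import ABC, abstractmethod

class BaseStrategy(ABC):
    """Generic strategy; subclasses override only the rebalancing logic."""

    @abstractmethod
    def target_weights(self, portfolio):
        """Return target portfolio weights (ticker -> fraction summing to 1)."""

class EquallyWeightedStrategy(BaseStrategy):
    """Naive strategy: equal weight for every held ticker."""

    def target_weights(self, portfolio):
        n = len(portfolio)
        return {ticker: 1.0 / n for ticker in portfolio}

# Usage: the pipeline asks the strategy for target weights, then computes
# the buys/sells needed to move the current holdings toward those targets.
strategy = EquallyWeightedStrategy()
weights = strategy.target_weights({"AAPL": 0.7, "MSFT": 0.2, "GOOG": 0.1})
```

Adding a new strategy then only means writing one new subclass, which is what makes the pipeline easy to extend.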
These strategies are fed into the generic pipeline that is called by the web app every day to automatically rebalance for each user. Each new machine learning strategy inherits from a generic strategy and overrides its rebalancing method. This ensures that adding new strategies is as easy as possible. Below is the file tree for the project.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rW7spzIHNFpPg2-jh8MJXQ.png" /><figcaption>Top-level file tree for the Wall Street Bots web app and trading model deployment.</figcaption></figure><p><strong><em>2.4. Trading Pipeline</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M2UxAvxNv2oeacIjY71obw.png" /><figcaption>Trading strategy pipeline structure for the Wall Street Bots.</figcaption></figure><p>The above diagram shows the pipeline of our automatic stock trading system. We use historical prices along with indicators to train regression models that predict future stock prices and volatility. These outputs are fed into portfolio balancing algorithms to rebalance the portfolio. Each component is independent of the others, which means different pipelines can be combined to create new strategies. Strategies implemented by the Wall Street Bots team strictly follow the above structure. This enables upscaling of future strategies and a cleaner codebase.</p><p>The Wall Street Bots project implemented a Hidden Markov Model (HMM) and various deep neural networks (DNNs), along with natural language processing (NLP), for price and volatility prediction, and two portfolio balancing strategies: a naive equally weighted portfolio and a Monte Carlo simulation to maximize the Sharpe ratio.</p><p><strong><em>2.5. Deployment and Automation</em></strong></p><p>The entire web app is dockerized, which makes deployment very easy. There are only two Docker containers used, one with the Django app and one with a prebuilt PostgreSQL database. The containers are orchestrated with docker-compose. 
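A two-container compose file of the kind described might look roughly like this; the service names, image tag, and port are assumptions for illustration, not the project’s actual configuration.

```yaml
# docker-compose.yml (illustrative sketch, not the project's actual file)
services:
  web:
    build: .                # the Django app image
    env_file: .env          # secret keys kept out of the repository
    ports:
      - "8000:8000"
    depends_on:
      - db
  db:
    image: postgres:14      # prebuilt PostgreSQL database
    env_file: .env
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```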
Again, this was chosen for simplicity over Kubernetes or similar tools. It’s hosted on DigitalOcean using the smallest droplet size. The project is cloned from GitHub, and the secret keys are added to the .env files. The droplet itself has a reverse proxy to redirect HTTP/HTTPS traffic to the web app container, and the firewall rules are changed to allow this outside traffic in and out of those ports.</p><p>Importantly, none of the actual machine learning needs to take place on the droplet itself. All the training is done on local, offline machines. This keeps production simple and, most importantly, allows the project to use the most affordable droplet option available.</p><p>The automation is handled inside the web app itself. At a fixed time on each trading day, the rebalancing pipeline is triggered. For each user, it resyncs the Alpaca account in case it has changed, finds the appropriate rebalancing strategy, figures out what needs to be bought or sold to reach the target percentages, and executes those trades using the Alpaca platform.</p><h4>3. Data Collection</h4><p>Data is an essential part of any analysis task. The Wall Street Bots project collects market data, including stock prices, index prices, stock news, investor comments, and fundamentals, from a variety of sources. The sources and methods used are listed below.</p><p><strong><em>3.1. Alpaca Market Data</em></strong></p><p>Alpaca offers stock and index quotes and historical data for all US exchanges. In addition, Alpaca offers a news API for historical and real-time news headlines. Although not available in large quantities, these news headlines are used in combination with other sources for sentiment analysis and NLP tasks.</p><p><strong><em>3.2. New York Times News Archive API</em></strong></p><p>The New York Times offers a free news API for parsing NYT historical news headlines.</p><p><strong><em>3.3. 
Finviz News Headlines</em></strong></p><p>Finviz does not offer any APIs but is a good source of stock headlines. Due to the simplicity of the website, scraping headlines from Finviz is relatively easy and is done with the Python Beautiful Soup library. We wrote a simple web scraping script that collects the latest 100 news items per stock on Finviz. The script is available <a href="https://github.com/caixunshiren/finviz-data">here</a>.</p><p><strong><em>3.4. r/wallstreetbets Daily Discussion Comments</em></strong></p><p>We access investor comment data via Reddit’s r/wallstreetbets discussion threads. Scraping the Daily Discussion Thread comments from r/wallstreetbets is a lot more complicated than scraping Finviz headlines, for two reasons. First, each daily discussion thread contains roughly 10k comments per day. We scraped r/wallstreetbets in the time frame from 2020–01–01 to 2022–05–13, and even taking only the top 1.5k comments per day added up to more than a million comments in total. Second, while the Reddit PRAW API allows one to get all comments of a thread given the thread’s identifier, there is no pattern in each daily discussion thread’s identifier. To overcome this issue, we adopted an approach similar to [<a href="https://algotrading101.com/learn/reddit-wallstreetbets-web-scraping/">1</a>], in which we programmed the Selenium library’s Chrome driver to automatically search the thread name and look for the specific class tag in the HTML holding the unique thread identifier. We then saved those identifiers to a CSV and used Reddit’s PRAW API to get the top 1.5k most upvoted comments for each thread. The full script is available <a href="https://github.com/caixunshiren/RedditCommentScraper_for_WSB_Daily_Discussion">here</a>.</p><p><strong><em>3.5. RSS Feeds</em></strong></p><p>RSS feeds from various news sources were also pulled. Mainly headlines and short bodies were of interest. 
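The kind of Beautiful Soup headline scraping described in section 3.3 can be sketched as follows; the HTML snippet and the table id below are made up for the example, and Finviz’s real markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched news-table page
html = """
<table id="news-table">
  <tr><td>09:30AM</td><td><a href="/news/1">Stock X beats earnings</a></td></tr>
  <tr><td>08:15AM</td><td><a href="/news/2">Stock X announces buyback</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headlines = [a.get_text() for a in soup.select("#news-table a")]
print(headlines)  # ['Stock X beats earnings', 'Stock X announces buyback']
```

In a real scraper, the `html` string would come from an HTTP request for a ticker’s Finviz page, subject to the site’s terms of use.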
These sources were often traditional newspapers such as The Economist or The New York Times. Each RSS feed is polled for new additions. In practice, this was not an effective source of data: the sources varied too much from one another to be grouped into a single dataset, articles arrived less frequently than from other sources, and historical data was difficult to capture. This source was not used to train any of the models.</p><h4>4. The Strategies</h4><p><strong><em>4.1. Stock Closing Price Prediction with NLP</em></strong></p><p>2021 was a phenomenal year, not only because the market reached an all-time high post-pandemic, but also because the rise of meme stocks, platforms like r/wallstreetbets, and individual traders reminded us that the market is not solely influenced by banks, trading firms, and large hedge funds, but also by individual traders and public sentiment around the market. The idea that the market is correlated with public sentiment is not new. In fact, a famous 2011 study by Bollen et al. [<a href="https://arxiv.org/pdf/1010.3003.pdf">2</a>] revealed a correlation between the Dow Jones Industrial Average (DJIA) and Twitter sentiment measured by Google’s GPOMS model, which outputs sentiment across six categories. Specifically, the “calmness” category revealed the largest correlation, while the other categories revealed almost none. The study demonstrated a striking 86.7% trend prediction accuracy using linear regression that combines past prices and the calmness score to predict DJIA’s next-day closing price, while the moving-average baseline achieved 73.3%. Despite the strong performance, keep in mind that the testing period spans only 15 trading days and was chosen for its low DJIA volatility and the absence of large unexpected socio-economic events. 
In reality, unexpected events frequently occur and skew stock prices away from predictions, and moving-average trend prediction accuracy is bounded around 50% as opposed to the 73.3% in the study, especially when the market is volatile.</p><p>Unfortunately, we could not directly adopt Bollen et al.’s approach for Wall Street Bots because the GPOMS model used in the study is closed source. Instead, we turned our attention to the large language models developed in recent years thanks to transformers and attention mechanisms. The specific model we use is BERT [<a href="https://arxiv.org/pdf/1810.04805.pdf">3</a>], which stands for Bidirectional Encoder Representations from Transformers. We do not go into the details of how BERT works in this article, but one of its key functions is to encode a line of text into a 768-D vector, called a sentence embedding vector. A sentence embedding vector captures the meaning of a sentence, and while it does not directly tell us the calmness of a sentence, a neural network trained on these embeddings certainly can. We used FinBERT [<a href="https://arxiv.org/pdf/1908.10063.pdf">4</a>], a fine-tuned version of BERT trained on financial data. Therefore, instead of using a calmness score from Twitter data to do linear regression, we trained deep neural networks (DNNs) with FinBERT embeddings on a variety of text sources, combined with past prices and indicators such as VIX and QQQ, to predict the next-day closing price. (Note that FinBERT directly classifies text into three sentiment scores: positive, neutral, and negative. We extract the FinBERT embedding as the &lt;CLS&gt; token embedding of the last hidden state.)</p><p><em>4.1.1. Trend Prediction Multi-layer Perceptron Based on Stock News Headlines</em></p><p>The first model we tried was a multi-layer perceptron (MLP) to predict the trend of a stock (whether it goes up or down) solely based on headlines. 
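</p><p>As a minimal sketch (not the project’s actual code), the up/down labeling of each headline against a price series could look like this; all names here are hypothetical:</p>

```python
import numpy as np

def label_headlines(headline_times, price_times, prices):
    """Label each headline 1 (up) or -1 (down) by comparing the last price
    before the headline with the first price at or after it.
    `headline_times` and `price_times` are sorted arrays of timestamps."""
    labels = []
    for t in headline_times:
        i = np.searchsorted(price_times, t)  # first price at/after the headline
        if i == 0 or i == len(prices):
            continue  # no price on one side of the headline: skip it
        labels.append(1 if prices[i] > prices[i - 1] else -1)
    return labels
```

<p>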
We chose this as the starting point because stock news headlines are the easiest data to retrieve, via web scraping from Finviz and the Alpaca News API. We selected 50 NASDAQ growth stocks with similar fundamentals and gathered their headlines from 2016 to 2022, obtaining a total of roughly 100,000 headlines. The train/test sets are the FinBERT embeddings of these headlines, and the labels are whether the stock price goes up or down after the headline is released (1 = up, -1 = down). We tried computing labels over both the hour and the day surrounding each news release. However, no matter what training techniques or hyperparameters we used, we could not bring the test loss down. The figure below shows a typical loss vs. iteration curve for this model on one of the training sessions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/956/1*II9fJvP0SAgvHdW6CATULg.png" /><figcaption>Train loss and test loss for trend prediction MLP.</figcaption></figure><p>The figure above shows a typical overfitting problem, but downsizing the model and adding regularization such as dropout did not make it any better: the training loss ended up rising along with the test loss. After more research, we realized that something is fundamentally wrong with the approach. First, it is almost impossible to predict the price from news. By the efficient market hypothesis, when news comes out, the stock price reflects it almost immediately. After all, as a side project, we have neither the resources nor the zero-latency infrastructure to react faster than the market. In reality, Wall Street analysts anticipate the news, which is a totally different problem and a much harder one to learn. Furthermore, we did not include past prices and other indicators in the training set; this was a naive omission because stock trends cannot be separated from their prices. For instance, if a stock is already highly overvalued, it is likely to go down even if the news is positive. 
We gave up on this approach and instead looked into r/wallstreetbets comments for price/trend prediction.</p><p><em>4.1.2. GME Price Prediction MLP, CNN, and LSTM with r/wallstreetbets Daily Discussion Comments</em></p><p>Reddit comments, on the other hand, are a much better source of investor sentiment than news headlines. First, unlike news headlines, which come from a few outlets, Reddit comments have diverse sources that represent a broader population. Second, Reddit comments contain more emotion (easier for sentiment analysis) than news headlines, which are mainly statements of fact. The r/wallstreetbets daily discussion thread is where most investors in r/wallstreetbets express their opinions about the market, making it a great source for sentiment analysis. While r/wallstreetbets investors represent only a small portion of the investor population, we believe that they represent a much bigger portion of meme stock investors. For this reason, we used the Reddit comments to predict the GME price rather than indices that track the general market, such as SPY and NASDAQ.</p><p>Like the previous approach, we run FinBERT on each Reddit comment and extract the embeddings. But this time, we consider past prices, FinBERT sentiment outputs, the SPY index, the VIX index, and the NASDAQ index. The overall train/test set features and labels are shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tt-Nu8ovZeNkNoBH3BH7Yw.png" /></figure><p>We collected ~1.1 million comments across 598 trading days, from January 1st, 2020 to May 13th, 2022. 
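</p><p>Concretely, each comment becomes one 777-dimensional input vector: the 768-dimensional FinBERT embedding plus the market and comment features. A hedged sketch of the assembly (the helper and its argument names are ours, not the project’s):</p>

```python
import numpy as np

def comment_features(embedding, sentiment, prev_close, prev_vxx, prev_spy,
                     prev_qqq, upvotes, hour):
    """Stack a 768-dim FinBERT embedding, 3 sentiment scores, 4 previous
    closes (GME, VXX, SPY, QQQ), the upvote count, and the posting hour
    into one 777-dim feature vector (768 + 3 + 6 = 777)."""
    assert len(embedding) == 768 and len(sentiment) == 3
    extras = [prev_close, prev_vxx, prev_spy, prev_qqq, upvotes, hour]
    return np.concatenate([np.asarray(embedding), np.asarray(sentiment),
                           np.asarray(extras, dtype=float)])
```

<p>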
The input features include the previous close price, the positive/neutral/negative sentiment scores of the Reddit comments given by the FinBERT model, the previous VXX close (as a substitute for VIX), the previous SPY close, the previous QQQ close (as a substitute for NASDAQ), the number of upvotes of the Reddit comment, the hour of the day the comment was posted, and the 768-dimensional FinBERT embedding of the comment. The prediction target is the stock’s next closing price. We tested the models listed below, some with data augmentation using price data from other stocks such as AMC, TSLA, AAPL, and KOSS, tickers that are frequently mentioned in r/wallstreetbets or highly correlated with GME.</p><p><strong>N-day Moving Average (Benchmark)</strong>: We use the past N-day moving average as the next predicted price for GME, and we selected the N that yields the highest trend prediction accuracy (50.0%) and the lowest Mean Absolute Percentage Error (MAPE) (6.24%) as the benchmark model.</p><p><strong>Multi-layer Perceptron (MLP)</strong>: Using the full features (𝑁 × 777) as inputs, the 3-layer MLP model with ReLU activations tries to predict the normalized daily return based on the next closing price. To avoid look-ahead bias, we split the data into training and validation sets chronologically (80% and 20%). The MLP model reached a validation trend prediction accuracy of 54.3% and a MAPE of 10.87%. These results are worse than the moving-average benchmark strategy. This may be because each individual comment is treated as a single data point, so multiple comments with different sentiments within one day all have the same target return for the model to predict. This input/target relationship only makes sense if all the comments each day have very similar sentiments.</p><p><strong>Convolutional Neural Network (CNN)</strong>: We first consider GME price data only. 
To address the issue raised in the previous MLP model section, we concatenated the input features of all the comments on the same day into a matrix of dimensions 500 × 777 (since we scraped 500 comments each day) and treated each of these “daily” matrices as an individual data point with one corresponding target return. The final dataset has dimensions 606 × 500 × 777 after dropping the days with fewer than 500 comments. We split the data randomly to get the training and testing sets (80% and 20%). The final CNN, with two convolutional layers and two dense layers with max-pooling and ReLU activations, achieved a validation trend prediction accuracy of 49.5% and a MAPE of 14.49%. Again, this model underperformed the moving-average benchmark. This is likely because, after grouping the data points by comment date, the new dataset for GME is only of size 606, which is too small for the model to train on.</p><p><strong>CNN with Data Augmentation</strong>: We tried to solve the problem of insufficient data by augmenting the GME data with price data from different stocks whose returns have a high correlation with GME returns. This led us to develop the CNN model with data augmentation. We picked AMC, KOSS, TSLA, and AAPL, and performed the train-test split chronologically (80%, 20%). With the same CNN model, we achieved a validation trend prediction accuracy of 58.1% and a MAPE of 8.7%. The trend prediction accuracy greatly outperformed the moving-average benchmark.</p><p><strong>Long Short Term Memory (LSTM)</strong>: An LSTM model was created with the full features (𝑁 × 777) in the training set X, and we averaged each day’s comment embeddings and sentiment scores, creating a total of 611 data points. Due to the lack of data points and the high-dimensional features, Principal Component Analysis (PCA) was used to reduce the FinBERT embedding to 20 dimensions, retaining 63.2% of the variance. Despite the dimension reduction, the test loss was not able to converge. 
The overall best validation trend accuracy is 51.0%, with a MAPE of 29.52%.</p><p><strong>LSTM Distilled</strong>:<br>A distilled version of the LSTM was created without the FinBERT embeddings, using only 8 features per time step. The test loss was able to converge, and the best model has 256 hidden dimensions and uses the past N=10 data points to predict the next closing price. The overall best validation trend accuracy is 61.8%, with a MAPE of 6.20%.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GlAvVJzXKZUaVB6avw4Gcw.png" /><figcaption>Train and Test Loss (left) and Ground Truth and Predicted Price (Right) for LSTM Distilled.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Kalncb3BSF7kl5YhBBxKzg.png" /><figcaption>Flow chart for models tested.</figcaption></figure><p>A comparison of the above models is shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*15ih8HBntDFIqpbScpvDvA.png" /></figure><p>Overall, our best model (LSTM Distilled) achieved significantly greater validation trend accuracy (61.8% vs. 50.0%) than the benchmark, with a slight improvement in MAPE (6.20% vs. 6.24%). GME is a volatile stock, so predicting prices close to the target is challenging, which explains why the MAPE is so hard to decrease. Still, a trend accuracy of 61.8% may be enough to build a portfolio management algorithm with a positive expected return in the long run. Note that compared to Bollen et al.’s study, our validation period is significantly longer (120 days vs. 15 days) and our benchmark accuracy is lower (50.0% vs. 73.3%) for the stock and time period we picked.</p><p><strong><em>4.2. Stock daily opening price prediction with HMM</em></strong></p><p>A hidden Markov model (HMM) is a sequential model built on the Markov assumption and is broadly applied in time series modeling, e.g., disease progression and text segmentation. 
The Markov assumption states that each hidden state depends only on the previous hidden state, and each observation depends only on the current hidden state. The interpretation of the hidden state depends on the context of the problem; here, it can be interpreted as the underlying stock market conditions. In the Wall Street Bots project, we use each day’s intraday minute prices to predict the next day’s opening price.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AAjLbWxo0E3NgiITznzJ5A.png" /><figcaption>Hidden Markov model structure.</figcaption></figure><p>The above figure illustrates a standard HMM (in the context of stock opening price prediction):</p><ul><li>Z_(1), …, Z_(t+1) represent the hidden states, which are representations of hidden stock market conditions.</li><li>Y_(1), …, Y_(t) represent the current day’s closing prices on a minute basis, and Y_(t+1) represents the next day’s opening price.</li><li>Arrows represent dependencies.</li></ul><p><strong>Implementation: </strong>Here, we assume that the stock price at each point depends only on its current hidden state and that each hidden state depends only on the previous hidden state. We use the <a href="https://github.com/hmmlearn/hmmlearn">hmmlearn</a> implementation for the model architecture. To build our training data, we standardized the same-day closing prices on a minute basis by subtracting the first closing price. We construct one time series as the current day’s closing prices on a minute basis, followed by the next day’s opening price. The ultimate goal is to predict the next day’s opening price using the previous day’s minute-level closing prices. The rationale is that we assume the pattern stays the same during market hours and after hours. As the market opens, the stock price stays in the same hidden state as the last closing price. 
Note that it is not possible to trade after market hours, so during deployment we run the HMM right before the market closes and assume that, in the remaining time, the stock price remains stable within the same hidden state.</p><p>There are several reasons to use an HMM here. First, HMMs have shown strong inference and prediction power in much other research and many use cases, so we believe an HMM can also perform well in the stock price prediction scenario. Second, an HMM outputs a distribution of predicted values. Under certain model configurations, it outputs a normal distribution with a mean equal to the expected value of the stock and a variance proportional to its volatility. Thus, we are able to learn the market fluctuation from the model and use it directly in portfolio balancing algorithms such as the Monte Carlo search for maximizing the Sharpe ratio.</p><p><strong>Result: </strong>We evaluate the model performance using two metrics: Mean Square Error (MSE) and trend prediction accuracy. We define trend prediction accuracy as the proportion of correct predictions of the next day’s stock trend (up or down), and MSE as usual:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/678/1*Lxzzn9Buc9hSj1gVsjcvLA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Qj4Y9clwJ9lT35OP0_FGWg.png" /><figcaption>Opening price prediction by HMM on MSFT from 2019 to 2021.</figcaption></figure><p>During evaluation, we used Microsoft data from 2019 to 2021 and trained an HMM with 15 hidden states. The blue plot represents the true price, and the red plot represents the prediction outputs from the model. The HMM makes predictions very close to the true values, with an MSE of 21.0 compared to 3623.6 for the benchmark moving-average model. With this setting, we achieve 66.2% accuracy in trend prediction. 
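</p><p>The two metrics can be computed directly; a minimal sketch, with variable names of our own choosing:</p>

```python
import numpy as np

def mse(pred, true):
    """Mean squared error between predictions and ground truth."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.mean((pred - true) ** 2))

def trend_accuracy(pred, true, prev):
    """Fraction of days where the predicted direction relative to the
    previous price matches the true direction."""
    pred, true, prev = map(np.asarray, (pred, true, prev))
    return float(np.mean(np.sign(pred - prev) == np.sign(true - prev)))
```

<p>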
To show that the prediction is meaningful, we compare it with the proportion of days with an uptrend (53.2%), showing that the HMM trend prediction greatly outperforms naive guessing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S34V9AxQODFn6TCOJsDaTQ.png" /></figure><p><strong>Remarks</strong>: One limitation of our current design is that only the stock price is used as a feature. To fully utilize the data we have, the next step is to implement an HMM with multivariate features, where each observation depends on both the current hidden state and the multivariate features.</p><p>While the HMM seems to outperform all the DNN models with a trend accuracy of 66.2%, it is important to keep in mind that the HMM only predicts the next day’s opening price, while the DNNs predict closing prices. These two types of models are not comparable, as predicting the next day’s closing price is a harder problem with more randomness. Still, the HMM remains a powerful algorithm, as our results show.</p><p><strong><em>4.3. Portfolio Balancing with Predicted Stock Prices</em></strong></p><p>Given our predicted next-day closing and opening prices, and assuming we only take long positions, we built our trading strategy as follows:</p><p><em>4.3.1. Equally Weighted Portfolio</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tACMiE3YcYDlID9LNucucQ.png" /></figure><p><em>4.3.2. Monte Carlo Simulation for Maximum Sharpe Ratio Portfolio</em></p><p>The Sharpe ratio is a measure of a portfolio’s risk-adjusted return. It has the formula</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z3ESRomXkMDFkwftgbTv9w.png" /></figure><p>Monte Carlo simulation is a common method for portfolio optimization. We approximate the max Sharpe portfolio by randomly assigning portfolio weights, each between 0 and 1, such that the weights of all the stocks sum to 1. 
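</p><p>A minimal numpy sketch of the whole search (the function name, the annualized inputs, and the 2% risk-free rate are illustrative assumptions, not the project’s exact code):</p>

```python
import numpy as np

def max_sharpe_weights(mean_returns, cov, n_trials=5000, rf=0.02, seed=0):
    """Sample random normalized weight vectors and keep the one with the
    highest Sharpe ratio, given expected returns and a covariance matrix."""
    rng = np.random.default_rng(seed)
    mean_returns = np.asarray(mean_returns)
    best_w, best_sharpe = None, -np.inf
    for _ in range(n_trials):
        w = rng.random(len(mean_returns))
        w /= w.sum()                      # weights are in (0, 1) and sum to 1
        ret = w @ mean_returns            # expected portfolio return
        vol = np.sqrt(w @ cov @ w)        # portfolio volatility
        sharpe = (ret - rf) / vol
        if sharpe > best_sharpe:
            best_w, best_sharpe = w, sharpe
    return best_w, best_sharpe
```

<p>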
Then we calculate the expected return and variance of the portfolio.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/464/1*DxP6h2-mNRlCH87FMDDkWw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/638/1*S4DyvlJmBdCERLQhnXeQcw.png" /></figure><p>Again, we use historical price data to estimate the future variance/covariance matrix.</p><p>We then calculate the portfolio’s Sharpe ratio with the above formula, assuming a risk-free rate of 2%. We repeat this process thousands of times to find the portfolio weighting that maximizes the Sharpe ratio. See the graph below for an example Monte Carlo analysis; the max Sharpe portfolio is labeled in red and the minimum variance portfolio in blue.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OttgnrSv6dndUITEQIvtxA.png" /><figcaption>Example graph for Monte Carlo analysis from [5]</figcaption></figure><p>After getting the optimal portfolio weights for the stocks, we can place trade orders equivalent to the difference between the optimal weights and our current portfolio weights to rebalance our portfolio each day.</p><h4>5. Demonstration</h4><p>This section demonstrates all the components of Wall Street Bots working together as a whole. It is also a step-by-step guide on how to use the web app to try out the strategies.</p><ol><li>First, you will need to create a free account at <a href="https://alpaca.markets/">https://alpaca.markets/</a>.</li><li>After you create your free account, navigate to the paper trading dashboard at <a href="https://app.alpaca.markets/paper/dashboard/overview">https://app.alpaca.markets/paper/dashboard/overview</a>.</li><li>Beside <strong>Your API Keys</strong>, click the <strong>View</strong> button and then click <strong>Regenerate Key</strong>.</li><li>Copy the <strong>Key ID</strong> and <strong>Secret Key</strong>.</li><li>Now navigate to <a href="http://wallstreetbots.org/">http://wallstreetbots.org/</a>. 
Follow the prompts and create an account. Log in and you will be directed to the dashboard page.</li><li>Paste your <strong>Key ID</strong> and <strong>Secret Key</strong> in the <strong>Alpaca ID</strong> and <strong>Alpaca Key</strong> fields respectively, then click <strong>UPDATE CREDENTIAL</strong>. (Note: by doing this step, you agree to share your Alpaca API credentials with the Wall Street Bots database.)</li><li>You are all set! Now you can place orders, view stock information and portfolio history, and build your own portfolio on the Position page.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hRYP0hNuy9X_9fiUKXOyCg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bgiAoB_UVuHsgYc1APgMwA.png" /></figure><p>8. Of course, don’t forget to choose one of our pre-built strategies to automatically balance the portfolio for you.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VUIYQC58WOtgO8czLrtalw.png" /></figure><h4>6. Closing Remarks</h4><p>In conclusion, predicting stock movement is hard, and building a platform that integrates various models and strategies and executes orders on behalf of users is even harder. In this project, we went through all the steps to build a “mini hedge fund”: from selecting and verifying strategies to software development and deployment. We hope this article inspires you to do something similar or even to consider contributing to this open source project. Wall Street Bots is a continuing effort, and new features and strategies will be developed in the future. Stay tuned!</p><p>Once again, you are welcome to check out our codebase in our GitHub repository <a href="https://github.com/UTMIST/WallStreetBots">here</a> or try out our website at wallstreetbots.org. Collaboration in the form of pull requests is also welcome.</p><p>Thanks for reading!</p><p><strong>References</strong></p><p>[1] Smith, A. Reddit Wallstreetbets — Web Scraping Tutorial. 
<a href="https://algotrading101.com/learn/reddit-wallstreetbets-web-scraping/">https://algotrading101.com/learn/reddit-wallstreetbets-web-scraping/</a></p><p>[2] Bollen, J., Mao, H., Zeng, X. Twitter mood predicts the stock market. <a href="https://arxiv.org/pdf/1010.3003.pdf">https://arxiv.org/pdf/1010.3003.pdf</a></p><p>[3] Devlin, J., Chang, M., Lee, K., Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. <a href="https://arxiv.org/pdf/1810.04805.pdf">https://arxiv.org/pdf/1810.04805.pdf</a></p><p>[4] Araci, D. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. <a href="https://arxiv.org/pdf/1908.10063.pdf">https://arxiv.org/pdf/1908.10063.pdf</a></p><p>[5] Ahmed, S. How to Construct an Efficient Portfolio Using The Modern Portfolio Theory in Python?<br><a href="https://towardsdatascience.com/how-to-construct-an-efficient-portfolio-using-the-modern-portfolio-theory-in-python-5c5ba2b0cff4">https://towardsdatascience.com/how-to-construct-an-efficient-portfolio-using-the-modern-portfolio-theory-in-python-5c5ba2b0cff4</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4f49df89ac41" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Application of Deep Learning Model to Specify Cardiovascular Diseases via Analyzing ECG Diagrams]]></title>
            <link>https://utorontomist.medium.com/an-application-of-deep-learning-model-to-specify-cardiovascular-diseases-via-analyzing-ecg-diagrams-e5360723cd57?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/e5360723cd57</guid>
            <category><![CDATA[healthcare]]></category>
            <category><![CDATA[ecg]]></category>
            <category><![CDATA[medical-imaging]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Wed, 22 Jun 2022 00:23:54 GMT</pubDate>
            <atom:updated>2022-06-22T02:40:50.952Z</atom:updated>
<content:encoded><![CDATA[<p><em>A UTMIST Project by Hanni Ahmed, Maanas Arora, Nabil Mohamed, Thomas Zeger, (Timothy Kwong), and Yan Zhu.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*szg81WEJwmvDJFtPshNJig.png" /><figcaption>* Reference for image: A. J. Huber, A. H. Leggett, and A. K. Miller, “Electrocardiogram: Blog chronicles value of old, but still vital cardiac test,” Scope, 19-Dec-2017. [Online]. Available:<br><a href="https://scopeblog.stanford.edu/2016/09/21/electrocardiogram-blog-chronicles-value-of-old-but-still-vital-cardiac-test/">https://scopeblog.stanford.edu/2016/09/21/electrocardiogram-blog-chronicles-value-of-old-but-still-vital-cardiac-test/</a>. [Accessed: 24-May-2022].</figcaption></figure><h4>Introduction</h4><p>Cardiovascular disease (CVD) is the leading cause of death in the world. According to the World Health Organization, about 17.9 million people died from CVDs in 2019, representing 32% of all global deaths [1]. While there are many types of CVDs, we focus on heart arrhythmias — one of the most common cardiac abnormalities among all the CVD categories.</p><p>Heart arrhythmia is characterized by irregular heartbeats and is caused by dysfunction of the heart’s electrical system. The diagnosis of arrhythmia is based on comparing patients’ ECG recordings with normal healthy ECGs and identifying abnormal heart rhythms [2]. An electrocardiogram (ECG) measures the electrical activity of the heart; it is helpful to doctors as it can help diagnose arrhythmias non-invasively and decide on the best course of treatment. Unfortunately, distinguishing heartbeats on ECGs can be time-consuming and difficult: the collected signals usually contain a lot of noise, which significantly increases the time required for analyzing the ECG morphology and the probability of misinterpretation [3][4]. 
Thus, automating the classification process can help doctors identify which type of ECG pattern a patient has more quickly and accurately.</p><p>This motivates our project to develop supervised machine learning models for the classification of ECG heartbeats. Many researchers believe that healthcare and biomedical problems can often be solved more efficiently by applying machine learning strategies. Learning how to use computation to aid health science is worthwhile, as it may become one of the main research directions in the future. In particular, we draw upon two different methods from the literature [3] and [5] to develop convolutional neural networks that classify heartbeats from ECG recordings. We are interested in comparing the two different architectures, evaluating their advantages and drawbacks, and seeking the model that performs best for heart arrhythmia classification.</p><p><strong><em>Key Words</em></strong>: <em>Arrhythmia, Heartbeat Classification, Convolutional Neural Network, Long Short-Term Memory, Deep Learning</em></p><h4>Background: What Is an ECG</h4><p>Heartbeats are primarily controlled by a group of autorhythmic cells in the heart’s right atrium, namely the sinoatrial node (or SA node). The electrical signals generated by the SA node spread through the entire heart and result in regular muscle contraction/relaxation cycles. An ECG measures and records the electrical potential changes of these signals. A healthy heart should have a normal rate (between 60–100 cycles per minute) and a constant pattern that contains the P wave, the QRS complex, and the T wave. These waves correspond to the contraction and relaxation of the heart’s atria and ventricles. 
Many cardiac diseases can be identified from ECGs; for instance, Figure 2 shows the ECGs of some typical CVDs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/902/1*DAPfPWb8GBcgHlMzZhsKOA.png" /><figcaption>Figure 1: Segments in an ECG signal [12]</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GXS0Rmks1ehr_gF2BtD8gg.png" /><figcaption>Figure 2: Normal ECG vs. ECGs for CVDs [12]</figcaption></figure><h4>Related Works</h4><p>Among deep learning architectures, the CNN is the most commonly used model for biomedical image processing. It is computationally cheap compared to conventional deep neural networks and is good at analyzing spatial information [6]. It has been widely applied to medical image classification, such as classifying lung diseases on chest X-ray diagrams and brain tumors on Magnetic Resonance Imaging (MRI) images [7]. Recently, CNNs have also become popular in the domain of ECG analysis and classification. The following two prior works show how CNNs are applied to such tasks, and they also inspired our project:</p><ol><li><strong><em>A Deep Convolutional Neural Network Model To Classify Heartbeats</em></strong> by U. Rajendra Acharya et al. [3]</li></ol><p>This article proposed a 9-layer CNN model to detect cardiac abnormalities (especially arrhythmia) and classify the heart arrhythmias into five categories: non-ectopic (N), supraventricular ectopic (S), ventricular ectopic (V), fusion (F), and unknown (Q). Figure 3 illustrates the CNN structure, and Figure 4 shows detailed information on the different heartbeat categories. The input ECGs are filtered to remove high-frequency noise, and the results have shown that the model is robust enough to handle the noisy dataset. The model was trained on the open-source PhysioBank MIT-BIH Arrhythmia database [8]. 
When it was trained with the original, non-processed dataset (imbalanced, as the majority of the data belongs to the N class), the accuracy of the CNN was 89.07% and 89.3% in noisy and noise-free ECGs, respectively. When the CNN was trained using the augmented data, the accuracy increased to 94.03% and 93.47% in original and noise-free ECGs, respectively.</p><p>This work motivates our first approach to the classification problem, as the CNN model is easy to implement while producing acceptable accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KFCG2ELdEUtuX3rpJjaPAw.png" /><figcaption>Figure 3: CNN model layers [3]</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8Zn7yfR5EjzGjob5qlbsSw.png" /><figcaption>Figure 4: Explanations for the five categories [3]</figcaption></figure><p>2. <strong><em>Automated Atrial Fibrillation Detection using a Hybrid CNN-LSTM Network on Imbalanced ECG Datasets</em></strong> by Georgios Petmezas et al. [5]</p><p>Although the CNN is well known for its outstanding ability to process spatial information, it cannot process temporal data well. Therefore, while the first article proposed an easy and intuitive solution, some researchers argue that combining a CNN with an RNN model, such as the LSTM, would further improve the model’s accuracy and robustness.</p><p>For instance, in Automated Atrial Fibrillation Detection using a Hybrid CNN-LSTM Network on Imbalanced ECG Datasets, Petmezas et al. proposed “a hybrid neural model utilizing focal loss, an improved version of cross-entropy loss, to deal with training data imbalance”. The spatial features are initially extracted via a Convolutional Neural Network (CNN) and are then fed to a Long Short-Term Memory (LSTM) model for temporal dynamics memorization. Instead of classifying the heartbeats into the five classes defined in the first related work, Petmezas et al. 
classified ECGs into four rhythm types, namely normal (N), atrial fibrillation (AFIB), atrial flutter (AFL) and AV junctional rhythm (J). The model was trained on the MIT-BIH Atrial Fibrillation Database [8][9] and achieved a sensitivity of 97.87% and a specificity of 99.29% using a ten-fold cross-validation strategy.</p><p>[note: Atrial fibrillation is also a type of heart arrhythmia]</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uQ_bn2SihqmBeoCC69DOGw.png" /><figcaption>Figure 5: CNN-LSTM architecture from Automated Atrial Fibrillation Detection using a Hybrid CNN-LSTM Network on Imbalanced ECG Datasets [5]</figcaption></figure><h4>Dataset</h4><p>The dataset we used for this project is the PhysioBank MIT-BIH Arrhythmia database [8]. “It contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the BIH Arrhythmia Laboratory between 1975 and 1979” [8], and it has been widely used in biomedical and machine learning projects. It is also the dataset used in the first related work.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9mLbUXARInx7BO5Li-JfOw.png" /><figcaption>Figure 6: ECG samples for the five categories we aim to classify [3]</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i6ihRwpLp-Rm8HjuXgRmag.png" /><figcaption>Figure 7: Sample ECGs from the MIT-BIH Arrhythmia database [8]</figcaption></figure><h4>Preprocessing</h4><p>The data was preprocessed in two steps. In the first step, a wavelet transform was used to denoise the signal. 
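This denoising step can be sketched with the PyWavelets library; the universal soft-threshold rule below is a common default and our own assumption, not necessarily the exact rule the project used:

```python
import numpy as np
import pywt  # PyWavelets

def denoise_ecg(signal, wavelet="db6", level=4):
    """Daubechies-6 wavelet denoising: decompose, soft-threshold details, reconstruct."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # noise level estimated from the finest detail coefficients (assumed rule)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    # threshold only the detail coefficients; the approximation keeps the waveform
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

np.random.seed(0)
t = np.linspace(0, 1, 1024)
noisy = np.sin(2 * np.pi * 5 * t) + 0.2 * np.random.randn(1024)
clean = denoise_ecg(noisy)
```

Thresholding only the detail coefficients suppresses high-frequency noise while the approximation coefficients preserve the overall waveform.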
In the second step, we used the Pan-Tompkins algorithm to detect R peaks and segment the ECG signal into heartbeats.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LB9Im0ZRpE0AhxqBAO9VZA.png" /><figcaption>Figure 8: The original input ECG diagram</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0isA_3X_gTqEzMVtGtpB-g.png" /><figcaption>Figure 9: ECG after denoising and oversampling</figcaption></figure><p><strong><em>Signal Denoising: Daubechies-6 Wavelet Decomposition Transform</em></strong></p><p>The wavelet transform we performed was a Daubechies-6 wavelet decomposition transform. “The Daubechies wavelets are a family of orthogonal wavelets defining a discrete wavelet transform and characterized by a maximal number of vanishing moments for some given support” [13]. The transformed signal is largely noise-free yet retains the diagnostic information contained in the original ECG [3][10], and can therefore convey the heartbeat information more precisely.</p><p><strong><em>Peak Detection: Pan-Tompkins Algorithm</em></strong></p><p>To avoid relying on the dataset’s expert-labeled R-peak locations, and to generate results closer to a live detection environment, we used the Pan-Tompkins algorithm to detect R peaks and segment the ECG signal into individual heartbeats [3].</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K6cWVYkuMKL7YwhoEH1oRw.png" /><figcaption>Figure 7: Visualization of the Pan-Tompkins algorithm for R-peak detection</figcaption></figure><p>The algorithm works as follows: a derivative of the ECG signal was computed with the difference coefficients [1, -2, 2, 1]. On the moving average of this derivative, all points at which the signal moved from a negative derivative to a positive derivative (i.e., peaked) were located and marked as QRS complexes. 
These QRS complex locations were used to extract QRS intervals; the maximal point of each signal interval was marked as a potential R peak. Using an adaptive noise algorithm described in [3][11], detected peaks that fell below a signal threshold were removed as false R peaks. Finally, to avoid detecting T waves, any two detected peaks that occurred within a period shorter than a refractory period were scanned for the larger peak, and the other peak was removed. The resulting R-peak detections were used to segment the ECG signal into individual heartbeats to be used as model inputs.</p><p><strong><em>Addressing Class Imbalance</em></strong></p><p>Once these two steps had been performed, the class imbalance issue had to be resolved: around 90% of the heartbeat sequences obtained from the MIT-BIH Arrhythmia database belong to the non-ectopic (N) class. Balancing the classes is important for preventing overfitting and improving the model’s sensitivity scores, so we attempted two approaches to addressing the class imbalance in the dataset: generation of synthetic data and oversampling. We chose to stick with the latter since it is much simpler to implement and our model achieved similar, if not better, performance. It must be noted that this data augmentation was performed only on the training set. 
This ensures that our test set contains only real heartbeat sequences present in their usual distributions and thus serves as a fair measure of performance in a practical setting.</p><p><em>Generation of Synthetic Data</em></p><p>We generated new synthetic heartbeat sequences by adding random Gaussian noise to existing sequences belonging to minority classes.</p><p><em>Oversampling</em></p><p>This method addresses class imbalance by simply duplicating examples of underrepresented classes in the dataset.</p><h4>Network Architecture</h4><p><strong><em>Model 1: Convolutional Neural Network</em></strong></p><p>The convolutional neural network we implemented was based on the aforementioned paper by Acharya et al. (Related Work 1). There are three one-dimensional convolutional layers, each of which uses a Leaky-ReLU activation and is followed by a max-pooling layer. The convolutional layers are important for picking up the key features of the heartbeat sequences, such as the QRS complexes, as well as the locations of different kinds of arrhythmia. Afterwards, the output feature map is passed into a 3-layer fully-connected network, also using Leaky-ReLUs, followed by a softmax in the final layer. The exact model architecture, with the kernel sizes used for the convolutional layers, is portrayed in the diagram below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SnN_Eu7NGzANrMpnTcaYFg.png" /><figcaption>Figure 8: The layers in the CNN model</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VerIAR-D3aMaCnB8cJvh9g.png" /><figcaption>Figure 9: Visualization of the CNN model [3]</figcaption></figure><p><strong><em>Model 2: CNN-LSTM Combination</em></strong></p><p>While the convolutional layers are suited to extracting local features of the data, they are not adequate for analyzing long-distance relations present in sequential data. 
This is because, in CNNs, long-distance relations are lost due to the local nature of the convolution kernels.</p><p>For that reason, when dealing with sequential data, recurrent neural network (RNN) models are preferred over CNNs.<br>However, as the results of the first model show, local features are still important when dealing with ECG data. Furthermore, the convolution kernels can learn to denoise the data, eliminating the need for the denoising preprocessing step, which an RNN could not do. For those reasons, the convolutional layers are still desirable.<br>To leverage the inductive biases of both architectures, our CNN-LSTM model first applies convolutional layers to the data for denoising and local feature extraction. The extracted features are then fed into LSTM layers so that the sequential nature of the data is exploited.<br>The model comprises three initial 1D convolutional layers followed by an LSTM layer and two dense linear layers. All layers use the leaky ReLU activation function. To learn more about the CNN-LSTM model, please refer to Figure 5.</p><p>The model architecture is presented in detail in Figure 10, shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nblFhc5O6gGp4nG6vtrf4Q.png" /><figcaption>Figure 10: The layers in the CNN-LSTM combination model</figcaption></figure><h3>Methodology</h3><p>We trained the CNN and CNN-LSTM models on the preprocessed dataset for 20 epochs, using categorical cross-entropy loss and the Adam optimizer with a learning rate of 0.001 and beta values of 0.9 and 0.999. 
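To make the optimizer configuration concrete, a single Adam update with these hyperparameters can be written out in plain NumPy (a textbook sketch with a toy gradient; not the project's training code):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * g           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)           # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(3)                         # toy parameters
m, v = np.zeros(3), np.zeros(3)         # moment accumulators
g = np.array([0.5, -1.0, 2.0])          # toy gradient
w, m, v = adam_step(w, g, m, v, t=1)
# on the first step each weight moves by ~lr opposite the sign of its gradient
```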
The training-testing split was 90%-10%.</p><p>To compare and evaluate the two models, we ran four experiments, each involving training and testing:</p><ol><li>CNN model on the denoised dataset</li><li>CNN-LSTM model on the denoised dataset</li><li>CNN model on the noisy dataset</li><li>CNN-LSTM model on the noisy dataset</li></ol><p>The training loss and accuracy for models 1 and 2 on the denoised and noisy datasets are illustrated in Figures 11 to 18.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aGgP_jap4A9PRNczoCGWNQ.png" /><figcaption>Figure 11: Training loss for CNN on denoised dataset (0.006459 after 20 epochs)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*98B8363ankA2MOhNH9QDmw.png" /><figcaption>Figure 12: Training accuracy for CNN on denoised dataset (99.93% after 20 epochs)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s2HfEfieNnyMNgadQzmPSw.png" /><figcaption>Figure 13: Training loss for CNN-LSTM on denoised dataset (0.014350 after 20 epochs)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*73H0GPEcof2YgnKxGIghPA.png" /><figcaption>Figure 14: Training accuracy for CNN-LSTM on denoised dataset (99.82% after 20 epochs)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0IdMgNdD96OiBF_W3SaWdA.png" /><figcaption>Figure 15: Training loss for CNN on noisy dataset (0.006163 after 20 epochs)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nUi8xUrEBNikHGuPaJ_jqA.png" /><figcaption>Figure 16: Training accuracy for CNN on noisy dataset (99.94% after 20 epochs)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*RSt2X356ZGT6GR8MHV73Zw.png" /><figcaption>Figure 17: Training loss for CNN-LSTM on noisy dataset (0.013887 after 20 epochs)</figcaption></figure><figure><img 
alt="" src="https://cdn-images-1.medium.com/max/1024/1*AMomp1efyBOAlV7PDbB5vQ.png" /><figcaption>Figure 18: Training accuracy for CNN-LSTM on noisy dataset (99.83% after 20 epochs)</figcaption></figure><h4>Results &amp; Discussion</h4><p><strong><em>Comparison with Acharya et al.</em></strong></p><p>It is worth mentioning that our CNN model outperformed the model proposed by Acharya et al. in terms of accuracy on both the original and denoised test sets (see Table 1 below). Furthermore, denoising the ECG waveforms has little effect on model accuracy for both implementations, which indicates the robustness of the models and agrees with the conclusion made by Acharya et al. In addition, Acharya et al. used stochastic gradient descent for training, while we found that using Adam further improves training and testing accuracy with the same network trained for the same number of epochs.</p><p>It is important to note that this perceived improvement in accuracy is partially due to the fact that the test set used by Acharya et al. also contained synthetically generated data. This allowed them to ensure that both the test and training sets had a balanced class distribution. However, this is not a rigorous evaluation method, because synthetic data derived from training examples can appear in the test set (and vice versa), causing the model, to some extent, to lose generality. As mentioned earlier, we applied oversampling only to the training set, which means our test set has relatively more examples from the non-ectopic (N) class than from any other class. Because of this, reaching high accuracy on the test set may have been easier for our models (which had more N-class examples to train on) than for those of Acharya et al. 
However, the data distribution in our test set better represents reality, and the accuracy we obtained is more indicative of the model’s performance in real-world applications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TW7GnfAAWPIAs9Ly23e24A.png" /><figcaption>Table 1: Reported accuracies for our CNN and CNN-LSTM models on normal and denoised train and test sets. To compare the effect of noise in the input dataset, we also include the results for inputs without denoising. The negligible change in accuracy after denoising indicates that the transformation had a negligible effect on model performance.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EihX0cbPN61gvil9IryG5A.png" /><figcaption>Table 2: Reported test accuracies for the CNN and CNN-LSTM models on different arrhythmia classes.</figcaption></figure><p><strong><em>Comparison with Petmezas et al.</em></strong></p><p>Since Petmezas et al. used a different dataset in the second related work, it is harder for us to conclude whether our model beats their performance. Due to constraints on compute resources, our team was limited to running experiments on the MIT-BIH Arrhythmia dataset (which has ~24 hours of ECG recording data) with 20 epochs of training. Because of this, a rigorous and fair comparison between our model and that of Petmezas et al. (who used ~230 hours of training data and trained for ~100 epochs) is not possible.</p><p><strong><em>Comparison of CNN and CNN-LSTM</em></strong></p><p>When the two models are evaluated on the same dataset with the same classification scheme, the CNN model exhibits higher training and testing accuracy on both the denoised and noisy datasets. Table 2 also shows the test accuracies of the two models for all five arrhythmia classes. The CNN model beats the CNN-LSTM model in every class except class F, where it underperforms by less than 3%. 
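The class-wise comparisons rest on standard confusion-matrix counts; sensitivity, specificity, and PPV per class can be computed as follows (a generic NumPy sketch; the function name, variable names, and toy matrix are ours):

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = count of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)        # true positives per class
    fn = cm.sum(axis=1) - tp              # missed members of the class
    fp = cm.sum(axis=0) - tp              # other classes predicted as it
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)          # a.k.a. recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                  # positive predictive value
    return sensitivity, specificity, ppv

cm = np.array([[90,  5,  5],              # toy 3-class confusion matrix
               [10, 80, 10],
               [ 0, 10, 90]])
sens, spec, ppv = per_class_metrics(cm)
```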
Furthermore, our simple CNN model is about 2x faster to train than the CNN-LSTM model, implying that the CNN-LSTM approach is more computationally expensive. 
</p><p>Therefore, according to our experimental results, a CNN model may be more suitable for classifying ECG diagrams, as it is easy to implement and robust to the noise in the dataset.</p><p>For the sake of completeness, we also compare the average sensitivity, average specificity, and average PPV of the two models we implemented and the CNN model proposed by Acharya et al. The results are summarized in Table 3 below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mrqqJ4vIU1gsDAgh0Ge-_w.png" /><figcaption>Table 3: Comparison of average sensitivity, specificity, and positive predictive value (PPV) across different models.</figcaption></figure><p>Sensitivity, specificity, and PPV are metrics for evaluating model performance (higher values are better). They are defined as [5]:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BIUGeIAB32n_1qSseDqHMA.png" /></figure><h4>Conclusion</h4><p>In this project, we implemented two deep learning models that take ECG signals as input and classify heartbeats into five categories. Among heart diseases, we focused on cardiac arrhythmia and used the PhysioBank MIT-BIH Arrhythmia database for model training and final evaluation. The results show that both models perform similarly well and are robust to the noise in the dataset. Despite having fewer layers, the CNN model produces slightly higher accuracy overall and in most arrhythmia classes. After hyperparameter tuning and optimizer selection, our CNN model outperforms the model proposed by Acharya et al., achieving test accuracies of 98.32% and 98.19% on the denoised and original datasets, respectively.</p><h4>References</h4><p>[1] “Cardiovascular diseases (CVDs),” World Health Organization, 11-Jul-2021. [Online]. 
Available:<br><a href="https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).">https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).</a> [Accessed: 24-May-2022].</p><p>[2] “Arrhythmias — what is an arrhythmia?,” National Heart Lung and Blood Institute, 24-Mar-2022. [Online]. Available: <a href="https://www.nhlbi.nih.gov/health/arrhythmias/">https://www.nhlbi.nih.gov/health/arrhythmias/</a>. [Accessed: 24-May-2022].</p><p>[3] U. R. Acharya, S. L. Oh, Y. Hagiwara, J. H. Tan, M. Adam, A. Gertych, and R. S. Tan, “A deep convolutional neural network model to classify heartbeats,” Computers in Biology and Medicine, vol. 89, pp. 389–396, 2017.</p><p>[4] Z. Faramand, S. O. Frisch, A. DeSantis, M. Alrawashdeh, C. Martin-Gill, C. Callaway, and S. Al-Zaiti, “Lack of significant coronary history and ECG misinterpretation are the strongest predictors of undertriage in prehospital chest pain,” Journal of Emergency Nursing, vol. 45, no. 2, pp. 161–168, 2019.</p><p>[5] G. Petmezas, K. Haris, L. Stefanopoulos, V. Kilintzis, A. Tzavelis, J. A. Rogers, A. K. Katsaggelos, and N. Maglaveras, “Automated atrial fibrillation detection using a hybrid CNN-LSTM network on imbalanced ECG datasets,” Biomedical Signal Processing and Control, vol. 63, p. 102194, 2021.</p><p>[6] C. Cao, F. Liu, H. Tan, D. Song, W. Shu, W. Li, Y. Zhou, X. Bo, and Z. Xie, “Deep learning and its applications in biomedicine,” Genomics, Proteomics &amp; Bioinformatics, vol. 16, no. 1, pp. 17–32, 2018.</p><p>[7] S. S. Yadav and S. M. Jadhav, “Deep convolutional neural network based medical image classification for disease diagnosis,” Journal of Big Data, vol. 6, no. 1, 2019.</p><p>[8] A.L. Goldberger, et al., PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (23) (2000) e215–e220.</p><p>[9] G.B. Moody, R.G. 
Mark, A new method for detecting atrial ﬁbrillation using R-R intervals, Computers in Cardiology 10 (1983) 227–230.</p><p>[10] B.N. Singh, A.K. Tiwari, Optimal selection of wavelet basis function applied to ECG signal denoising, Digit. Signal Process. A Rev. J. 16 (3) (2006) 275–287.</p><p>[11] J. Pan, W.J. Tompkins, A real-time QRS detection algorithm, IEEE Trans. Biomed. Eng. BME-32 (3) (1985) 230–236.</p><p>[12] L. Sherwood, Human physiology: From cells to systems. Boston, MA, USA: Cengage Learning, 2016.</p><p>[13] “Daubechies wavelet,” Wikipedia, 28-Nov-2021. [Online]. Available: <a href="https://en.wikipedia.org/wiki/Daubechies_wavelet#:~:text=The%20Daubechies%20wavelets">https://en.wikipedia.org/wiki/Daubechies_wavelet#:~:text=The%20Daubechies%20wavelets</a><br>%2C%20based%20on,moments%20for%20some%20given%20support. [Accessed: 24-May-2022].</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e5360723cd57" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Handwriting Recognition Using Deep Learning]]></title>
            <link>https://utorontomist.medium.com/handwriting-recognition-using-deep-learning-14ec078872b0?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/14ec078872b0</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[neural-network-algorithm]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[text-recognition]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Sun, 12 Jun 2022 17:52:55 GMT</pubDate>
            <atom:updated>2022-06-22T00:43:25.980Z</atom:updated>
            <content:encoded><![CDATA[<p><em>A UTMIST Project by Justin Tran, Fernando Assad, Kiara Chong, Armaan Lalani, and Eeshan Narula.</em></p><h4>Introduction</h4><p>Recognizing handwritten text is a long-studied problem in machine learning, with one of the most well-known datasets being the MNIST [1] for handwritten digits. While recognizing individual digits is a solved problem, scientists have been looking for ways of recognizing the full corpus of text at once, as it makes digitizing documents easier. We present a solution to this problem of Handwritten Text Recognition (HTR), together with an overview of current advances in the field.</p><h4>Motivation and Goal</h4><p>Recognizing written text is key to many applications that depend on digitizing documents, including healthcare, insurance, and banking. The biggest challenge arises due to the high variation of styles in the way people write, especially in cursive form. While many software applications already implement HTR (e.g. photos on iPhones) these are far from perfect, and research in the field is still very active.</p><p>The project aims to develop an algorithm that can read a corpus of text by segmenting it into lines and applying a transformer-based implementation of HTR [2]. This method was developed and published last year and was shown to outperform the then state-of-the-art models, notably CRNN implementations like that of Shdiel et al. [1].</p><h4>Related Work</h4><p>HTR is a sub-problem of Optical Character Recognition (OCR) — converting typed or handwritten text from images into machine-encoded text. OCR systems are divided into two modules: a detection module and a recognition module. 
The detection module aims to localize blocks of text within the image via an object detection model, while the recognition module aims to understand and transcribe the detected text.</p><p>Typically, text recognition is achieved through a combined CNN-RNN architecture, where CNNs are used for image understanding and RNNs for text generation. However, in more recent studies, the use of the Transformer architecture has shown significant improvements in text recognition models. This has led to the development of hybrid architectures, where CNNs still serve as the backbone of the model.</p><p>To further improve text recognition, recent studies have investigated the ability of Transformers to replace these CNN backbones. This has resulted in the development of higher-accuracy models such as TrOCR, an end-to-end Transformer-based architecture that features both pre-trained image and text Transformers.</p><h4>Dataset</h4><p>We used the IAM dataset [3] to train and test our model. It includes 13,353 images of handwritten text lines, which serve as the inputs to our HTR algorithm (word- and document-level images are also provided). Bounding boxes for the text in each document are also provided and were used to train the segmentation step of our model. The data are freely available.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*llOWF8pB4QaGiLVpJayVDg.png" /></figure><h4>Network structure (Segmentation)</h4><p>The algorithm used to perform word segmentation is based on the paper ‘Scale Space Techniques for Word Segmentation in Handwritten Documents’ [5]. We initially attempted a ResNet architecture to accomplish this task, but issues with training forced us to pivot to an alternative method. The algorithm proposed in the paper considers the scale-space behavior of blobs in images containing lines. 
The underlying concept of this algorithm is the utilization of Gaussian filters to generate an image’s scale space, which involves developing a family of signals where details are progressively removed.</p><p>In order to accurately segment the lines within a document, projection profile techniques were utilized. The vertical projection profile is defined by summing the pixel intensity values along a particular row. The lines are identified by determining local peaks in the projection profiles after applying Gaussian smoothing to deal with false positives.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oGGo7DJYdCmtKO51K0ly8g.png" /></figure><p>After line segmentation, second-order differential Gaussian filters are applied along both orientations of the image in order to form a blob (a connected region) representation of the image. By convolving the image with the second order differential Gaussian, a scale-space representation is created. The blobs in this representation appear brighter or darker than the background after this convolution. By altering the parameters of the filters, the blobs will transition from characters to words.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Eaf2jL2ro9sM5G0-2fgCwg.png" /></figure><h4>Network Structure (Classification)</h4><p>The architecture we used was based on the paper “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models” by Li et al. [2]. We implemented the architecture for classification after the segmentation step. It is composed of an encoder and a decoder which are described below.</p><p>The image is first resized and broken into a batch of smaller 16x16 patches, which are then flattened to form what we call patch embeddings. Based on the position of each patch, they are also given a position embedding. These inputs are then passed into a stack of encoders composed of a multi-head self-attention module and a feed-forward network. 
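The resize-and-patch step can be sketched in NumPy (grayscale for simplicity; the 384×384 input and 16×16 patch sizes follow the text, while the helper name is ours):

```python
import numpy as np

def to_patches(img, p=16):
    """Split an (H, W) image into flattened p x p patch vectors."""
    h, w = img.shape
    assert h % p == 0 and w % p == 0
    return (img.reshape(h // p, p, w // p, p)
               .transpose(0, 2, 1, 3)     # order patches by grid position
               .reshape(-1, p * p))       # flatten each patch row-major

img = np.arange(384 * 384, dtype=np.float32).reshape(384, 384)
patches = to_patches(img)
# a 384x384 image yields 24 x 24 = 576 patches of 256 values each
```

Each flattened patch is then projected and summed with a position embedding before entering the encoder stack.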
This process is shown in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yCoAnJAjspTdltTkMYQ0cA.png" /><figcaption>Image retrieved from [2] explaining the classification network.</figcaption></figure><p>The outputs of the encoder are then fed into a decoder architecture, which is almost the same as the encoder architecture. The difference lies in adding an “encoder-decoder” attention module between the two original encoder components. The embedding from the decoder is then projected into the size of the vocabulary, and after applying the softmax function and beam search, we determine the final output.</p><h4>Training</h4><p>Because the segmentation was done without machine learning, only the classification network needed to be trained. Input sentence lines were first resized to 384x384 resolution so they could be divided into 16x16 patches. We split the IAM dataset into 90% training data and 10% validation data. For the test set, a set of 5 full-page handwritten documents was given to the model, and the output of the classification network was evaluated using the Levenshtein distance [4] divided by the length of the ground truth label. This metric will be referred to as the character error. The testing pipeline begins with the segmentation algorithm so that the model is evaluated in its entirety.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dwY2NV7NadAs68iP0uJx3w.png" /><figcaption>Recursive implementation of the Levenshtein distance of two strings a and b.</figcaption></figure><p>The network was trained using cross-entropy loss and the Adam optimizer, with a learning rate of 0.00001 and a 0.0001 weight decay. The inverse square root schedule is used as the learning rate scheduler. The initial warmup learning rate is 1e-8 with 500 warmup updates. The final hyperparameters are the number of epochs and batch size, which are set at 50 and 2, respectively. 
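The character-error metric, edit distance divided by ground-truth length, can be sketched as follows (an illustrative implementation, not the project's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Iterative dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def character_error(prediction: str, label: str) -> float:
    """Edit distance normalized by the length of the ground-truth label."""
    return levenshtein(prediction, label) / len(label)

# character_error("Hyundai", "Honda") == 3 / 5 == 0.6
```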
Due to financial constraints, we could not conduct a hyperparameter search.</p><h4>Results</h4><p>Overall, our project was able to predict the text of the test documents fairly well. The test documents were read with an average character error of 0.17. For comparison, a CRNN approach we tried yielded a character error of 1.16. For reference, if the model output the word “Hyundai” when the label was “Honda,” the character error would be 0.6. During classification training, the final average training cross-entropy loss was 0.118, and the final average validation loss was 0.386.</p><p>However, it should be noted that the data used were either pre-segmented text lines (for classification training) or documents whose text lines were reasonably level, i.e., parallel to the page and not slanted. The model does not perform as well when the text lines are not level, due to the limitations of the segmentation algorithm.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L_jPpDsv5lnoCdhK9PxdpA.png" /><figcaption>Plot of the training and validation loss over epochs</figcaption></figure><h4>Conclusion</h4><p>The goal of our project was to create a model that takes in a handwritten document and outputs its text such that the original handwritten content is discernible. The low character error on the test set, as well as the low validation and training errors, demonstrates that our model accomplished this goal with some limitations.</p><p>Limitations of our model include being unable to segment text lines that are not level with the page. Another major limitation is the inability to classify text lines that contain too much noise, such as a non-white background. For future work, we would like to improve the segmentation portion, perhaps with a state-of-the-art segmentation neural network, and employ filtering techniques to remove excessive noise.</p><h4>References</h4><p>[1] Deng, L. (2012). 
The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), 141–142.</p><p>[2] Li, Minghao &amp; Lv, Tengchao &amp; Cui, Lei &amp; Lu, Yijuan &amp; Florencio, Dinei &amp; Zhang, Cha &amp; Li, Zhoujun &amp; Wei, Furu. (2021). TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models.</p><p>[3] U. Marti and H. Bunke. The IAM-database: An English Sentence Database for Off-line Handwriting Recognition. Int. Journal on Document Analysis and Recognition, Volume 5, pages 39–46, 2002.</p><p>[4] Gooskens, Charlotte &amp; Heeringa, Wilbert. (2004). Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data. Language Variation and Change. 16. 189–207. 10.1017/S0954394504163023.</p><p>[5] Manmatha, R. and Srimal, N., n.d. Scale Space Technique for Word Segmentation in Handwritten Documents. University of Massachusetts.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=14ec078872b0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Employee Attrition Factors Prediction]]></title>
            <link>https://utorontomist.medium.com/employee-attrition-factors-prediction-e17eaa327ca8?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/e17eaa327ca8</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[employee-retention]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[clustering]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Sun, 12 Jun 2022 16:33:00 GMT</pubDate>
            <atom:updated>2022-06-12T16:33:00.188Z</atom:updated>
<content:encoded><![CDATA[<p>A UTMIST Project by Maliha Lodi, Afnan Rahman, Kevin Qu, Omer Raza Khan, Yinan Zhao, and Matthew Zhu.</p><h4>Project Overview</h4><p>An organization is only as good as its employees, as they provide the company with its competitive advantage. As a result, it is imperative that companies understand how to retain their top talent. Using machine learning and data analytics, this project aims to identify areas of improvement across all departments of a company by determining factors that might be responsible for employee attrition in different teams.</p><p>To date, we are not aware of any company using an automated process to predict employee attrition, yet organizations incur large tangible and intangible costs from high employee turnover. We therefore constructed a machine learning pipeline for predicting employee attrition, identifying combinations of factors that may be responsible for significant turnover and providing insights Human Resources teams can use to improve retention.</p><h4>Dataset</h4><p>The data used for this project is from the <a href="https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset">IBM HR Analytics Employee Attrition &amp; Performance Dataset</a>. It contains 35 categorical and numerical features for 1470 unique employees, each working in one of nine job roles, along with a label indicating whether the employee has quit. Since the project relies heavily on supervised learning, these labels are crucial.</p><h4>Data Preprocessing</h4><p>Before feeding our data into our models, we preprocessed it so the models could properly learn the patterns present in the data, enhancing overall performance. Our dataset contained many categorical variables, which we transformed into numerical data using one-hot encoding. 
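As a sketch of this encoding step, assuming pandas (the toy column names are illustrative, not taken from the dataset):

```python
import pandas as pd

# toy frame standing in for the HR data; column names are illustrative
df = pd.DataFrame({
    "Department": ["Sales", "R&D", "Sales", "HR"],
    "Attrition":  ["Yes", "No", "No", "Yes"],
})

# one binary indicator column per category, with no implied ordering
encoded = pd.get_dummies(df, columns=["Department"])
```

`get_dummies` replaces the categorical column with one indicator column per category while leaving the other columns untouched.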
We chose this specific encoding method to avoid introducing a relationship between the values of the variables; we simply wanted to convert them to a numerical format.</p><p>While initially exploring the data, we also found that our dataset was imbalanced: there were drastically more employees who <strong>did not</strong> quit their job than those who did. We used the Synthetic Minority Oversampling Technique (SMOTE) to address this imbalance by artificially growing the number of employees who quit. This technique was chosen because the research papers we referenced also utilized it, and because we did not want to shrink the dataset by deleting rows where an employee stayed with the company.</p><p>The final preprocessing step involved removing irrelevant variables, such as employee number, and aggregating common variables like daily_rate, hourly_rate, and monthly_rate into one column.</p><h4>Classifiers</h4><p>The project begins with five binary classifiers that predict whether an employee will quit their job. The models are Logistic Regression, Decision Tree, Random Forest, XGBoost, and Support Vector Machine (SVM). These models were chosen because past research has used them to address similar classification problems.</p><p><strong><em>Logistic Regression</em></strong></p><p>Logistic Regression is a machine learning model that determines the probability of two possible outcomes in a binary classification setting. For feature selection, a wrapper-based automatic feature selection method was used. Additionally, the hyperparameters of the logistic regression model were tuned using random search.
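A rough sketch of this wrapper feature selection plus random-search setup, using scikit-learn on synthetic stand-in data (the post does not specify the exact search space, so the parameters below are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed attrition data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Wrapper method: greedily keep the features that improve CV accuracy.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=5, cv=3
)
X_selected = selector.fit_transform(X, y)

# Random search over the regularization strength C.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": np.logspace(-3, 3, 20)},
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X_selected, y)
print(X_selected.shape, search.best_params_)
```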
The final model achieved an accuracy of 91.89%, which is better than the <a href="https://ieeexplore.ieee.org/abstract/document/9033784">research paper</a>’s accuracy of 85.48%.</p><p><strong><em>Support Vector Machine (SVM)</em></strong></p><p>The Support Vector Machine (SVM) classifier plots data points in high-dimensional space and tries to find a hyperplane that best separates the classes; the data points closest to this hyperplane are called the “support vectors.” To refine our model, we used a Sequential Forward Selector with the wrapper method to achieve an accuracy of 88%, cutting the data down to around 20 features. To further improve the model, we ran hyperparameter tuning with grid and random search. We reached a maximum accuracy across all validation and test sets of 92%, which is a 7% increase over the <a href="https://ieeexplore.ieee.org/document/9362933">reference paper</a>.</p><p><strong><em>Decision Tree</em></strong></p><p>Decision Trees are widely used in both regression and classification problems; they use a tree-like structure that repeatedly splits the input data into two subsets at each node. Seventeen features were selected in the feature selection stage, conducted with a Sequential Forward Selector, increasing the baseline model accuracy by about 6%. Hyperparameter tuning did not improve the model metrics, leaving the best possible model with a final accuracy of 88.8%, about 9% better than the <a href="https://ieeexplore.ieee.org/abstract/document/9033784">research paper referenced</a>.</p><p><strong><em>Random Forest</em></strong></p><p>The random forest model is an ensemble of multiple decision trees: each tree’s prediction is recorded, and the prediction with the most votes becomes the model’s final output. To train the model, 15-fold cross validation was selected, which provided an accuracy of 93%.
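The 15-fold evaluation described here can be sketched roughly as follows (scikit-learn on synthetic stand-in data; the real input was the preprocessed IBM HR dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed attrition data.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# With a smaller dataset, a larger k keeps each training split close to
# the full dataset size, at the cost of noisier per-fold estimates.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=15)
print(len(scores), round(scores.mean(), 3))
```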
This 93% accuracy is almost a 10% increase over the <a href="https://ieeexplore.ieee.org/abstract/document/9033784">reference paper</a>. A value of k=15 was chosen due to the smaller size of the dataset. As feature selection and hyperparameter tuning both decreased the accuracy, no such fine-tuning techniques were included in the final model training.</p><p><strong><em>XGBoost</em></strong></p><p>The XGBoost (eXtreme Gradient Boosting) model is also a tree-based ensemble method, one step up from the random forest model. It builds decision trees sequentially, each new tree correcting the errors of the previous ones, combining these weak learners into a collectively stronger model. For feature selection, XGBoost’s feature importance technique was used, which discarded 18 features. Using the selected features, the model was trained with 15-fold cross validation, and Random Search CV was used to fine-tune it. The final model yielded 92.73% accuracy, around 3% higher than the <a href="https://ieeexplore.ieee.org/document/9362933">reference paper</a>.</p><p>The table below summarizes the steps taken to produce the best results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1yl944d1JGj7aN2T7b_alg.png" /><figcaption>Table 1: Classifier Results.</figcaption></figure><h4>Clustering</h4><p>Through clustering, we identified common trends within certain groups of employees; in particular, which distinct “clusters” of employees tend to quit their jobs.</p><p><strong><em>K-Means</em></strong></p><p>As our data was a mix of both categorical and continuous data with a high number of features, we used K-Means clustering on our one-hot encoded data.</p><p><strong><em>t-SNE</em></strong></p><p>To visualize the clusters, we reduced the number of features using t-SNE and grouped the data using K-Means. t-SNE (t-distributed Stochastic Neighbour Embedding) is an unsupervised, dimensionality-reduction algorithm for visualizing high-dimensional data.
It is similar to PCA, but while PCA seeks to maximize variance by preserving large pairwise distances among data points, t-SNE only preserves small pairwise distances as a measure of similarity. In this way, t-SNE uses a probabilistic approach to capture the complex, non-linear relationships among features, while PCA uses a linear, deterministic approach. As such, t-SNE was able to capture similarities among data points and reduce computation costs for K-Means, which is used to group the employees predicted to quit. The figure below demonstrates the results of running t-SNE and K-Means with three clusters for all models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-6P9HoC2Us9LKs6BprO0KA.png" /><figcaption>Figure 1: Visualization of all the t-SNE plots for all models</figcaption></figure><p><strong><em>Comparison among clusters</em></strong></p><p>The cluster analysis was done to find the combinations of dominant features within each cluster that would lead to employee attrition. Below is a general summary of all the clusters identified from the employees that our classifiers predicted would quit.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hR_I__R2B5bjotq1_tielQ.png" /><figcaption>Figure 2: Visualizing how the numerical features (Age, Distance from Home, and Average Remuneration Rate) differ across the three clusters</figcaption></figure><p><strong><em>Cluster 0</em></strong></p><p>This cluster consists primarily of older, experienced male employees who are single. These employees have worked with the company the longest and do the most overtime work. However, they are paid the least. In general, they are also the least satisfied with their working environment out of all the clusters. Additionally, these employees are largely in the R&amp;D department. Compared to the other clusters, these employees also have the lowest job level and the poorest job satisfaction.
Overall, these employees can be characterized as the ones who work the most but get paid the least.</p><p><strong><em>Cluster 1</em></strong></p><p>This cluster mostly contains males in their mid-20s to mid-30s who are paid the most of all the clusters and have the lowest number of years in their current roles. They tend to work overtime as well. Similar to Cluster 0, most of these employees are part of the R&amp;D department, with job titles such as ‘Research Scientist’ and ‘Laboratory Technician’. These employees also have the highest level of environmental satisfaction and job involvement among all the clusters. Even though these employees chose to leave, they were quite satisfied with their jobs. This cluster also contained the greatest number of divorced employees out of all the clusters.</p><p><strong><em>Cluster 2</em></strong></p><p>This cluster is comparatively the most balanced in terms of gender, with a male-to-female ratio of 1.8. Of all the clusters, the employees here switch roles the most, and they have the lowest number of years in their current roles with their current managers. They score surprisingly high in work-life balance, and most of them are married. This cluster also has a relatively equal number of employees who work overtime and those who don’t.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1fyti02KOrdTtcxeZiZ6rA.png" /><figcaption>Figure 3: Visualizing how the categorical features (Gender, Relationship Status, Overtime Status, Department) differ across the three clusters</figcaption></figure><h4>Future Steps</h4><p>Large corporations and small businesses can utilize our models’ results to understand their employees better, thus mitigating employee turnover. Companies can pass their employee data through the trained models, find common factors among employees projected to leave, and then make the necessary changes to ensure a lower employee turnover rate.
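For reference, the t-SNE-plus-K-Means pipeline behind the cluster analysis above can be sketched as follows; the data here is a synthetic stand-in for the one-hot encoded records of employees predicted to quit:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the employees predicted to quit.
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# Reduce to two dimensions with t-SNE, then group the embedded points
# into three clusters with K-Means.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
print(embedding.shape, sorted(set(labels)))
```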
In the future, it may be interesting to experiment with more advanced machine learning models, such as neural networks, to see whether their predictions are more accurate than those of the traditional machine learning models used in this study.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e17eaa327ca8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[RealTime — Real Estate Price Prediction]]></title>
            <link>https://utorontomist.medium.com/smiledetector-a-new-approach-to-live-smile-detection-c041f110f923?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/c041f110f923</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[price-prediction]]></category>
            <category><![CDATA[real-estate]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Sun, 12 Jun 2022 14:53:16 GMT</pubDate>
            <atom:updated>2022-08-07T17:54:16.449Z</atom:updated>
            <content:encoded><![CDATA[<h3>RealTime — Real Estate Price Prediction</h3><p><em>A UTMIST Project by Arsh Kadakia, Sepehr Hosseini, Matthew Leung, Arahan Kadakia, and Leo Li</em></p><p>RealTime is a machine learning project for predicting home prices in the GTA. It significantly improves on last year’s project, RealValue, by drawing on more data than ever before. Thanks to the effective use of data from a variety of demographic and public sources and an iterative database update process, RealTime is able to provide highly accurate home price predictions for brokerages and regular homeowners alike.</p><h4><strong>Motivation &amp; Goal</strong></h4><p>In Toronto, sold home prices rose explosively from January 2021 to January 2022, increasing by over 24% year-over-year. Despite interest rate increases slowing this rise recently, affording a good-quality home is now harder than ever.</p><p>In this environment, finding value and having full autonomy over and understanding of your housing decisions are critical. However, this experience is challenging in different ways for different stakeholders.</p><p>For brokerages, the fast-moving nature of home sales and an ever-changing housing market lead to significant negotiation uncertainty. The significant discrepancy between new and old home prices leads to decisions made on the “hunch” or “intuition” of realtors who may not have the best grasp of the market. A data-driven solution that is adaptable to changing times and considerate of different types of information can provide a distinct market advantage to the brokerage that has it, as it can offer better advice to its clients.
In both the short and long term, these well-grounded recommendations will lead to better relationships and, consequently, success.</p><p>For clients, the experience is challenging because you want to stay “in budget” for your dream house or your first investment property. The volatile market leads to people making once-in-a-lifetime home decisions on a “hunch” and then dealing with the consequences later. In many cases, clients overpay for homes, putting further financial strain on themselves and hamstringing their ability to grow their wealth later on. A solution driven by real-time insights and accurate predictions can lead to more confident decisions by clients.</p><p>RealTime addresses the problem of unclear market changes by giving brokerages and clients access to an ML-backed real estate solution that draws upon various data sources kept close to real time.</p><p>The goal of RealTime is to achieve 8% mean absolute percentage error (MAPE) on a final set of attributes determined by the available data sources.</p><p>A notable subgoal is to create a real-time, extensive database that is up-to-date with information from many data sources. Without such a database, the promise of a data-driven solution that is aware of current market developments doesn’t quite exist.</p><h4><strong>Dataset</strong></h4><p>One of the most prominent improvements in RealTime over last year’s project is the drastically restructured dataset. In last year’s project, RealValue, the dataset was manually constructed and consisted of only 157 homes with very limited corresponding attributes. This year, RealTime’s dataset is composed of over 30,000 homes across Toronto, giving us a much larger and more diverse set of data. Not only that, but each house also contains several more attributes, both numerical and categorical, as well as over 40 amenities, which further strengthen the dataset.
Some amenities include nearby universities, public transit, restaurants, and schools. In addition, new attribute choices include parking space, number of bedrooms, number of bathrooms, and type of home (detached, condo, etc.).</p><p>Another major enhancement in RealTime is that, unlike last year’s manual approach, the dataset now follows an automated, iterative update process that keeps it up-to-date and ensures reliable, accurate home prices. This also means that our database is constantly growing, which plays a major role in improving our performance.</p><h4>Workflow</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1XNa0oQ7N3p_O6DPn7kwVQ.png" /></figure><p>Data collection is an iterative process that involves fetching house data through an API request, saving it into our database, and periodically updating it with more detailed house information and amenities. This process is automated and executes every 12 hours to ensure our database always provides up-to-date, reliable information. To automate it, we use GitHub Actions with Docker container actions to create the builds. We run the actions on a Google Cloud Virtual Machine, ensuring that our hardware is consistent and can reliably update our database. Our database of choice is MongoDB due to its fast data access for read/write operations.</p><h4>Cleaning/Feature Engineering</h4><p><strong><em>Normalization</em></strong></p><p>The most complicated part of creating such a solution is appropriately pre-processing the different sources of received data.</p><p>Our group classified received features/attributes into three types: numerical, “easy” categorical, and “hard” categorical.</p><p>Attributes such as bedrooms, bathrooms, and longitude/latitude are numerical examples.
For numerical data, two types of normalization were used: max normalization and min-max normalization. Numerical attributes with more unique data points used min-max normalization; those with fewer unique points were treated with max normalization. These decisions were made by testing various configurations and noting patterns among the most accurate trials.</p><p>“Easy” categorical features consist of discrete features which could be directly one-hot encoded without further data processing. Examples include the house attachment style (e.g. attached, detached, semi-detached) and basement development (e.g. finished, unfinished, no basement).</p><p>“Hard” categorical features consist of features that need special handling in order to be useful. Examples include a dictionary of amenities near a house (e.g., a certain house has 3 subway stations, 1 university, 3 schools, and 4 supermarkets within a 2km radius) and a dictionary of room areas.</p><p><strong><em>Specific Choices</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6kSDCxhCOdknUe-val1JrA.png" /><figcaption>Example plots of analysis tools.</figcaption></figure><p>Once we incorporated data from different APIs, we realized that the best feature space involved significantly fewer attributes.</p><p>To reduce the number of attributes, we used a variety of assessment plots, including box-and-whisker plots and correlation plots, as shown above.</p><p>Box-and-whisker plots were used to examine the general trend presented by different values of a single attribute, and how similar those values were to one another. If the values were considered “too” similar, the attribute was removed from later training.</p><p>Correlation plots were used similarly.
Again, if the overall attribute slope was considered negligible (close to zero), the attribute was removed from later training.</p><p>Using these analysis tools, we reduced our feature space, which helped increase training accuracy and, more importantly, reduced overfitting, allowing test accuracy to grow.</p><h4>Training</h4><p>Upon further inspection of our dataset, we realized that a significant portion (approximately 65%) of our houses didn’t have stored images. For that reason, we had to evaluate the dataset with that issue in mind.</p><p>The training was conducted in two parts:</p><p>1. Training Without Images</p><p>In this part, we had access to approximately 17,000 fully complete homes. However, we could not utilize the notable attribute of images.</p><p>2. Training With Images</p><p>In this part, we had access to approximately 6,000 homes with photos. While this is a lower number of homes, we were able to include photos as a feature in this subset.</p><p><strong><em>Training Without Images</em></strong></p><p>We chose a set of regressors to test on our dataset, with justifications as follows:</p><p>1. Support Vector Regressor</p><ul><li>Method: Extends support vector machines into the regression space. A kernel is used to transform the data to a high-dimensional space, in which a linear regression can be applied to generate a hyperplane representing the data. The objective function of SVR also allows for some error tolerance.</li><li>Benefits: This regressor was chosen for its strength in working with higher-dimensional data and keeping prediction errors within a certain range. In addition, SVR has been used in the literature for housing price prediction.</li></ul><p>2.
Multi-layer Perceptron</p><ul><li>Method: A fully connected feedforward artificial neural network, consisting of layers of neurons with an activation function between each layer.</li><li>Benefits: MLPs are basic neural networks widely used in machine learning, with the capacity to solve complex problems.</li></ul><p>3. Ensemble of Regression Trees (Extra Trees Ensemble Regressor)</p><ul><li>Method: An ensemble method that uses many decision trees to make a prediction. In a regression problem, an average is taken of the decision tree results. Generally, the decision trees are fitted to the entire dataset.</li><li>Benefits: Ensembles of regression trees have been used in the literature for housing price prediction. Ensemble methods using decision trees are useful in that they create a more diverse set of regressors, which can then be averaged. Outliers do not have a significant impact, and decision trees can be easily interpreted and visualized. In addition, assumptions about the data distribution are not required.</li></ul><p>4. Gradient Boosting (Gradient Boosted Decision Trees)</p><ul><li>Method: Gradient boosting is usually used with an ensemble of decision trees (as in the case of Scikit-Learn) and allows for the optimization of loss functions.</li><li>Benefits: The optimization of loss functions usually allows this method to perform better than random forests.</li></ul><p>5. K-Nearest Neighbours Regression</p><ul><li>Method: This algorithm extends K-Nearest Neighbours classification to regression. To make a prediction for some input data point, the algorithm takes the K closest points to it (based on some distance metric), then averages the labels of those K points and uses that average as the prediction output. No explicit model is built.</li><li>Benefits: This algorithm is simple to implement, and no assumptions need to be made about the data.
No model needs to be built, because this algorithm relies only on computing distances between points. K-Nearest Neighbours has also been used in the literature for housing price prediction.</li></ul><p>6. Random-Forest Model</p><ul><li>Method: This method is similar to Extra Trees, except that the decision trees are fitted to random subsets of the dataset.</li><li>Benefits: Since Random-Forest fits decision trees to random subsets of the data, overfitting is controlled and accuracy can be improved. Ensemble methods using decision trees are useful in creating a more diverse set of regressors, which can then be averaged. Outliers do not have a significant impact, and decision trees can be easily interpreted and visualized. In addition, assumptions about the data distribution are not required.</li></ul><p>7. Huber Regressor</p><ul><li>Method: A linear regression model that uses the Huber loss function instead of squared error loss. In the Huber loss function, outliers are penalized less than under squared error loss.</li><li>Benefits: This model is less sensitive to outliers because of its loss function, making it useful for a large housing price dataset in which outliers are present.</li></ul><p>8. Ridge Regressor</p><ul><li>Method: A linear regression model that uses a linear least squares loss function with an additional L2 regularization term to penalize overfitting.</li><li>Benefits: Due to the regularization term, this model is effective at preventing overfitting.
This is particularly desirable for housing price prediction, in which we wish to obtain a generalizable model.</li></ul><p>Eventually, our best model was the <strong>ensemble of regression trees</strong>, with the full results shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UsRBTc3I_mtUN1lUA4gHCA.png" /></figure><p><strong><em>Training With Images</em></strong></p><p>After determining that the regression tree ensemble was the best model, we used two architectures to train on the homes with photos.</p><p>The first architecture was the same one used in last year’s RealValue project:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6clyylktQSYmJBKJxIk7yw.png" /></figure><p>In the above architecture, the CNN was chosen to be EfficientNetB0 due to its high accuracy on image classification problems.</p><p>The second architecture involved incorporating photos into the training of a traditional model such as a regression tree ensemble.</p><p>We did so by “flattening” the photos into a high-dimensional vector using an autoencoder. The overall design looks like the following:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Fc41TlNsyvL0pwmbAIVKxQ.png" /></figure><p>The input photo is passed through an encoder, and the resulting embedding is combined with the statistical input and eventually passed into the ensemble for inference.</p><p>Altogether, after considering both inputs, the following statistics emerge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qbw04SbLPfB8gEB51Mkg8A.png" /></figure><p>For the 6,000 houses with photos, it is clear that photos had a positive impact on the test accuracy, improving it by nearly 1.5% MAPE.</p><p>However, it is also evident that the number of houses limits the effectiveness of the overall dataset.
While using the full dataset of 17,000 houses, the test accuracy improves by 1% MAPE.</p><p>For that reason, we can conclude that the dataset’s length matters more than its width for this project. Altogether, we would prefer to have both in future projects, as it is clearly demonstrated that having more attributes increases accuracy.</p><h4>Takeaways</h4><p>Creating a self-updating database requires APIs that are mostly unchanging, not rate-limited, and need little-to-no maintenance. Utilizing APIs in a manner that serves those three qualities was our biggest challenge.</p><p>For example, real estate APIs are now either rate-limited or provide little in the way of extensive attributes.</p><p>Our next steps include making our request allocations more efficient so we can serve more homes. In addition, we want to draw upon further data sources and integrate more APIs so that we can build a wider dataset while maintaining a large number of rows.</p><h4>Conclusion</h4><p>Altogether, we realized that high-quality data is arguably the most important component of an elite machine learning algorithm. Despite creating interesting architectures in last year’s project, our larger dataset this year led to a 6% MAPE improvement. As our best MAPE is less than 10%, we can now conclude that our real estate price estimator is highly accurate.</p><p>You can check out our work on the <a href="http://real-time.ca">real-time.ca website</a> and try our price estimator for yourself!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c041f110f923" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SmileDetector — a new approach to live smile detection]]></title>
            <link>https://utorontomist.medium.com/smiledetector-a-new-approach-to-live-smile-detection-8b4b8bf26b3d?source=rss-fac2a6cb587d------2</link>
            <guid isPermaLink="false">https://medium.com/p/8b4b8bf26b3d</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[facial-recognition]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[smile-detection]]></category>
            <dc:creator><![CDATA[University of Toronto Machine Intelligence Team]]></dc:creator>
            <pubDate>Sat, 11 Jun 2022 23:17:05 GMT</pubDate>
            <atom:updated>2022-06-22T02:41:32.349Z</atom:updated>
            <content:encoded><![CDATA[<h3>SmileDetector — A New Approach to Live Smile Detection</h3><p><em>A UTMIST Project by Noah Cristino, Bradley Mathi, Daniel Zhu, and Mishaal Kandapath.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*qUVTS6TMts8OKVtF.png" /><figcaption><a href="https://github.com/NoahCristino/SmileDetector">https://github.com/NoahCristino/SmileDetector</a></figcaption></figure><p>SmileDetector detects smiles in images using a new multi-step approach: it performs facial landmark detection and then runs the detected facial vectors through our custom C-Support Vector Classification model. By reducing the model input size and approaching the problem in a new way, we were able to produce a model that performs classifications with high accuracy in a fraction of the time that traditional models take. This technology has many applications, including engagement analysis, marketing, and photography.</p><h4><strong>Motivation and Goal</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D-53s98PpfCsurguS0Gwvw.png" /></figure><p>The goal of this project was to detect smiles in pictures and video footage using a machine learning model. More specifically, we wanted to create a model that was more optimized and ran faster than previous models. Competing models take the full image as input and attempt to find a smile anywhere in it, meaning that much more computation time is needed. Our goal was to use a hybrid approach that first uses the DLib library to turn the image into vector points. The ML model then directly analyzes those vector points, meaning there is far less data to process.
We hoped this would lead to similarly accurate results and a much faster processing time.</p><h4><strong>Dataset</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3M1jyYAL3cKH4Beh.jpeg" /><figcaption>GENKI-4000 Dataset</figcaption></figure><p>We used the GENKI-4000 Dataset compiled by the MPLab at UC San Diego to train our model. This dataset consists of 4,000 images that all contain faces that are either smiling or not smiling. This dataset is tagged with 1 = smiling, 0 = not smiling.</p><h4><strong>Data Processing</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/828/0*617tfPXH_CgKopJ8.png" /><figcaption>DLib applied to face</figcaption></figure><p>Our program takes the input image, resizes it, and converts it to grayscale. This image is then passed into the DLib model, and facial vectors are generated. We use the vector representing the bridge of the nose to rotate the facial vectors so that the nose bridge is perfectly vertical. The vectors are then localized to deal with the fact that the face can be a different size in each image. We then take these vectors and add the ones around the mouth to a list. This list is passed into our SVC model, which detects whether the image contains a smile or not.</p><h4><strong>Network Structure</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/395/0*Bf0tYGd3BgXJ6wO0.jpeg" /><figcaption>SVC Model Diagram</figcaption></figure><p>The DLib Facial Detect model uses a Histogram of Oriented Gradients (HOG) feature combined with a linear classifier, an image pyramid, and a sliding window detection scheme described in this <a href="https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Kazemi_One_Millisecond_Face_2014_CVPR_paper.pdf">paper</a>. This model is pretrained and provided in the DLib python library. 
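The classification stage can be sketched roughly as follows; the landmark vectors here are synthetic stand-ins for the dlib mouth landmarks, and the clean class separation is artificial:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for flattened (x, y) mouth-landmark vectors; dlib's 68-point
# predictor places the mouth at landmark indices 48-67 (20 points).
n_coords = 20 * 2
smiling = rng.normal(loc=1.0, scale=0.2, size=(50, n_coords))
neutral = rng.normal(loc=-1.0, scale=0.2, size=(50, n_coords))
X = np.vstack([smiling, neutral])
y = np.array([1] * 50 + [0] * 50)  # 1 = smiling, 0 = not smiling

# C-Support Vector Classification on the landmark vectors.
clf = SVC(C=1.0, kernel="rbf").fit(X, y)
print(clf.predict(smiling[:3]))
```

Because the classifier sees only a few dozen coordinates instead of full image pixels, inference is cheap enough to run per frame.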
Our SVC model was trained using lists of vectors as input data, mapped to a list of booleans indicating smiling or not smiling.</p><h4><strong>Additional Features</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gJLoT4534mNwWqSP.png" /><figcaption>Video processed by the model</figcaption></figure><p>In addition to creating the basic smile detector, we were able to use the same model to create a video parsing feature. Using this feature, one can upload a video and then receive an edited version containing only the sections where someone is smiling. This works by analyzing keyframes in the video to see whether they contain a smile, as opposed to processing every frame, which makes for faster processing times. Using this technique, we were able to process large video files in a relatively short amount of time and increase the usefulness of the model we created.</p><h4><strong>Future Applications</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/795/0*zoBE0OF7L4p80PHU.jpeg" /><figcaption>Mood Analysis</figcaption></figure><p>Live video smile detection has many potential uses in products or features that improve consumers’ quality of life. For example, this technology could be used to create a camera app feature that takes a picture whenever the subject is smiling. This could be especially useful for photographing children or babies, as it is often difficult to keep them smiling for long periods. Another use would be as a passive application that someone could enable while watching videos or streaming content, allowing them to see how much they smiled and gauge how much they enjoyed the video or stream.</p><h4><strong>Conclusion</strong></h4><p>Our project resulted in a much faster model than competing smile detection models while maintaining high accuracy.
This speed allowed us to perform live video analysis at up to 720 FPS, letting us analyze webcam footage in real-time. It makes it possible to use our model for applications that were previously impossible with competing models. Furthermore, we can analyze higher-resolution cameras in real-time and run our model alongside others due to its small size. This allows us to generate additional data that can be used alongside the user’s emotion for analyzing engagement. Shrinking the resources required to create a faster model has allowed us to use this technology in new ways and outperform existing models.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8b4b8bf26b3d" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>