Inspiration

Sentiment analysis uses natural language processing to detect the polarity (positive, neutral, negative) of a text, which can be used to draw conclusions about a population. It is a common challenge that programmers have solved for years and is used by companies around the world to gauge how people feel about certain topics such as the stock market. We are all passionate about Python and machine learning, so we put our skills to the test to gain more experience with this area and have some fun!

What it does

tFUSE is a Python model that uses machine learning to perform sentiment analysis on a random selection of tweets, classifying each tweet as carrying positive or negative sentiment. Tweets underwent preprocessing that included lowercasing, regex cleanup, stopword removal, normalization, and stemming.

How we built it

Libraries

We decided to use TensorFlow to construct the model, alongside additional Python data science libraries such as Pandas and NumPy.

Preprocessing

Due to the informal nature of tweets and other fast, text-message-style communication, preprocessing was essential for effective natural language processing (NLP) and deep learning.

The most fundamental and perhaps most effective step is lowercasing all the text data. Although this technique matters most when words appear sparsely in smaller datasets, lowercasing still proved beneficial here, improving validation accuracy by approximately 1% to 3%.

Tweets often contain expressions that may not contribute to the overall sentiment, such as user handles (@janedoe), hashtags (#ignitionhacks2020), and links (www.ignitionhacks.org). These expressions tend to add noise to the dataset, since many require additional context, such as knowledge of a user's history or the content of a linked webpage, to provide a substantive sentiment relation. It is worth noting that, in this dataset, hashtags were found to be beneficial for understanding sentiment, at least as measured empirically by the accuracy metric. A possible hypothesis is that certain emotions are associated with these tags, which could in fact be used for sentiment analysis on their own. Regex was used to clean up the data by removing handles and links, as well as punctuation and numbers.
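As a rough illustration of that cleanup step, the sketch below applies lowercasing and a few regex substitutions, keeping hashtags since they proved useful here. The exact patterns and the function name are assumptions rather than the project's actual code.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip handles, links, punctuation, and numbers.

    A minimal sketch of the regex cleanup described above; the patterns and
    function name are assumptions, not the project's exact code.
    """
    text = text.lower()                                 # lowercasing
    text = re.sub(r"@\w+", "", text)                    # drop user handles (@janedoe)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # drop links
    text = re.sub(r"[^a-z\s#]", "", text)               # drop punctuation and numbers, keep hashtags
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace

print(clean_tweet("Loving #ignitionhacks2020! @janedoe check www.ignitionhacks.org :)"))
# -> "loving #ignitionhacks check"
```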

To further reduce noise and improve sentiment analysis performance, stopword removal, normalization, and stemming were used. The English language contains many short, common words that add no context for NLP, such as 'a' or 'the'. These stopwords were removed by tokenizing the sentences and checking for matches against a list provided by NLTK. To keep normalization times low, a simple NFKD (Normalization Form Compatibility Decomposition) Unicode normalization was applied to remove special characters. Afterwards, the Porter stemming algorithm reduced words to their root form, allowing for more consistent sentence encoding.
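A minimal sketch of those three steps, assuming NLTK's English stopword list and Porter stemmer (simple whitespace tokenization is used here for brevity):

```python
import unicodedata

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def normalize_tweet(text):
    """Apply NFKD normalization, stopword removal, and Porter stemming."""
    # NFKD unicode normalization, discarding characters with no ASCII mapping
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    tokens = [t for t in text.split() if t not in STOPWORDS]   # drop NLTK stopwords
    return " ".join(STEMMER.stem(t) for t in tokens)           # reduce words to roots

print(normalize_tweet("the organizers are running a great hackathon"))
# -> roughly "organ run great hackathon"
```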

Perhaps the most crucial portion of the algorithm’s preprocessing is text enrichment using techniques such as word and sentence embedding.

Encoding

Most NLP pipelines use basic tokenization followed by an embedding layer such as 'tf.keras.layers.Embedding'. In order to improve upon this and create a model that could be effective on different types of text data, word vectorization and sentence encoding were tested. Both methods convert strings of text into vectors of floats whose semantic similarity can be measured with metrics such as cosine similarity or Euclidean distance. This is superior to conventional embedding layers because it captures semantic similarity between different words in the case of word embedding, as well as word-sentence context in the case of sentence embedding.
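As a toy illustration of how similarity between two embedding vectors can be scored with cosine similarity (the vectors below are made up for the example, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for word embeddings.
happy  = np.array([0.9, 0.1, 0.3])
joyful = np.array([0.8, 0.2, 0.4])
angry  = np.array([-0.7, 0.6, 0.1])

print(cosine_similarity(happy, joyful))  # close to 1: semantically similar
print(cosine_similarity(happy, angry))   # much lower: semantically dissimilar
```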

A disadvantage of word embedding is that it loses context, making it susceptible to spelling mistakes, multi-word expressions, abbreviations, and words with similar meanings. After testing, sentence encoding proved to be the better approach, consistently providing close to a 2% increase in validation accuracy compared to the same model with the default Keras embedding. However, these encodings came at the cost of long run times, and there are also averaging techniques that can create contextual relationships between word vectors. Nonetheless, the real strength of sentence encoding is that it provides associations between phrases such as “Ignition Hacks” and “Hackathon”, as well as greater flexibility for the model to handle new data or train on other languages.
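The Modelling section below mentions the Universal Sentence Encoder; a minimal sketch of encoding tweets into fixed-length vectors with it might look like the following (the TensorFlow Hub module version is an assumption):

```python
import tensorflow_hub as hub

# Public Universal Sentence Encoder module on TensorFlow Hub (version assumed).
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

tweets = [
    "ignition hacks is an amazing hackathon",
    "my code keeps crashing and it is so frustrating",
]

# Each tweet becomes a fixed-length 512-dimensional float vector.
embeddings = encoder(tweets)
print(embeddings.shape)  # (2, 512)
```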

Model Type

The large amount of training data, as well as the lack of sentiment clarity in many tweets (as manually observed), led us to use deep learning models with a larger number of layers. This allows the model to learn more subtle patterns within the data and make full use of the dataset.

Train-Test Split

Because the dataset contains only one feature, the tweet itself, it was difficult to gather information about it without using natural language processing tools. We did note, however, that the training data had an exact 50-50 split between tweets labeled positive and tweets labeled negative. The split between training and validation data was therefore done at random, with approximately 15% of the data used for validation and the remaining 85% for training.
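A minimal sketch of that split, assuming scikit-learn's train_test_split and placeholder data in place of the encoded tweets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the encoded tweets and their labels
# (0 = negative, 1 = positive); the 512-dimensional shape mirrors sentence embeddings.
x = np.random.rand(1000, 512).astype("float32")
y = np.random.randint(0, 2, size=1000)

# Random 85/15 train-validation split, as described above.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.15, shuffle=True, random_state=42
)
print(x_train.shape, x_val.shape)  # (850, 512) (150, 512)
```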

Modelling

A variety of models were created and tested on the training data: one with only dense and dropout layers, a convolutional neural network, a bidirectional long short-term memory (LSTM) network, and a gated recurrent unit (GRU) network. These were trained for 5 to 40 epochs, depending on how long each epoch took, and their architectures were tuned by adding dense and dropout layers and by changing parameters such as the number of nodes in a layer or the activation function used. Each network was tried both with the Keras tokenizer and with data encoded by the Universal Sentence Encoder; a sketch of one such variant is shown below.
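The following is a rough sketch of the bidirectional LSTM variant on the Keras tokenizer path; the vocabulary size, sequence length, layer widths, and dropout rate are assumptions rather than the exact hyperparameters used.

```python
import tensorflow as tf

VOCAB_SIZE = 20000  # assumed tokenizer vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40,)),                        # padded token sequences (length assumed)
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),                # token embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # bidirectional LSTM
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),            # positive vs. negative
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```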

Challenges we ran into

We ran into difficulties when encoding the contestant judging data because of two computer and Colaboratory crashes, which delayed our project for hours. Fortunately, the Ignition Hacks team (Grace) gave us a more flexible submission time so we could finish our encoding. In addition, we could not implement all the preprocessing that we wanted, and some of our models did not yield the desired results. We were also limited to the optimization techniques available in the TensorFlow API and constrained by the hackathon's time limits. Pre-trained models and other transfer-learning approaches built on robust NLP models already come with tuned hyperparameters, but they either require immense resources to train (with runtimes well beyond the scope of this hackathon) or rely on pretrained weights and biases that have already consumed large external datasets outside of the hackathon, delivering strong results simply by being applied to the given dataset.

Accomplishments that we're proud of

We are proud of the model we built, which uses modelling and preprocessing techniques that we all had to learn during the hackathon. In addition, all of our group members have little to no experience competing in hackathons, yet we were able to work well together remotely to create a project we are proud of.

What we learned

We all learned more about Python and machine learning as well as how to use other resources like Colaboratory to test our code. We tested many models and layers, which gave us all a better understanding of TensorFlow. In addition, we learned how to use preprocessing to get more accurate results.

What's next for tFUSE

To obtain more accurate predictions, an ensemble of different models could be used. Each model would make a prediction for every test tweet, and the results could be averaged to produce a better estimate of the tweet's sentiment by reducing the effect of any individual model overfitting. A limitation we faced, however, is that the short timespan of the hackathon did not allow some models to be trained adequately, and this step would require additional computational resources and time.
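A minimal sketch of that soft-voting idea, assuming a list of already-trained Keras models that each output a sentiment probability:

```python
import numpy as np

def ensemble_predict(models, x_test):
    """Average per-tweet sentiment probabilities across several trained models.

    A simple soft-voting ensemble; the individual models would be the variants
    described in the Modelling section.
    """
    probs = [model.predict(x_test).ravel() for model in models]
    avg = np.mean(probs, axis=0)        # mean probability per tweet
    return (avg >= 0.5).astype(int)     # 1 = positive, 0 = negative
```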

In the training graphs of many models, the validation accuracy appeared to level off while the training accuracy continued to increase. This is classic overfitting, which we addressed manually in our case; it could have been handled more efficiently with early stopping, using the tf.keras.callbacks module to checkpoint models to an H5 file and revert to the model least likely to be significantly overfitted.
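A minimal sketch of that early-stopping setup, assuming Keras callbacks (the checkpoint file name, monitored metric, and patience value are placeholders):

```python
import tensorflow as tf

callbacks = [
    # Stop training once validation accuracy stops improving, keeping the best weights.
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                     patience=3,
                                     restore_best_weights=True),
    # Save the best model so far to an H5 file so it can be reverted to later.
    tf.keras.callbacks.ModelCheckpoint("best_model.h5",
                                       monitor="val_accuracy",
                                       save_best_only=True),
]

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=40,
#           callbacks=callbacks)
```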
