Inspiration
Last year @Petru was working in the IT department of a non-IT company while it was migrating its phone systems from ISDN to VoIP (traditional telephones to IP-based phones). As expected, the workers needed some time to adapt to the changes before they were ready to provide the necessary support. However, the number of support calls exceeded all expectations, to the point where the staff could not manage them all.
One important observation was that many of the trivial, common problems customers reported could have been solved simply by carefully reading the provided manual. On the other hand, some urgent problems (for example, calls randomly dropping) had to be addressed by the organisation immediately.
When we met at the hackathon, we realised that this problem is evident in the customer support departments of many companies. Simply following a FIFO strategy to answer customer calls is not enough, since urgency is rarely assessed in these situations.
The emotional state of a customer almost always reflects the urgency of a support call. So we thought: "Why not prioritise incoming support calls according to the customer's emotional state?"
About our hack
The hack is essentially a REST API that identifies the emotion of the speaking customer and prioritises the call by assigning it a score. This works in two parallel steps: the customer's voice and their choice of words together determine the overall emotion. Depending on the prioritisation score assigned, the call is placed at the appropriate position in the queue. To prevent starvation, every waiting low-priority call has its priority incremented whenever a new incoming call arrives for processing.
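The anti-starvation queue described above can be sketched as a simple priority queue with aging. This is a minimal illustration, not the project's actual implementation; the class name, the `aging_step` parameter, and the convention that lower numbers mean higher priority are all assumptions:

```python
import heapq
import itertools

class AgingCallQueue:
    """Priority queue of support calls with aging to prevent starvation.

    Lower `priority` values are served first. Each time a new call is
    enqueued, every waiting call gets a small priority boost (aging),
    so a low-priority call cannot wait forever.
    """

    def __init__(self, aging_step=1):
        self._heap = []                 # entries: [priority, seq, caller_id]
        self._seq = itertools.count()   # tie-breaker preserving FIFO order
        self.aging_step = aging_step

    def enqueue(self, caller_id, priority):
        # Age every waiting call before the new one is inserted.
        for entry in self._heap:
            entry[0] -= self.aging_step
        heapq.heapify(self._heap)       # restore the heap invariant
        heapq.heappush(self._heap, [priority, next(self._seq), caller_id])

    def dequeue(self):
        priority, _, caller_id = heapq.heappop(self._heap)
        return caller_id, priority
```

With `aging_step=1`, a call enqueued at priority 20 gains one point of urgency for every later arrival, so it eventually reaches the front even if urgent calls keep coming in.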
How we built it
- As the end product was supposed to be a REST API, we decided to scaffold it with Flask.
- We started off by trying to classify emotions from voice samples, training a Keras-based CNN on a dataset of human voice recordings acted out for various emotional states. After a rigorous web search, we procured the datasets we needed from https://zenodo.org/record/1188976 and http://emodb.bilderbar.info/index-1024.html. However, the two datasets were incompatible with each other, so we had to bring them to a common format before our model could be trained on them, which required some preprocessing.
- The next step was converting speech to text (we used IBM Watson for this). The resulting text was then fed to a simple sentiment analyser that we adopted from an existing open-source project.
- To integrate the results from both models, we developed a simple algorithm that assigns each call a prioritisation score.
- Finally, we built a not-so-amazing demo webapp that could call the API and maintain a queue based on several different incoming voice samples.
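The Flask skeleton from the first step can be sketched as a single endpoint that accepts the raw .wav file in a POST request (no base64). The route name `/prioritise`, the form field `audio`, and the stubbed `score_call` function are assumptions for illustration, not the project's real code:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def score_call(wav_bytes):
    # Stand-in for the real pipeline: the voice-emotion CNN plus the
    # speech-to-text sentiment analyser. Here it just returns a fixed
    # neutral score.
    return 50

@app.route("/prioritise", methods=["POST"])
def prioritise():
    # The .wav file is sent directly as a multipart upload under the
    # (hypothetical) form field "audio".
    audio = request.files.get("audio")
    if audio is None:
        return jsonify(error="missing 'audio' file"), 400
    score = score_call(audio.read())
    return jsonify(priority=score)
```

A client would then POST the recording with, e.g., `curl -F "audio=@call.wav" http://host/prioritise` and read the priority from the JSON response.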
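The score-integration step can be illustrated as a weighted combination of the two model outputs. The urgency weights per emotion, the `voice_weight` split, and the 0-100 scale below are all hypothetical values chosen for the sketch, not the algorithm the team actually used:

```python
# Hypothetical mapping of voice emotions to urgency (0 = calm, 1 = critical).
EMOTION_URGENCY = {
    "angry": 0.9, "fearful": 0.8, "sad": 0.6,
    "neutral": 0.3, "happy": 0.1,
}

def priority_score(emotion_probs, text_sentiment, voice_weight=0.6):
    """Combine voice-emotion probabilities with text sentiment.

    emotion_probs: dict of emotion -> probability (from the voice CNN)
    text_sentiment: float in [-1, 1]; negative means an upset customer
                    (from the text sentiment analyser)
    Returns an integer score in [0, 100]; higher means more urgent.
    """
    # Expected urgency under the voice model's probability distribution.
    voice_urgency = sum(EMOTION_URGENCY.get(e, 0.3) * p
                        for e, p in emotion_probs.items())
    # Map sentiment [-1, 1] to urgency [1, 0]: angrier text -> more urgent.
    text_urgency = (1 - text_sentiment) / 2
    score = voice_weight * voice_urgency + (1 - voice_weight) * text_urgency
    return round(100 * score)
```

For example, a caller classified as fully angry in voice with maximally negative text lands near the top of the scale, while a happy caller with positive text lands near the bottom.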
Challenges we ran into
- The main challenge was, and still is, deploying the API on the VPS provided by REG.RU.
- Encoding audio and transferring it over HTTP with base64. We later figured out how to skip base64 and transfer the .wav file directly in the POST request instead, which worked well.
- There are few emotionally labelled human-voice datasets available on the web.
- Lack of time to prepare for a proper presentation.
Accomplishments that we're proud of
We completed a fully functional demo of the API, along with a webapp that maintains a priority-sorted queue.
What we learned
We sharpened our ML skills and learnt how to deploy and manage APIs on a VPS.
What's next for Priemo
- Improve classifiers
- Content clustering (Audio FAQ)
- Add emotional heat-maps
- Maybe CRM integration
- Extend use cases
Built With
- css
- django
- flask
- html
- javascript
- json
- keras
- machine-learning
- python
- shell
- tensorflow
