Inspiration

Kids lose focus easily. Unless you give them something they really like, it's next to impossible to hold their attention. We found that out the hard way: Ankur's brother just couldn't keep up with his English class and ended up failing miserably. Realizing that his brother was afraid of books, just like many other kids, Ankur suggested that he watch informational YouTube videos that served the same purpose. But those efforts went in vain, as his brother would just end up watching cartoons on YouTube instead. So we came up with Espial. Using Neural Style Transfer, Espial not only converts boring documentaries into cartoons but also provides a bunch of other perks to improve both grammar and general knowledge.

What it does

Espial performs neural style transfer to make boring videos, like documentaries, look like cartoons. We then convert the voice in the video to text and analyze it. Small kids have a hard time understanding complex sentences, so we simplify the complex sentences in the video to give them a better understanding. Part-of-speech tagging provides kids with all the parts of speech used during the video. Kids love picture books, so we build on that concept by adding pictures alongside the tags to help them learn. General knowledge also plays an immense part in the learning process: we perform named entity recognition to pick out the names of people, places, and institutions from the video content and help kids get familiar with them through images.

How we built it

**Note:** The entire codebase was written in Python 3.6.

Modules:

Neural Style Transfer:

Text Simplification: First, the audio was extracted from the video and converted to text using the Rev speech-to-text API. The sentence-wise text summarization was implemented in TensorFlow as a sequence-to-sequence model. It uses an LSTM (Long Short-Term Memory) network with 2 layers of 150 LSTM cells each. The LSTM is bidirectional, i.e., it reads each sequence both forwards and backwards, so training on the current batch is informed by both the tokens behind and ahead of each position.
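The encoder described above can be sketched in a few lines of Keras (the original was plain TensorFlow; the vocabulary and embedding sizes below are assumptions, while the 2 bidirectional layers of 150 LSTM cells each come from the write-up):

```python
# Sketch of the summarizer's encoder: 2 bidirectional LSTM layers,
# 150 cells per direction. VOCAB_SIZE and EMBED_DIM are hypothetical.
import tensorflow as tf

VOCAB_SIZE = 10_000
EMBED_DIM = 128

encoder = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Each direction has 150 cells, so outputs are 300-wide per timestep.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(150, return_sequences=True)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(150, return_sequences=True)),
])

# A dummy batch of 1 sentence, 20 token ids.
out = encoder(tf.ones((1, 20), dtype=tf.int32))
print(out.shape)  # (1, 20, 300)
```

A full seq2seq summarizer would pair this encoder with an attention-equipped decoder and beam search, as mentioned in the Challenges section.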

Voice Modification: We downloaded videos of the Chipmunks speaking and converted their audio to text. We manually annotated the text corresponding to different time-stamps of the video, then used the Microsoft Azure custom voice API to create a voice generator. Feeding it the text for the desired video produces the new audio.
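The manual annotation step can be represented as a simple list of (start, end, transcript) segments that are later joined into the script sent to the voice generator. The segment times, sample text, and helper name below are all hypothetical:

```python
# Hypothetical annotation format: (start_sec, end_sec, transcript)
# for each time-stamped stretch of the source video.
segments = [
    (0.0, 2.4, "Hello kids, welcome back!"),
    (2.4, 5.1, "Today we learn about volcanoes."),
]

def script_for_synthesis(segments):
    """Join the annotated lines into one script for the voice API."""
    return " ".join(text for _, _, text in segments)

print(script_for_synthesis(segments))
```

Keeping the time-stamps alongside the text also makes it possible to align the generated audio back to the stylized video frames.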

POS Tagging and Named Entity Recognition: Both start from the text converted from the video's audio. The sentences are tokenized and divided into chunks; NLTK was used to implement both use cases.

Challenges we ran into

The initial style transfer algorithm we wrote needed to be retrained for every new content image, which made it the wrong choice for our application. Even after the model was built, predicting the stylized image for every frame of the video took a lot of time.

Training the LSTM network to produce sentence summaries. I wanted the network to be lightweight so it would produce summaries in near real time, so I started with 2 layers of 64 cells each. However, that was not enough to fit the data, and most of the time it produced nonsensical sentences. Tuning the network was definitely time-consuming and tedious; the final hyperparameters are 2 layers of 150 cells each and a beam width of 10 for the attention mechanism.

Accomplishments that we're proud of

Almost real-time conversion from normal video to stylized video.

We achieved decent results on text simplification; at least most of the output looked like proper English sentences.

What we learned

A lot! To be honest, this was the first time we worked on Neural Style Transfer and Natural Language Processing. Over the two days, we learned about training large neural nets, working in the cloud, using amazing APIs, and so much more.

What's next for Espial

We would like to incorporate language translation, which would let users who speak different languages use Espial. To make things even more fun and engaging, we would also like to perform a more personalized neural style transfer based on the specific cartoon each kid loves. Wouldn't it be amazing if kids could learn so much more with the help of their favorite cartoons?
