Inspiration
When thinking of practical problems we could tackle within the scope of a hackathon using our programming and software development skills, translating languages came to mind. Although existing tools can already translate and display text from images, there are few open-source implementations and little support for video formats. Therefore, we wanted to try writing this ourselves using a top-down design.
Key parts
We fundamentally needed to accomplish a few tasks in order to get the results we wanted:
- Run OCR on an image to get recognized text and corresponding positions
- Feed the recognized text into a translation engine to get the target language result
- Superimpose the result onto the original image, matching the text's original positions
How we built it
For OCR, we used tesserocr, a Python wrapper for the Tesseract OCR engine. Efforts here mainly focused on storing each piece of recognized text in a tuple along with its position in the image, then passing a list of these tuples on for further processing in other parts of the program.
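As a rough illustration of that step (a sketch, not our exact code), the tesserocr pattern looks something like the following; `RIL.TEXTLINE` groups results by line, and the box dictionaries carry the pixel coordinates we keep alongside each string. The `lang="jpn"` argument assumes the Japanese traineddata is installed.

```python
from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

def extract_text_regions(image: Image.Image, lang: str = "jpn"):
    """Return a list of (text, box) tuples, where box has x/y/w/h pixel keys."""
    regions = []
    with PyTessBaseAPI(lang=lang) as api:
        api.SetImage(image)
        # Bounding boxes for each detected line of text
        for _, box, _, _ in api.GetComponentImages(RIL.TEXTLINE, True):
            # Restrict recognition to this line's rectangle and read it out
            api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
            text = api.GetUTF8Text().strip()
            if text:
                regions.append((text, box))
    return regions
```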
We used the Google Cloud Translation API to automatically translate the detected text into a target language (in the Colab demo, we tested Japanese to English), then cleaned the result and passed it on to the image manipulation routine.
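A minimal sketch of that call with the `google-cloud-translate` v2 client is below; the `key.json` path is a placeholder, and part of the "cleaning" step shown here is undoing the HTML escaping the API can apply to its output.

```python
import html
import os
from google.cloud import translate_v2 as translate

# The client picks up the service-account key from this environment variable;
# "key.json" is just a placeholder path for illustration.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "key.json")

def translate_texts(texts, target="en"):
    """Translate a list of strings and return the cleaned translations."""
    client = translate.Client()
    results = client.translate(texts, target_language=target)
    # Each result dict has 'translatedText' (plus the detected source language);
    # unescape entities such as &#39; that the API may return
    return [html.unescape(r["translatedText"]) for r in results]
```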
Images were processed using Pillow: we first drew a rectangle of the region's average color over each text area reported by the OCR step, serving as a clean background; then the translated text was drawn over it and centered so that the overall layout of the original image was preserved.
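A condensed sketch of that routine (default font, RGB image assumed) is shown below; `ImageStat` gives the per-band average for the background patch, and `textbbox` is used to center the replacement string.

```python
from PIL import Image, ImageDraw, ImageFont, ImageStat

def overlay_translation(image: Image.Image, text: str, box: dict, font=None) -> None:
    """Cover one OCR box with its average color, then draw centered text on top."""
    font = font or ImageFont.load_default()
    x, y, w, h = box["x"], box["y"], box["w"], box["h"]
    # Average color of the region serves as a clean background (RGB image assumed)
    avg = tuple(int(c) for c in ImageStat.Stat(image.crop((x, y, x + w, y + h))).mean[:3])
    draw = ImageDraw.Draw(image)
    draw.rectangle((x, y, x + w, y + h), fill=avg)
    # Center the translated text inside the rectangle
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    tw, th = right - left, bottom - top
    draw.text((x + (w - tw) / 2, y + (h - th) / 2), text, fill="black", font=font)
```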
For video processing, we split incoming mp4s into their video frames and processed each one as an individual image, then recombined them using the OpenCV video writer.
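The frame loop is roughly as follows; `process_frame` stands in for the OCR/translate/overlay steps above, and `mp4v` is one codec choice that commonly works for .mp4 output with OpenCV.

```python
import cv2

def process_video(in_path: str, out_path: str, process_frame) -> None:
    """Run process_frame (the OCR/translate/overlay step) on every frame of a video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(process_frame(frame))
    cap.release()
    writer.release()
```

OpenCV frames come back as BGR NumPy arrays, so converting to and from Pillow images (e.g. with `cv2.cvtColor` and `COLOR_BGR2RGB`) is needed before reusing the image pipeline on each frame.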
Challenges we ran into
We attempted to build a basic front end for the project, but due to time constraints we ran into unresolved issues with Flask's file-path routing and with serving media to the browser. However, most of it still works, including the POST request handling, so feel free to look at it in the overall GitHub repository.
Recognition accuracy was very unreliable at first, until we added some image preprocessing using built-in Pillow functions and locally equalized the image, which made it easier for Tesseract to pick out the text.
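We haven't reproduced the exact preprocessing here, but a simplified Pillow-only version in the same spirit (grayscale plus histogram equalization; the real pass equalized locally rather than globally, and the sharpen step is an extra assumption) looks like this:

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """Simplified cleanup pass before handing an image to Tesseract."""
    gray = ImageOps.grayscale(image)
    # Equalizing the histogram boosts contrast between text and background;
    # the actual pipeline applied equalization locally rather than over the whole image
    equalized = ImageOps.equalize(gray)
    # A light sharpen can help Tesseract separate thin strokes
    return equalized.filter(ImageFilter.SHARPEN)
```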
There were initial hurdles in cleaning the text before translation, as well as some difficulty working through the API documentation to properly set up the private key and authentication through Google.
Centering the text inside the defined box was also finicky along the way, but we had it solved by the end, along with an efficient average-color calculation.
What we learned
Overall, we gained a stronger grasp of the relevant Python libraries and image-manipulation techniques, since most of our previous background from schoolwork lay in C-like languages. Collaborating as a team under the 24-hour time limit helped us develop cooperative skills, and being able to meet and network with other people at the hackathon was an added benefit.
What's next for babbler
Ideally, if work on the project were extended, babbler could be implemented as a mobile app. It would probably hook into the screen-reader functions exposed through the accessibility features on Android/iOS, grabbing screenshots that way and temporarily overlaying the translated versions on the screen.
Translation quality on videos could also be improved with further programming: regions of text that stay similar across frames could pull from a shared translation that is called once, instead of re-running translation on every frame (which currently can produce rapidly moving or changing text in the output).
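One simple way to get there would be memoizing translations on the recognized string (or a normalized form of it), so the API is only called the first time a given piece of text appears; a rough sketch, assuming the same v2 client as above:

```python
from functools import lru_cache
from google.cloud import translate_v2 as translate

_client = translate.Client()

@lru_cache(maxsize=None)
def translate_cached(text: str, target: str = "en") -> str:
    """Call the translation API only the first time a given string is seen."""
    return _client.translate(text, target_language=target)["translatedText"]
```

Since OCR output can jitter slightly between frames, a fuzzier key (for example, comparing box positions or edit distance against recently seen strings) would probably be needed in practice.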
It could also be useful to implement a keyword system if we dug deeper into the translation pipeline, where users could specify source-language words that should always translate to a fixed result when they know it would be more accurate in context, while still preserving the intended grammatical structure.