Inspiration
When thinking of practical problems we could tackle within the scope of a hackathon using our programming and software development skills, translating languages came to mind. Although existing tools can already translate and display text from images, there are few open-source implementations and little support for video formats. Therefore, we wanted to try writing this ourselves using a top-down design.
Key parts
We fundamentally needed to accomplish a few tasks in order to get the results we wanted:
- Run OCR on an image to get recognized text and corresponding positions
- Feed the recognized text into a translation engine to get the target language result
- Superimpose the result onto the original image, matching the text's original positions
How we built it
For OCR, we used tesserocr, a Python wrapper for the Tesseract OCR engine. Efforts here mainly focused on storing each piece of recognized text in a tuple along with its position in the image, then passing a list of these tuples on for further processing in other parts of the program.
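As a rough illustration of that step (a sketch, not our exact code), the tesserocr pattern looks something like the following; `RIL.TEXTLINE` groups results by line, and the box dictionaries carry the pixel coordinates we keep alongside each string. The `lang="jpn"` argument assumes the Japanese traineddata is installed.

```python
from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

def extract_text_regions(image: Image.Image, lang: str = "jpn"):
    """Return a list of (text, box) tuples, where box has x/y/w/h pixel keys."""
    regions = []
    with PyTessBaseAPI(lang=lang) as api:
        api.SetImage(image)
        # Bounding boxes for each detected line of text
        for _, box, _, _ in api.GetComponentImages(RIL.TEXTLINE, True):
            # Restrict recognition to this line's rectangle and read it out
            api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
            text = api.GetUTF8Text().strip()
            if text:
                regions.append((text, box))
    return regions
```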
We used the Google Cloud Translation API to automatically translate the detected text into a target language (in the Colab demo, we tested Japanese to English), then cleaned the result and passed it on to the image manipulation routine.
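A minimal sketch of that call with the `google-cloud-translate` v2 client is below; the `key.json` path is a placeholder, and part of the "cleaning" step shown here is undoing the HTML escaping the API can apply to its output.

```python
import html
import os
from google.cloud import translate_v2 as translate

# The client picks up the service-account key from this environment variable;
# "key.json" is just a placeholder path for illustration.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "key.json")

def translate_texts(texts, target="en"):
    """Translate a list of strings and return the cleaned translations."""
    client = translate.Client()
    results = client.translate(texts, target_language=target)
    # Each result dict has 'translatedText' (plus the detected source language);
    # unescape entities such as &#39; that the API may return
    return [html.unescape(r["translatedText"]) for r in results]
```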
Images were processed using Pillow: we first drew a rectangle of the region's average color over each text area reported by the OCR step, serving as a clean background; then the translated text was drawn over it and centered so that the overall layout of the original image was preserved.
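A condensed sketch of that routine (default font, RGB image assumed) is shown below; `ImageStat` gives the per-band average for the background patch, and `textbbox` is used to center the replacement string.

```python
from PIL import Image, ImageDraw, ImageFont, ImageStat

def overlay_translation(image: Image.Image, text: str, box: dict, font=None) -> None:
    """Cover one OCR box with its average color, then draw centered text on top."""
    font = font or ImageFont.load_default()
    x, y, w, h = box["x"], box["y"], box["w"], box["h"]
    # Average color of the region serves as a clean background (RGB image assumed)
    avg = tuple(int(c) for c in ImageStat.Stat(image.crop((x, y, x + w, y + h))).mean[:3])
    draw = ImageDraw.Draw(image)
    draw.rectangle((x, y, x + w, y + h), fill=avg)
    # Center the translated text inside the rectangle
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    tw, th = right - left, bottom - top
    draw.text((x + (w - tw) / 2, y + (h - th) / 2), text, fill="black", font=font)
```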
For video processing, we split incoming mp4s into their video frames and processed each one as an individual image, then recombined them using the OpenCV video writer.
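The frame loop is roughly as follows; `process_frame` stands in for the OCR/translate/overlay steps above, and `mp4v` is one codec choice that commonly works for .mp4 output with OpenCV.

```python
import cv2

def process_video(in_path: str, out_path: str, process_frame) -> None:
    """Run process_frame (the OCR/translate/overlay step) on every frame of a video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(process_frame(frame))
    cap.release()
    writer.release()
```

OpenCV frames come back as BGR NumPy arrays, so converting to and from Pillow images (e.g. with `cv2.cvtColor` and `COLOR_BGR2RGB`) is needed before reusing the image pipeline on each frame.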
Challenges we ran into
We attempted to build a basic front end for the project, but due to time constraints we ran into unresolved issues with Flask's file-path routing and with serving media to the browser. However, most of it still works, including the POST request handling, so feel free to look at it in the overall GitHub repository.
Recognition accuracy was very unreliable at first, until we added some image preprocessing using built-in Pillow functions and locally equalized the image, which made it easier for Tesseract to pick out the text.
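We haven't reproduced the exact preprocessing here, but a simplified Pillow-only version in the same spirit (grayscale plus histogram equalization; the real pass equalized locally rather than globally, and the sharpen step is an extra assumption) looks like this:

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """Simplified cleanup pass before handing an image to Tesseract."""
    gray = ImageOps.grayscale(image)
    # Equalizing the histogram boosts contrast between text and background;
    # the actual pipeline applied equalization locally rather than over the whole image
    equalized = ImageOps.equalize(gray)
    # A light sharpen can help Tesseract separate thin strokes
    return equalized.filter(ImageFilter.SHARPEN)
```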
There were initial hurdles in cleaning the text before translation, as well as some difficulty working through the API documentation to properly set up the private key and authentication through Google.
Centering the text inside the defined box was also finicky along the way, but we had it solved by the end, along with an efficient average-color calculation.
What we learned
Overall, we gained a stronger grasp of the relevant Python libraries and image-manipulation techniques, since most of our previous background from schoolwork lay in C-like languages. Collaborating as a team under the 24-hour time limit helped us develop cooperative skills, and being able to meet and network with other people at the hackathon was an added benefit.
What's next for babbler
Ideally, if work on the project were extended, babbler could be implemented as a mobile app. It would probably hook into the screen-reader functions exposed through the accessibility features on Android/iOS, grabbing screenshots that way and temporarily overlaying the translated versions on the screen.
Translation quality on videos could also be improved with further programming: regions of text that stay similar across frames could pull from a shared translation that is called once, instead of re-running translation on every frame (which currently can produce rapidly moving or changing text in the output).
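One simple way to get there would be memoizing translations on the recognized string (or a normalized form of it), so the API is only called the first time a given piece of text appears; a rough sketch, assuming the same v2 client as above:

```python
from functools import lru_cache
from google.cloud import translate_v2 as translate

_client = translate.Client()

@lru_cache(maxsize=None)
def translate_cached(text: str, target: str = "en") -> str:
    """Call the translation API only the first time a given string is seen."""
    return _client.translate(text, target_language=target)["translatedText"]
```

Since OCR output can jitter slightly between frames, a fuzzier key (for example, comparing box positions or edit distance against recently seen strings) would probably be needed in practice.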
It could also be useful to implement a keyword system if we dug deeper into the translation pipeline, where users could specify source-language words that should always translate to a fixed result when they know it would be more accurate in context, while still preserving the intended grammatical structure.