Inspiration

Otter.ai, Microsoft Teams, and Dialpad. What do they have in common? They are all great options for voice-to-text transcription. But what about visual-to-text transcription? There is an abundance of speech-to-text transcription services, but almost none for visual-to-text. This is where MotionCue comes in.

MotionCue imagines a new form of transcription, visual-to-text, by leveraging pose detection, ML, and NLP.

We were inspired by the QHacks 2023 theme, "Designing the Digital World", and hoped to combine our interests in AI and data wrangling to make purely visual information accessible, especially digital media such as a choreographed video, to people who are visually impaired and would appreciate text descriptions of the physical movements occurring in the video. A text description of visual content can also help sighted readers by giving them another option for processing visual information. MotionCue brings digital accessibility to the performing arts: we contextualize choreography by digitalizing body movements in videos, so that we can provide users with a comprehensive text guide on how to perform them.

As background, human pose estimation has been studied for over 15 years, and its importance stems from the range of benefits this technology offers, from character animation to clinical analysis of biomechanics [1].

Here's a brief summary of other useful applications of pose detection:

  1. Translating ASL (American Sign Language)
  2. Biomechanics and Medicine: Human pose and movement can indicate a person's health status
  3. Sports Performance Analysis and Education: Automatically extracting athletes' poses from videos enables deeper analysis of their performance and provides immediate feedback for improvement.

What it does

MotionCue parses a YouTube URL and returns a visual description in textual format. The application currently focuses on short dance videos. It digitizes the body movements in dance videos from YouTube (such as TikTok-style clips and YouTube Shorts), classifies the movements and dance steps, and provides users with a detailed textual guide on how to perform them!

How we built it

Architecture Pipeline for URL to Visual Transcription Data:

  1. Use a YouTube frame extractor to sample images every second.
  2. Process each frame to extract its pose landmark data using a pre-trained pose detection model (e.g., MoveNet, PoseNet, or MediaPipe Pose).
  3. Feed each frame's landmark data into the random forest model to classify the pose (a minimal sketch of steps 1-3 follows this list).
  4. Output to the frontend. Display the text description in real time as the video plays, and also query OpenAI (ChatGPT) for additional information based on the labels from our pose classification model.
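To make steps 1-3 concrete, here is a minimal sketch that assumes the frame extractor has already saved the clip locally and uses MediaPipe Pose (one of the pose tools linked under "Training our model"). The function name, sampling rate, and file name are illustrative, not our exact implementation.

```python
# Minimal sketch of pipeline steps 1-3; names and paths are illustrative.
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose

def sample_landmarks(video_path: str, samples_per_second: int = 1) -> np.ndarray:
    """Sample roughly one frame per second and return one 2D-landmark row per frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(fps // samples_per_second))
    rows = []
    with mp_pose.Pose(static_image_mode=True) as pose:
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if result.pose_landmarks:
                    # Flatten the (x, y) coordinates of the 33 landmarks into one feature row.
                    rows.append(np.array(
                        [[lm.x, lm.y] for lm in result.pose_landmarks.landmark]
                    ).flatten())
            idx += 1
    cap.release()
    return np.array(rows)

# Step 3: these flattened landmark rows are what the random forest classifies, e.g.
# labels = rf_model.predict(sample_landmarks("dance_clip.mp4"))
```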

Building/training the model that generates the Visual Transcription Data:

First, we trained a quick random forest model on two poses, using images we took of ourselves recreating them. Then we found a YouTube frame extractor and used it to sample images every second from TikTok-style videos of the popular "green green grass blue blue sky" dance. We processed each image on the backend to extract pose landmarks, which are 2D positional data points that identify where specific body parts are in the image. We then ran k-means clustering on each image's landmark data to identify the best key frames and their corresponding pose names. After identifying the poses we wanted to classify, we trained our random forest model to classify them. :)
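A rough sketch of that training flow is below, assuming the per-frame landmark matrix has already been saved to disk; the file name, cluster count, and label handling are placeholders rather than our exact values.

```python
# Rough sketch of the training flow; file name, cluster count, and labels are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file of flattened per-frame landmark features.
landmarks = np.load("green_green_grass_landmarks.npy")

# K-means groups similar frames; the frame closest to each centroid serves
# as a "key frame" candidate for a distinct pose.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(landmarks)
key_frames = [
    int(np.argmin(np.linalg.norm(landmarks - center, axis=1)))
    for center in kmeans.cluster_centers_
]
print("Candidate key frames:", key_frames)

# After manually naming the pose behind each cluster, the cluster assignments
# become the labels used to train a small random forest classifier.
pose_labels = kmeans.labels_  # in practice these map to human-readable pose names
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(landmarks, pose_labels)
```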

Challenges we ran into

  • Data collection: because of the originality of our idea, we decided to collect and label our own dance dataset from YouTube, which consumed a lot of time in the early and middle phases of the project. The team overcame this through tight coordination and hard work.

Accomplishments that we're proud of

  • Rethinking transcription with a new "visual-to-text" form to help promote digital accessibility.
  • Building out a fully integrated web application using FastAPI (Python) and React (a minimal endpoint sketch follows this list).
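For context, here is a stripped-down sketch of how a FastAPI route could expose the pipeline to the React frontend. The route name, request model, and helper stubs are illustrative, not our exact implementation.

```python
# Illustrative FastAPI sketch; route name and helpers are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TranscribeRequest(BaseModel):
    url: str  # YouTube URL sent from the React client

def download_frames(url: str) -> list:
    # Placeholder: in the real app this calls the YouTube frame extractor.
    return []

def classify_poses(frames: list) -> list:
    # Placeholder: in the real app this runs landmark extraction plus the
    # random forest classifier on each sampled frame.
    return ["pose_1", "pose_2"]

@app.post("/transcribe")
def transcribe(req: TranscribeRequest):
    frames = download_frames(req.url)
    poses = classify_poses(frames)
    # The frontend pairs these labels with the playing video and with the
    # extra descriptions fetched from OpenAI.
    return {"url": req.url, "poses": poses}
```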

What we learned

  • Curating and collecting good data is hard.
  • Start with something really simple and build on top of that.
  • Blockchain is cute.

Training our model

  • https://opencv.org/
  • https://scikit-learn.org/stable/modules/ensemble.html#forest
  • https://google.github.io/mediapipe/solutions/pose

Inspecting our data

  • TensorFlow pre-trained models: https://github.com/tensorflow/tfjs-models

Displaying to our UI

  • https://github.com/u-wave/react-youtube

What's next for MotionCue

  • Training our model to identify more genres of visual videos beyond short dance clips.

Links:
[1] https://www.cs.ubc.ca/~lsigal/Publications/SigalEncyclopediaCVdraft.pdf
[2] https://www.sciencedirect.com/science/article/pii/S1077314221000692
