Inspiration

Otter.ai, Microsoft Teams, and Dialpad. What do they have in common? They are all great options for voice-to-text transcription. But what about visual-to-text transcription? There is an abundance of speech-to-text transcription services, but almost none for visual-to-text. This is where MotionCue comes in.

MotionCue imagines a new form of transcription, visual-to-text, by leveraging pose detection, ML, and NLP.

We were inspired by the QHacks 2023 theme, "Designing the Digital World", and hoped to combine our interests in AI and data wrangling to make purely visual information accessible, especially digital media such as a choreographed video, to people who are visually impaired and would appreciate text descriptions of the physical movements occurring in the video. A text description of visual content can also help sighted readers by giving them another option for processing visual information. MotionCue brings digital accessibility to the performing arts: we contextualize choreography by digitalizing body movements in videos, so that we can provide users with a comprehensive text guide on how to perform them.

As background, human pose estimation has been studied for over 15 years, and its importance stems from the range of benefits this technology offers, from character animation to clinical analysis of biomechanics [1].

Here's a brief summary of other useful applications of pose detection:

  1. Translating ASL (American Sign Language)
  2. Biomechanics and Medicine: Human pose and movement can indicate a person's health status
  3. Sports Performance Analysis and Education: Automatically extracting athletes' poses from videos enables deeper analysis of their performance and provides immediate feedback for improvement.

What it does

MotionCue parses a YouTube URL and returns a visual description in textual format. The application currently focuses on short dance videos. It digitizes the body movements in dance videos from YouTube (such as TikTok-style clips and YouTube Shorts), classifies the movements and dance steps, and provides users with a detailed textual guide on how to perform them!

How we built it

Architecture Pipeline for URL to Visual Transcription Data:

  1. Use a YouTube frame extractor to sample images every second.
  2. Process each frame to extract its pose landmark data using a pre-trained pose detection model (e.g., MoveNet, PoseNet, or MediaPipe Pose).
  3. Feed each frame's landmark data into the random forest model to classify the pose (a minimal sketch of steps 1-3 follows this list).
  4. Output to the frontend. Display the text description in real time as the video plays, and also query OpenAI (ChatGPT) for additional information based on the labels from our pose classification model.
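To make steps 1-3 concrete, here is a minimal sketch that assumes the frame extractor has already saved the clip locally and uses MediaPipe Pose (one of the pose tools linked under "Training our model"). The function name, sampling rate, and file name are illustrative, not our exact implementation.

```python
# Minimal sketch of pipeline steps 1-3; names and paths are illustrative.
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose

def sample_landmarks(video_path: str, samples_per_second: int = 1) -> np.ndarray:
    """Sample roughly one frame per second and return one 2D-landmark row per frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(fps // samples_per_second))
    rows = []
    with mp_pose.Pose(static_image_mode=True) as pose:
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if result.pose_landmarks:
                    # Flatten the (x, y) coordinates of the 33 landmarks into one feature row.
                    rows.append(np.array(
                        [[lm.x, lm.y] for lm in result.pose_landmarks.landmark]
                    ).flatten())
            idx += 1
    cap.release()
    return np.array(rows)

# Step 3: these flattened landmark rows are what the random forest classifies, e.g.
# labels = rf_model.predict(sample_landmarks("dance_clip.mp4"))
```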

Building/training the model that generates the Visual Transcription Data:

First, we trained a quick random forest model on two poses, using images we took of ourselves recreating them. Then we found a YouTube frame extractor and used it to sample images every second from TikTok-style videos of the popular "green green grass blue blue sky" dance. We processed each image on the backend to extract pose landmarks, which are 2D positional data points that identify where specific body parts are in the image. We then ran k-means clustering on each image's landmark data to identify the best key frames and their corresponding pose names. After identifying the poses we wanted to classify, we trained our random forest model to classify them. :)
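A rough sketch of that training flow is below, assuming the per-frame landmark matrix has already been saved to disk; the file name, cluster count, and label handling are placeholders rather than our exact values.

```python
# Rough sketch of the training flow; file name, cluster count, and labels are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file of flattened per-frame landmark features.
landmarks = np.load("green_green_grass_landmarks.npy")

# K-means groups similar frames; the frame closest to each centroid serves
# as a "key frame" candidate for a distinct pose.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(landmarks)
key_frames = [
    int(np.argmin(np.linalg.norm(landmarks - center, axis=1)))
    for center in kmeans.cluster_centers_
]
print("Candidate key frames:", key_frames)

# After manually naming the pose behind each cluster, the cluster assignments
# become the labels used to train a small random forest classifier.
pose_labels = kmeans.labels_  # in practice these map to human-readable pose names
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(landmarks, pose_labels)
```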

Challenges we ran into

  • Data collection: because of the originality of our idea, we decided to collect and label our own dance dataset from YouTube, which consumed a lot of time in the early and middle phases of the project. The team overcame this through tight coordination and hard work.

Accomplishments that we're proud of

  • Rethinking transcription with a new "visual-to-text" form to help promote digital accessibility.
  • Building out a fully integrated web application using FastAPI (Python) and React (a minimal endpoint sketch follows this list).
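For context, here is a stripped-down sketch of how a FastAPI route could expose the pipeline to the React frontend. The route name, request model, and helper stubs are illustrative, not our exact implementation.

```python
# Illustrative FastAPI sketch; route name and helpers are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TranscribeRequest(BaseModel):
    url: str  # YouTube URL sent from the React client

def download_frames(url: str) -> list:
    # Placeholder: in the real app this calls the YouTube frame extractor.
    return []

def classify_poses(frames: list) -> list:
    # Placeholder: in the real app this runs landmark extraction plus the
    # random forest classifier on each sampled frame.
    return ["pose_1", "pose_2"]

@app.post("/transcribe")
def transcribe(req: TranscribeRequest):
    frames = download_frames(req.url)
    poses = classify_poses(frames)
    # The frontend pairs these labels with the playing video and with the
    # extra descriptions fetched from OpenAI.
    return {"url": req.url, "poses": poses}
```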

What we learned

  • Curating and collecting good data is hard.
  • Start with something really simple and build on top of that.
  • Blockchain is cute.

Training our model

  • https://opencv.org/
  • https://scikit-learn.org/stable/modules/ensemble.html#forest
  • https://google.github.io/mediapipe/solutions/pose

Inspecting our data

  • TensorFlow pre-trained models: https://github.com/tensorflow/tfjs-models

Displaying to our UI

  • https://github.com/u-wave/react-youtube

What's next for MotionCue

  • Training our model to identify more genres of visual videos beyond short dance clips.

Links:
[1] https://www.cs.ubc.ca/~lsigal/Publications/SigalEncyclopediaCVdraft.pdf
[2] https://www.sciencedirect.com/science/article/pii/S1077314221000692
