ScreenBuddy | Devpost

Introduction / Problem Statement:

Before we get started with our pitch, who here has heard of Photoshop? Likely most people given it’s the most popular image editor in the world.

Who here knows how to use every single feature that Photoshop offers? Likely no one, and for a good reason. Any idea how long you think the manual for Photoshop is?

It’s not 50 or 100 or 500 pages. It’s 1017 pages.

Adobe Photoshop - World’s most popular image editor. https://helpx.adobe.com/pdf/photoshop_reference.pdf

And this isn’t an isolated case!

Adobe Premiere Pro - World’s most popular video editor has a 818 pages manual. https://helpx.adobe.com/content/dam/help/en/pdf/premiere_pro_reference.pdf

DaVinci Resolve - world’s most popular free video editor has a 1060 pages manual. https://documents.blackmagicdesign.com/UserManuals/DaVinci_Resolve_12_Reference_Manual.pdf

As inexperienced video editors who needed to edit a demo video for this hackathon, we thought there has to be a better way to learn and use the features without going through thousand page manuals and hour-long YouTube videos.

Why can’t some"buddy" just tell me exactly what to do?

Solution:

That’s the problem we decided to solve at Hack Western!

Imagine an AI companion that could not only understand your queries but also help one navigate the user interface in realtime, providing step-by-step guidance through screen sharing while articulating instructions audibly.

Essentially ChatGPT but it can help you with whatever is on your screen in realtime!

Welcome to the new age of AI collaboration - Share your vision with ScreenBuddy!

Tech Stack:

The tech stack comprises:

A powerful combination of OpenCV and GPT-4-Vision for robust image recognition capabilities.
Vector embeddings are crafted using ChromaDB and LangChain, tailored specifically for training on DaVinci Resolve and Circle documentation, enhancing understanding and context.
GPT Whisper handles speech-to-text conversion.
GPT TTS seamlessly transforms text to speech.
The user interface is facilitated by the Tkinter Python Toolkit, offering a user-friendly screen-sharing experience for effective interaction with the AI system.

This comprehensive stack creates a synergistic environment, enabling intuitive and efficient navigation through complex interfaces, whether in video editing with DaVinci Resolve or managing blockchain transactions on Circle.

Challenges we ran into:

Integrating OpenCV and GPT-4-Vision for seamless image recognition posed technical hurdles when it came to streaming visual media files.
Fine-tuning vector embeddings using ChromaDB and LangChain required iterative experimentation to achieve good results with DaVinci Resolve.
Ensuring real-time responsiveness in Tkinter Python Toolkit for effective screen sharing and speech recognition was a significant challenge.

Accomplishments that we're proud of:

Successful integration of OpenCV and GPT-4-Vision for robust image recognition capabilities.
Precision in crafting vector embeddings via ChromaDB and LangChain for tailored training on DaVinci Resolve and Circle documentation.
Seamless implementation of GPT Whisper and GPT TTS for speech-to-text and text-to-speech transformations.
Development of a user-friendly interface using Tkinter Python Toolkit for intuitive screen sharing.

What we learned:

The synergy between computer vision and natural language processing is pivotal for effective AI-assisted navigation.
The importance of iterative testing and fine-tuning in creating a reliable and user-friendly system.
Addressing real-time responsiveness challenges in UI interactions enhances overall user experience.

What's next for our project:

Implementing user feedback for continuous improvement and refinement.
Exploring additional applications beyond DaVinci Resolve and Circle for a broader user base.
Enhancing the AI's contextual understanding for even more intuitive interactions.
Collaborating with the community to expand the range of supported interfaces and functionalities.