With recent advances in language modeling, particularly GPT-3, it is now possible to accomplish many tasks zero-shot using carefully constructed prompts, without any specialized training data. For example, Huang et al. recently explored the capabilities of GPT-3 in task planning (https://arxiv.org/pdf/2201.07207.pdf).

Intelligent Language Interface

To best utilize GPT-3, we propose a language-based model of computation whose core operations are queries to GPT-3, and whose inputs and outputs are managed by our specialized drivers (which rely heavily on pretrained vision and other models) to convert any medium (e.g., images) to text, and text back to any medium (e.g., speech). This contrasts with traditional, logic-based approaches and more closely resembles human behavior.
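To make this concrete, here is a minimal Python sketch of a single compute step under this model. It assumes the openai client library; the driver classes, engine name, and prompt handling are illustrative placeholders rather than our actual implementation.

```python
# Minimal sketch of the language-based computation model.
# Assumes openai.api_key is configured; driver classes are hypothetical.
import openai

class InputDriver:
    """Converts some medium (image, audio, ...) into text."""
    def read(self) -> str:
        raise NotImplementedError

class OutputDriver:
    """Converts text back into some medium (speech, display, ...)."""
    def write(self, text: str) -> None:
        raise NotImplementedError

def step(drivers_in, drivers_out, logic_prompt: str) -> None:
    # 1. Input drivers turn the current state of the environment into text.
    observations = "\n".join(d.read() for d in drivers_in)
    # 2. The core operation: a single GPT-3 query over prompt + observations.
    response = openai.Completion.create(
        engine="text-davinci-002",          # assumed GPT-3 engine name
        prompt=logic_prompt + "\n" + observations,
        max_tokens=128,
    )
    action_text = response.choices[0].text.strip()
    # 3. Output drivers turn the model's text back into media/actions.
    for d in drivers_out:
        d.write(action_text)
```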

Applications

We envision our project can be applied to many scenarios, such as action detection, security, conversation, and task planning.

N programming language

Our model is supported by our novel, very high-level N (natural) programming language, which the user writes to program the interface. An N program consists of two parts: a logic module, written in natural language, that dictates how the pretrained model interacts with the environment based on driver input; and a setup/loop module that dictates the driver-to-driver and driver-to-language-model connections.
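The concrete N syntax is not shown here, so the following hypothetical sketch approximates the two modules in Python: the logic module is literally a natural-language string, while the setup/loop module wires drivers to the language model. The driver classes are placeholders, and step() refers to the sketch above.

```python
# Hypothetical N program, approximated in Python.
# Logic module: plain natural language describing the desired behavior.
LOGIC = """
You are watching a room through a camera.
If a person falls down, say "Are you okay?" out loud.
Otherwise, stay silent.
"""

def setup():
    # Setup module: driver-to-driver and driver-to-model connections
    # (VideoCaptioningDriver and TextToSpeechDriver are placeholders).
    camera = VideoCaptioningDriver()
    speaker = TextToSpeechDriver()
    return [camera], [speaker]

def loop(drivers_in, drivers_out):
    # Loop module: repeatedly route driver text through the language model.
    while True:
        step(drivers_in, drivers_out, LOGIC)
```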

Drivers

We include the following drivers:

  1. Video captioning converts video to text by generating captions.
  2. Multilingual speech-to-text transcription converts audio to text (a sketch follows this list).
  3. Multilingual text-to-speech lets the system speak back to the environment.
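As one example, here is a minimal sketch of the speech-to-text driver, assuming the SpeechRecognition package and its Google Web Speech backend; our actual service may differ. InputDriver refers to the interface sketched earlier.

```python
# Minimal speech-to-text driver sketch using the SpeechRecognition package.
import speech_recognition as sr

class SpeechToTextDriver(InputDriver):
    def __init__(self, language: str = "en-US"):
        self.recognizer = sr.Recognizer()
        self.language = language            # multilingual: pass e.g. "fr-FR"

    def read(self) -> str:
        # Capture one utterance from the microphone.
        with sr.Microphone() as source:
            audio = self.recognizer.listen(source)
        # Google Web Speech API backend; any multilingual service works here.
        return self.recognizer.recognize_google(audio, language=self.language)
```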

Video captioning algorithm

We propose a novel video captioning algorithm based on a Bayesian factorization. Given a sequence of frames f_1, ..., f_n, the probability that the frames correspond to a sequence of captions t_1, ..., t_n factorizes (by the chain rule, assuming each caption depends on earlier frames only through the earlier captions) as

P(t_1, ..., t_n | f_1, ..., f_n) \propto \prod_i P(t_i | t_{i-1}, ..., t_1) P(f_i | f_{i-1}, ..., f_1, t_i, ..., t_1)

Observing that the first term is exactly language modeling, we propose the following method to generate captions for a continuous video.

  1. Generate a starting caption for the first frame using any image captioning algorithm, e.g., combining OpenAI's state-of-the-art CLIP with any standard image captioning model.
  2. For each subsequent frame, calculate its image embedding. If the embedding drifts away from the current caption, propose new candidate captions by querying GPT-3.
  3. Pick the best candidate caption using the embeddings (see the sketch after this list).
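Below is a Python sketch of steps 2 and 3, assuming the open-source CLIP package and the openai client; the drift threshold, prompt, engine name, and candidate count are illustrative choices rather than tuned values.

```python
# Sketch of the caption-drift loop (steps 2-3).
import clip
import openai
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image, texts):
    """Cosine similarity between one frame and each candidate caption."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = model.encode_text(clip.tokenize(texts).to(device))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).squeeze(0)       # shape: (len(texts),)

def caption_video(frames, first_caption, drift_threshold=0.2, n_candidates=5):
    captions = [first_caption]
    for frame in frames[1:]:
        # Step 2: detect drift between the frame and the current caption.
        if clip_score(frame, [captions[-1]]).item() >= drift_threshold:
            continue                          # caption still fits this frame
        # Query GPT-3 for candidate next captions given the story so far.
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt="Continue this story with one short sentence:\n"
                   + " ".join(captions),
            n=n_candidates,
            max_tokens=30,
        )
        candidates = [c.text.strip() for c in response.choices]
        # Step 3: keep the candidate whose embedding best matches the frame.
        scores = clip_score(frame, candidates)
        captions.append(candidates[scores.argmax().item()])
    return captions
```

One design note: thresholding CLIP similarity keeps GPT-3 queries rare, since the language model is only consulted when the scene has visibly changed.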
