About

PodTextify is the world's first podcast-to-database program. The world is running out of high-accuracy, human-written text to train LLMs on. With PodTextify, you can pass in a podcast name and download an arbitrary* number of episodes from a scraped iTunes listing directory. The program then converts the downloads to text files, cleans and grammar-checks them, and outputs them as a model-dataset Parquet file.

*iTunes limits search requests to a maximum of 500 results.

Inspiration

PodTextify was inspired by the growing need for high-quality, human-written text to train Large Language Models (LLMs). The scarcity of such text, now that most of the public internet has already been scraped, coupled with the increasing demand for AI-driven solutions, led to the development of PodTextify. This project aims to bridge the gap between the valuable human conversation found in podcasts and AI training datasets, leveraging the vast number of podcast hours available on platforms like iTunes.

How we built it

The development of PodTextify involved several steps:

The program was designed to first scrape iTunes for podcast episodes based on the provided podcast name. This involved using web scraping techniques to extract episode metadata and download links. From there, we could simply download each episode.
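As a rough sketch of how such a lookup can work (this uses the public iTunes Search API and standard-library RSS parsing; the function names are illustrative, not PodTextify's actual code):

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def find_feed_url(podcast_name: str) -> str:
    """Look up a podcast's RSS feed URL via the iTunes Search API."""
    query = urllib.parse.urlencode(
        {"term": podcast_name, "media": "podcast", "limit": 1}
    )
    with urllib.request.urlopen(f"https://itunes.apple.com/search?{query}") as resp:
        results = json.load(resp)["results"]
    if not results:
        raise ValueError(f"No podcast found for {podcast_name!r}")
    return results[0]["feedUrl"]

def episode_urls(feed_url: str) -> list[str]:
    """Extract episode audio URLs from the feed's <enclosure> tags."""
    with urllib.request.urlopen(feed_url) as resp:
        root = ET.parse(resp).getroot()
    return [enc.get("url") for enc in root.iter("enclosure") if enc.get("url")]
```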

However, the downloaded content was not always in the format we wanted. To solve this, we used ffmpeg and other libraries to convert the audio into 16 kHz .wav files.
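A minimal sketch of that conversion, assuming ffmpeg is on PATH (whisper.cpp expects 16 kHz mono 16-bit PCM, which the flags below produce):

```python
import subprocess
from pathlib import Path

def to_wav_16khz(src: Path, dst: Path) -> None:
    """Convert any ffmpeg-readable audio file to 16 kHz mono 16-bit PCM WAV."""
    subprocess.run(
        ["ffmpeg", "-y",          # overwrite output if it already exists
         "-i", str(src),          # input file (mp3, m4a, ...)
         "-ar", "16000",          # resample to 16 kHz
         "-ac", "1",              # downmix to mono
         "-c:a", "pcm_s16le",     # 16-bit PCM samples
         str(dst)],
        check=True,
    )
```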

Once the episodes were downloaded and converted, we used whisper.cpp and OpenVINO to transcribe the .wav audio files into text. This was one of the hardest parts, as it required managing large, complex speech models and low-level inference on local hardware.
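One way to drive this step is to shell out to the whisper.cpp CLI, sketched below; the binary name (`whisper-cli`, called `main` in older builds) and the model path are assumptions about the local build, and an OpenVINO-enabled build is invoked the same way:

```python
import subprocess
from pathlib import Path

def transcribe(wav: Path, model: Path = Path("models/ggml-base.en.bin")) -> str:
    """Transcribe a 16 kHz WAV with the whisper.cpp CLI and return the text."""
    out_prefix = wav.with_suffix("")          # output path minus extension
    subprocess.run(
        ["whisper-cli",
         "-m", str(model),        # path to a ggml model file (assumed location)
         "-f", str(wav),          # input audio
         "-otxt",                 # write a plain-text transcript
         "-of", str(out_prefix)], # output prefix -> <prefix>.txt
        check=True,
    )
    return out_prefix.with_suffix(".txt").read_text(encoding="utf-8")
```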

The transcribed text was then cleaned and checked for grammatical errors to ensure the quality of the dataset. This step was crucial for maintaining the high accuracy required for training LLMs, as we found that the raw transcripts contained spelling and grammar errors.
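The write-up does not name the exact checker used; as one plausible approach, the `language_tool_python` package can apply LanguageTool's suggested corrections automatically:

```python
import language_tool_python

def clean_transcript(text: str) -> str:
    """Normalize whitespace, then apply LanguageTool's suggested corrections."""
    text = " ".join(text.split())  # collapse stray newlines and double spaces
    tool = language_tool_python.LanguageTool("en-US")
    try:
        return tool.correct(text)  # applies the top suggestion for each match
    finally:
        tool.close()
```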

Finally, the cleaned and checked text was compiled into a Parquet dataset file, a format chosen for its compatibility with machine learning frameworks.
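As a minimal sketch of this last step (assuming pandas with a pyarrow backend; the record layout is illustrative, not PodTextify's actual schema):

```python
import pandas as pd

def write_dataset(records: list[dict], path: str = "dataset.parquet") -> None:
    """Write cleaned transcripts to a Parquet file for ML training pipelines.

    Each record is assumed to look like:
    {"podcast": "...", "episode": "...", "text": "..."}
    """
    df = pd.DataFrame(records)
    df.to_parquet(path, index=False)  # requires pyarrow or fastparquet
```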

Challenges we ran into

Some of the challenges encountered during the development of PodTextify included:

Accuracy of Audio Transcription: Ensuring the accuracy of the transcribed text was a significant challenge. The program had to be robust enough to handle various accents, speech rates, and background noise.

Quality of Scraped Content: The quality and availability of podcast episodes on iTunes varied significantly, posing challenges in maintaining a consistent dataset.

Efficiency and Scalability: As the program was designed to process an arbitrary number of episodes, ensuring efficient processing and scalability was crucial.

Accomplishments that we're proud of

Automation of Podcast to Text Conversion: The ability to automate the process of converting podcast episodes into a dataset for LLM training is a significant accomplishment. This automation not only saves time but also ensures the consistency and quality of the dataset.

Enhanced AI Training: By providing a high-quality dataset, PodTextify contributes to the advancement of AI technologies, particularly in the field of natural language processing.

General Program Standards for Python: By writing the program almost entirely in Python, we learned a great deal about how the language works. We also had to learn several libraries to implement features such as network access and file management, and we added numerous safety checks and input sanitization for the user's benefit.

Future plans

Expanding Support for Additional Podcast Platforms: Since Apple could shut off the iTunes endpoint at any time, the program could be extended to support other podcast and audio platforms, increasing the diversity and volume of the dataset.

Built With

ffmpeg, openvino, python, whisper.cpp