Inspiration

We’ve all tried to record something important—maybe a presentation, a class project, a podcast intro, or even a job application—and ended up cringing when we listened back. We noticed that it’s rarely the content that ruins a recording. It’s the small things: the “uhh,” “like,” half-sentences, awkward pauses while thinking, keyboard noise in the background, air conditioner hum, and the fact that we don’t sound as confident as we meant to. Editing that stuff out manually is painful. Re-recording over and over is frustrating. We wanted something that fixes the real problem: how we sound, not what we say. That was the moment CleanSpeak was born.

What it does

CleanSpeak takes a voice recording and turns it into something that sounds intentional. It removes filler words, trims long awkward silences, reduces background noise, and even boosts your voice so you sound more confident and consistent. It also creates two transcripts: one exactly as spoken and one cleaned up to match the edited audio. There are no sliders, filters, presets, or confusing controls—just upload a file and get back something that feels like you nailed the recording on the first try.

How we built it

The entire project is stitched together like a little assembly line of audio fixes. We start by converting the input (MP3, WAV, etc.) into a clean 16 kHz WAV using FFmpeg. Then we pass that through Whisper to get word-level timestamps, not just raw text. Those timestamps are the secret: they let us identify exactly when filler words happen and when silence stretches too long. Our edit engine in Java (using TarsosDSP) then removes only those tiny disfluencies and trims dead air. The tricky part is that audio can’t just be chopped like text, so every cut is re-blended with a crossfade to avoid robotic jumps. Finally, we run a simple RMS voice leveler that boosts quiet sections and tones down overly loud peaks so the final audio sounds balanced and present. All of that happens automatically without any configuration.
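To make the crossfade step concrete, here is a minimal sketch in plain Java of removing one region and blending the joint. It is not our actual TarsosDSP code: the class name CrossfadeCut, the method removeRegion, and the choice of a linear fade are all assumptions made for illustration. The last few samples before the cut fade out while the first few samples after the cut fade in, and the two are overlapped.

```java
// Hypothetical helper, not part of TarsosDSP or the CleanSpeak source.
final class CrossfadeCut {
    private CrossfadeCut() {}

    /**
     * Removes samples[cutStart, cutEnd) and overlaps the audio on both sides
     * of the cut with a linear crossfade of up to fadeLen samples.
     * The result is shorter than the input by (cutEnd - cutStart) + fade.
     */
    static float[] removeRegion(float[] samples, int cutStart, int cutEnd, int fadeLen) {
        if (cutStart < 0 || cutEnd > samples.length || cutStart > cutEnd) {
            throw new IllegalArgumentException("invalid cut range");
        }
        // Shrink the fade if there is not enough audio on either side of the cut.
        int fade = Math.max(0, Math.min(fadeLen, Math.min(cutStart, samples.length - cutEnd)));
        int leftLen = cutStart - fade;     // audio kept untouched before the fade
        int rightStart = cutEnd + fade;    // audio kept untouched after the fade
        float[] out = new float[leftLen + fade + (samples.length - rightStart)];

        System.arraycopy(samples, 0, out, 0, leftLen);
        for (int i = 0; i < fade; i++) {
            // t rises from just above 0 to just below 1 across the fade.
            float t = (i + 1) / (float) (fade + 1);
            float fadingOut = samples[leftLen + i]; // tail of the audio before the cut
            float fadingIn = samples[cutEnd + i];   // head of the audio after the cut
            out[leftLen + i] = fadingOut * (1f - t) + fadingIn * t;
        }
        System.arraycopy(samples, rightStart, out, leftLen + fade, samples.length - rightStart);
        return out;
    }
}
```

A linear fade is the simplest choice; an equal-power curve may sound smoother when the material on the two sides of the cut is very different. With fadeLen set to 0 the method degrades to a hard cut, which is the case that tends to produce audible clicks.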

Challenges we ran into

What surprised us most was that “just remove ums” is a lot harder than it sounds. Sometimes “um” connects two ideas, sometimes people breathe while saying it, and sometimes Whisper timestamps spill into surrounding words. We also ran into cases where cutting silence made people sound too fast, almost like they were in a rush, so we had to shorten pauses without destroying natural pacing. On the technical side, Whisper reports timestamps in seconds attached to words in the transcript, not as positions in the audio buffer, so we had to carefully translate them to exact sample indices. We had moments where tiny math mistakes created loud pops or cut off the ends of words. There was a lot of “why does this sound terrible now?” followed by debugging milliseconds we didn’t think would matter. They mattered.
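The two technical points above, translating timestamps into sample positions and shortening pauses without removing them, can be sketched in a few lines of Java. This is an illustration under stated assumptions, not our production code: the class PauseTrimmer, its method names, and the idea of keeping the remaining silence split evenly on both sides of the cut are hypothetical, and the maximum pause length is a parameter the caller has to pick.

```java
// Hypothetical helper, not part of TarsosDSP, Whisper, or the CleanSpeak source.
final class PauseTrimmer {
    private PauseTrimmer() {}

    /** Converts a timestamp in seconds to the nearest sample index. */
    static int toSampleIndex(double seconds, int sampleRate) {
        return Math.toIntExact(Math.round(seconds * sampleRate));
    }

    /**
     * Given the end of one word and the start of the next (both in seconds),
     * returns {cutStart, cutEnd} in samples so that removing that range
     * leaves a pause of maxPauseSec. Returns an empty array when the pause
     * is already short enough and nothing should be cut.
     */
    static int[] cutForPause(double prevWordEndSec, double nextWordStartSec,
                             double maxPauseSec, int sampleRate) {
        double gap = nextWordStartSec - prevWordEndSec;
        if (gap <= maxPauseSec) {
            return new int[0];
        }
        // Keep half of the allowed pause next to each word so neither
        // word ending nor word onset is clipped by the cut.
        double keepEachSide = maxPauseSec / 2.0;
        int cutStart = toSampleIndex(prevWordEndSec + keepEachSide, sampleRate);
        int cutEnd = toSampleIndex(nextWordStartSec - keepEachSide, sampleRate);
        return new int[] {cutStart, cutEnd};
    }
}
```

For example, at 16 kHz a word ending at 1.20 s followed by a word starting at 2.70 s leaves a 1.5 s gap. With an assumed maximum pause of 0.5 s, cutForPause(1.20, 2.70, 0.5, 16000) returns the range 23200 to 39200, which removes 16,000 samples (1.0 s) and keeps 0.25 s of silence on each side.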

Accomplishments that we're proud of

We’re proud that CleanSpeak sounds like a real editor touched it, not a filter slapped on top. The combination of precise trimming and crossfades makes a huge difference. We also like that the tool quietly fixes confidence issues without changing anyone’s voice or personality. Another thing we’re proud of is the clean transcript output, which is surprisingly useful for captioning, study notes, and podcast descriptions. And maybe our favorite part: it’s simple. We didn’t turn it into a giant app full of knobs and switches. It just does the job.

What we learned

We learned that audio editing is part engineering, part psychology. You can’t just strip everything “messy,” because a little hesitation sometimes sounds human. We learned that timestamps aren’t always right, noise reduction can’t fix every problem, and human speech is full of nuance that tools need to respect. Most importantly, we learned the value of subtle improvements. CleanSpeak doesn’t wow you with effects—it just quietly removes the things that distract from what you meant to say. And that’s the whole point.

What's next for CleanSpeak

There are a lot of directions this could go. A real-time version for online meetings would help students during presentations or job interviews. Podcasters could use presets to keep emotional pauses but remove distracting ones. A small waveform editor showing which words were removed would let creators fine-tune what gets cut. We’d also like to explore multilingual support and emotional awareness, so the tool can learn when hesitation is part of storytelling instead of a mistake. CleanSpeak started as an idea to “fix messy recordings,” but it could become a tool that helps people express themselves more clearly, professionally, and confidently—without learning audio engineering.
