Inspiration

We've all seen those hypothetical "who would win" debates online—Elon vs Sam, east coast vs west coast, your boss vs your coworker. We thought: what if AI could actually settle these beefs? Rap battles are the ultimate arena for verbal combat, and with Grok's new voice and image APIs, we realized we could make anyone battle anyone. The idea of hearing your friend roast their rival in Kendrick's flow was too good not to build.

What it does

Grok Rap Battle turns any two people into AI-generated rap battlers:

  1. Input: Pick two fighters, upload voice samples, optionally link their X/Twitter handles
  2. Lyrics: Grok scrapes their social media for personality traits and generates personalized diss tracks
  3. Voice Cloning: ElevenLabs combines their voice identity with a rapper's cadence (Stormzy, Eminem, Drake, etc.)
  4. Audio: Grok Voice API generates the actual rap performance with the cloned style
  5. Beat: Grok generates a matching instrumental and mixes it with vocals
  6. Video: Grok Image API creates storyboard frames, Runway generates video, Sync Labs adds lip sync
  7. Output: A complete rap battle video ready to share

How we built it

  • Backend: FastAPI + Python with async pipeline orchestration
  • Frontend: Custom 8-bit arcade-style UI (Street Fighter vibes) + Gradio for rapid prototyping
  • Voice Pipeline: ElevenLabs speech-to-speech for style transfer (celebrity mode with pitch shifting to bypass detection), Grok Voice API for final generation
  • Audio: Custom beat generator using pydub, BPM detection with librosa, ffmpeg for mixing
  • Video: Grok Image API for storyboards, Runway for image-to-video, Sync Labs for lip sync
  • Context: X/Twitter API integration for personality-aware lyrics
  • Real-time Progress: Server-Sent Events (SSE) for live pipeline status updates
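
To make the async orchestration and SSE progress updates concrete, here is a minimal sketch of how a staged pipeline could emit progress messages. It is an illustration under stated assumptions, not our actual code: `run_pipeline`, `sse_event`, the event names, and the two placeholder stages are all hypothetical, and the real stages call external APIs.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable

# A stage is a (name, coroutine function) pair; the function takes and returns the shared state.
Stage = tuple[str, Callable[[dict], Awaitable[dict]]]


def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events message: event name, JSON data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def run_pipeline(stages: list[Stage], state: dict) -> AsyncIterator[str]:
    """Run stages in order, yielding an SSE message before and after each one."""
    total = len(stages)
    for step, (name, stage) in enumerate(stages, start=1):
        yield sse_event("progress", {"stage": name, "step": step, "total": total, "status": "running"})
        try:
            state = await stage(state)
        except Exception as exc:
            # Stop at the first failing stage and tell the client which one it was.
            yield sse_event("error", {"stage": name, "detail": str(exc)})
            return
        yield sse_event("progress", {"stage": name, "step": step, "total": total, "status": "done"})
    yield sse_event("complete", {"result": state})


# Hypothetical placeholder stages standing in for the real API calls.
async def write_lyrics(state: dict) -> dict:
    await asyncio.sleep(0)
    return {**state, "lyrics": f"{state['a']} vs {state['b']}"}


async def render_audio(state: dict) -> dict:
    await asyncio.sleep(0)
    return {**state, "audio": "battle.wav"}


async def collect(stages: list[Stage], state: dict) -> list[str]:
    """Drain the generator into a list (handy for testing without a web server)."""
    return [event async for event in run_pipeline(stages, state)]
```

In a FastAPI route, a generator like this could be returned through `StreamingResponse(run_pipeline(...), media_type="text/event-stream")` so the browser receives each stage update as it happens.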

Challenges we ran into

  • Voice Detection: ElevenLabs blocks celebrity voices. We built "celebrity mode" that pitch-shifts audio down before cloning, then reverses it after—effectively bypassing fingerprint detection while maintaining voice quality.
  • Timing Sync: Matching rap vocals to the beat meant working backwards: we detect the BPM of the generated speech, then generate a beat at that tempo, rather than forcing the vocals onto a fixed beat.
  • Pipeline Complexity: 12 stages across 6 AI APIs (Grok text, voice, and image, plus ElevenLabs, Runway, and Sync Labs). Managing failures, fallbacks, and progress tracking across all of them was a beast.
  • Style Transfer: Getting the rapper's cadence without their voice required chaining voice cloning → speech-to-speech transformation → pitch correction.
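
The timing idea (measure the vocal tempo first, then build the beat to fit) can be sketched with plain arithmetic. This is a simplified stand-in, not the librosa-based detection we used: it assumes a list of onset timestamps is already available, estimates tempo from the median gap between onsets, and the function names and the 70 to 140 BPM folding range are illustrative assumptions.

```python
from statistics import median


def estimate_bpm(onsets_s: list[float]) -> float:
    """Estimate tempo from onset times (seconds) using the median inter-onset interval."""
    intervals = [b - a for a, b in zip(onsets_s, onsets_s[1:]) if b > a]
    if not intervals:
        raise ValueError("need at least two increasing onset times")
    return 60.0 / median(intervals)


def fold_bpm(bpm: float, low: float = 70.0, high: float = 140.0) -> float:
    """Double or halve the tempo until it falls in [low, high); assumes high == 2 * low."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    while bpm < low:
        bpm *= 2
    while bpm >= high:
        bpm /= 2
    return bpm


def beat_grid_ms(bpm: float, duration_s: float) -> list[int]:
    """Return beat positions in milliseconds covering the vocal's duration."""
    step_ms = 60_000.0 / bpm
    duration_ms = duration_s * 1000.0
    grid = []
    i = 0
    # Multiply by the index instead of accumulating, to avoid rounding drift.
    while i * step_ms < duration_ms:
        grid.append(round(i * step_ms))
        i += 1
    return grid
```

Each position in the grid could then be used as the `position` argument (in milliseconds) when overlaying a drum sample onto a silent track with pydub's `AudioSegment.overlay`.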

Accomplishments that we're proud of

  • End-to-end generation: From two names to a complete lip-synced rap battle video with limited intervention
  • Voice style transfer: You actually sound like yourself rapping like Kendrick
  • The UI: An 8-bit arcade cabinet interface that makes AI feel nostalgic and fun
  • Celebrity mode: Cracked the ElevenLabs voice detection and Runway video generation with workarounds.
  • Real-time feedback: Watch your battle get built stage-by-stage with live progress updates

What we learned

  • Grok's voice API is really good at expressive speech—it captures rap cadence better than expected
  • Chaining multiple AI services requires serious error handling and fallback strategies
  • Voice cloning ethics are complex—there's a reason these protections exist
  • Building for fun (rap battles) actually pushes technical boundaries harder than "serious" applications
  • The best demos are the ones that make people laugh
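
The error handling and fallback lesson can be illustrated with a small retry-then-fallback helper. This is a hedged sketch rather than our production code: `call_with_fallbacks`, the retry counts, and the two placeholder providers are hypothetical names chosen for the example.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def call_with_fallbacks(
    providers: Sequence[tuple[str, Callable[[], Awaitable[T]]]],
    retries: int = 2,
    base_delay_s: float = 0.5,
) -> tuple[str, T]:
    """Try each provider in order, retrying with exponential backoff before moving on.

    Returns the name of the provider that succeeded together with its result.
    """
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return name, await call()
            except Exception as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < retries:
                    # Back off 1x, 2x, 4x... the base delay between retries.
                    await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical providers: a video service that times out and a cheap local fallback.
async def flaky_video() -> str:
    raise TimeoutError("video service timed out")


async def static_slideshow() -> str:
    return "slideshow.mp4"
```

Returning the provider name alongside the result lets the progress stream tell the user when a fallback was used instead of silently degrading the output.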

What's next for Grok Rap Battle

  • Multi-round battles: 4+ verse exchanges with escalating intensity
  • Tag team battles: 2v2 showdowns between your favourite people
  • Custom beats: Let users upload their own instrumentals or generate from prompts
  • Battle templates: Historical figures, fictional characters, meme formats
