How I Built an AI-Powered Gmail Invoice Extractor (And Why v2.0 Changed Everything)

TL;DR: Built an open-source Python tool that automatically extracts invoices and receipts from Gmail using email content classification. v2.0 ships with a major architectural shift that’s 3x faster and way more accurate. GitHub repo here.

The Problem: My Accounting Nightmare 💸

As CEO of a tech startup, I’m comfortable with most technical challenges. But there’s one task I’d been postponing for literally years: organizing our company invoices and receipts.

Here’s what made it so painful: hundreds of invoices scattered across our company Gmail account, buried in 5+ years of email history. Every quarter, my CFO would ask me to extract them for accounting. Every quarter, I’d spend 2-3 hours manually searching “invoice”, downloading attachments one by one, renaming files, organizing by date.

After processing maybe 50 emails, I’d give up and postpone it another quarter. My CFO was going crazy.

Finally, I did what any developer-CEO would do when faced with repetitive manual labor: I built a tool to automate it. And I did it using vibe coding with Claude AI - iterating rapidly until it worked exactly how I needed.

The Journey: v1.0 vs v2.0 (Spoiler: Complete Rewrite) 🔄

v1.0: The PDF Text Extraction Trap

My first version seemed logical:

  1. Fetch emails with attachments

  2. Download each PDF

  3. Extract text from PDF using pdfplumber

  4. Run NLP classification on extracted text

  5. Save if it’s an invoice

The problem? This approach was painfully slow. Processing 200 emails took 15+ minutes because:

  • PDF text extraction is expensive (especially for scanned documents)

  • Many invoices are actually scanned images with no text layer

  • I was processing attachments that clearly weren’t documents

v2.0: The Email Subject Is All You Need 💡

Then I had a realization: The email subject line usually tells you what’s in the attachment.

Think about it:

  • “Invoice #2024-001 - January Services” → definitely an invoice

  • “Your receipt from Amazon” → definitely a receipt

  • “Meeting notes” → not a document

So I completely rebuilt the classifier:

    # v2.0 Classification Flow
    def classify_email(subject, body):
        """
        Classify based on EMAIL content, not attachment content.
        Subject is weighted 3x for importance.
        """
        # Combine subject (3x weight) + body
        text = f"{subject} {subject} {subject} {body}"

        # EXACT keyword matching (no fuzzy logic)
        for doc_type, keywords in EXACT_KEYWORDS.items():
            for keyword in keywords:
                if re.search(rf'\b{keyword}\b', text, re.IGNORECASE):
                    return doc_type, 1.0  # 100% confidence

        return None, 0.0

The results?

  • 3x faster: No PDF extraction during processing

  • 🎯 More accurate: Email subjects are clearer than attachment content

  • 📎 Handles all file types: Works with scanned images, PNGs, JPGs

  • 🌍 Multi-language: Supports multiple languages (configurable keywords)

Architecture: How It Actually Works 🏗️

The Pipeline

    Gmail IMAP  →  Email Parser  →  Classifier  →  File Manager  →  Report
        ↓              ↓               ↓               ↓              ↓
    Fetch ALL      Extract         Analyze         Save ALL       Generate
    emails         subject +       email           attachments    statistics
                   body +          content         (dedupe)
                   attachments     (exact
                                   keywords)
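In code, the pipeline is a straight-line composition of those five stages. A schematic sketch — the stage objects and method names here are illustrative, not the actual module interfaces:

```python
def run_pipeline(client, classifier, files, reporter, since, until):
    """Illustrative orchestration of the five pipeline stages."""
    # Stage 1: fetch ALL emails in the date range
    for email in client.fetch(since, until):
        # Stage 2+3: classify on subject + body (exact keywords)
        doc_type, confidence = classifier.classify(email.subject, email.body)
        if doc_type is not None:
            # Stage 4: save ALL attachments, deduplicating by content hash
            for attachment in email.attachments:
                files.save(doc_type, attachment)
    # Stage 5: generate the run statistics
    return reporter.summary()
```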

Core Components

1. GmailClient (src/gmail_client.py)

  • IMAP SSL connection with keepalive (prevents 30-min timeout)

  • Auto-detects “All Mail” folder in any Gmail language (works with localized Gmail interfaces)

  • Connection health checks and auto-reconnect

    # Gmail connections time out after 30 minutes.
    # We send NOOP every 5 minutes to keep the connection alive.
    def _start_keepalive(self):
        def keepalive_thread():
            while self.keepalive_running:
                time.sleep(300)  # 5 minutes
                try:
                    self.imap.noop()
                except Exception:
                    self._reconnect()

        # Daemon thread: runs in the background, never blocks processing
        threading.Thread(target=keepalive_thread, daemon=True).start()

2. DocumentClassifier (src/document_classifier.py)

  • EXACT keyword matching only (no fuzzy logic, no approximations)

  • Configurable keywords via YAML (supports any language)

  • Subject weighted 3x in classification

    # Classification keywords (v2.0) - fully configurable
    EXACT_KEYWORDS = {
        'invoices': ['invoice'],  # Add more in config/rules.yaml
        'receipts': ['receipt'],
    }

    # Match examples:
    # ✅ "Invoice #123"      → matches (exact word "invoice")
    # ✅ "Your Receipt"      → matches (exact word "receipt")
    # ❌ "invoicing system"  → NO match ("invoicing" ≠ "invoice")
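You can sanity-check the word-boundary behavior with a self-contained snippet (it redefines a minimal classifier locally, mirroring the logic above):

```python
import re

EXACT_KEYWORDS = {'invoices': ['invoice'], 'receipts': ['receipt']}

def classify(text):
    # Return the first document type whose keyword appears as a whole word
    for doc_type, keywords in EXACT_KEYWORDS.items():
        for kw in keywords:
            if re.search(rf'\b{kw}\b', text, re.IGNORECASE):
                return doc_type
    return None

print(classify("Invoice #123"))      # invoices
print(classify("Your Receipt"))      # receipts
print(classify("invoicing system"))  # None — no word boundary after "invoice"
```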

3. FileManager (src/file_manager.py)

  • SHA256 hash-based deduplication (content-based, not filename)

  • Organizes by document type and date: output/invoices/2024-12/invoice_001.pdf

  • Metadata tracking in JSON
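The content-based dedup boils down to hashing the attachment bytes before writing. A simplified sketch of the idea (not the actual file_manager.py code):

```python
import hashlib

seen_hashes = set()

def content_hash(data: bytes) -> str:
    # Hash the raw bytes, so identical files dedupe even if renamed
    return hashlib.sha256(data).hexdigest()

def save_if_new(data: bytes) -> bool:
    h = content_hash(data)
    if h in seen_hashes:
        return False  # duplicate content, skip
    seen_hashes.add(h)
    # ... write file under output/<type>/<YYYY-MM>/ and record metadata ...
    return True
```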

4. Connection Keepalive System

Gmail IMAP has a 30-minute idle timeout. For large email accounts (1000+ emails), this is a problem.

Solution: Background thread that sends NOOP command every 5 minutes:

    # Auto-reconnect if the connection drops
    def _check_connection(self):
        try:
            status = self.imap.noop()[0]
            return status == 'OK'
        except Exception:
            logger.warning("Connection lost, reconnecting...")
            return self._reconnect()

How to Use It: Quick Start 🚀

Setup (5 minutes)

# Clone repo
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Automated setup
./setup.sh   # Unix/Mac
# or
.\setup.ps1  # Windows

# Create .env file
cp .env.example .env
# Add your Gmail credentials

Gmail Setup (required):

  1. Enable IMAP in Gmail Settings

  2. Enable 2FA on your Google account

  3. Generate App Password: https://myaccount.google.com/apppasswords

  4. Add to .env:

GMAIL_EMAIL=your-email@gmail.com

GMAIL_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx
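Under the hood, the App Password is used as an ordinary IMAP password over SSL. Here is a minimal connectivity sketch using only the standard library — note that stripping separator characters before login is my assumption about how the displayed xxxx-xxxx format maps to the raw 16-character password:

```python
import imaplib
import os

def normalize_app_password(raw: str) -> str:
    # Assumption: Google displays the password in groups; IMAP wants
    # the bare 16 characters, so strip dashes/spaces defensively.
    return raw.replace("-", "").replace(" ", "")

def check_gmail_login() -> bool:
    # App Password works like a regular password over IMAP SSL
    imap = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    imap.login(os.environ["GMAIL_EMAIL"],
               normalize_app_password(os.environ["GMAIL_APP_PASSWORD"]))
    imap.logout()
    return True
```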

Run Interactive Mode

python main.py --interactive

Example session:

╭─────────────────────────────────────────────╮

│ Gmail Document Scraper - Interactive Mode │

╰─────────────────────────────────────────────╯

  

Gmail email: joao.fernandes@gmail.com

Gmail App Password: ****-****-****-****

Start date (YYYY-MM-DD): 2024-01-01

End date (YYYY-MM-DD): 2024-12-31

Folder [press Enter for ALL]:

  

🔍 Processing emails from 2024-01-01 to 2024-12-31...

📧 Found 847 emails to process

  

[####################################] 100%

  

✅ Processing complete!

📊 Results:

• Emails processed: 847

• Invoices found: 156

• Receipts found: 89

• Duplicates skipped: 12

• Total files saved: 233

  

📁 Files saved to: output/

📄 Report: reports/report_20240101_143022.json

Resume Interrupted Runs

# Gmail rate limiting? Connection dropped?

# Just resume from checkpoint

python main.py --resume

The tool saves progress in reports/.checkpoint.json and continues exactly where it left off.
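The checkpoint pattern itself is simple: persist the last processed position plus running stats after each batch. A minimal sketch (the real checkpoint file may store more fields):

```python
import json
from pathlib import Path
from typing import Optional

CHECKPOINT = Path("reports/.checkpoint.json")

def save_checkpoint(last_uid: str, stats: dict) -> None:
    # Persist progress so --resume can pick up exactly here
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps({"last_uid": last_uid, "stats": stats}))

def load_checkpoint() -> Optional[dict]:
    # Returns None on a fresh run (no checkpoint file yet)
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return None
```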

Docker Mode

# Build

docker-compose  build

  

# Run interactive

docker-compose  run  --rm  gmail-scraper  --interactive

  

# Resume

docker-compose  run  --rm  gmail-scraper  --resume

The Tech Stack 🛠️

Why these choices?

  • Python 3.9+: Mature email libraries (imaplib), great NLP ecosystem

  • spaCy: NLP model for entity recognition (optional, not required in v2.0)

  • pdfplumber: Best PDF text extraction library (for v1.0 compatibility)

  • Rich: Beautiful terminal UI with progress bars

  • Click: Clean CLI argument parsing

  • Docker: Zero-config deployment

Built with vibe coding using Claude AI - rapid iteration and pair programming with AI to get from idea to production-ready tool in days, not months.

Development tools:

  • pytest: Test coverage >80%

  • black: Code formatting (line-length: 100)

  • flake8: Linting

  • pre-commit: Git hooks for code quality

Configuration: Rules Are YAML 📝

config/rules.yaml:

invoices:
  display_name: "Invoices"
  keywords:
    - invoice
    # Add keywords for your language here
  patterns: []   # Disabled in v2.0
  entities: []   # Disabled in v2.0

receipts:
  display_name: "Receipts"
  keywords:
    - receipt
    # Add keywords for your language here
  patterns: []   # Disabled in v2.0
  entities: []   # Disabled in v2.0

Want to add more document types? Just add entries to rules.yaml. Want stricter matching? Adjust confidence_threshold in config/config.yaml.
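For example, a hypothetical contracts category (keyword list purely illustrative) would be a parallel block in the same file:

```yaml
contracts:
  display_name: "Contracts"
  keywords:
    - contract
    - agreement
  patterns: []   # Disabled in v2.0
  entities: []   # Disabled in v2.0
```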

Testing: Quick & Comprehensive 🧪

# Run all tests with coverage

pytest  tests/  -v  --cov=src  --cov-report=html

  

# Quick integration test (10 emails only)

python  test_quick.py

  

# Verify installation

python  test_installation.py

Test structure:

tests/

├── test_config_loader.py # Config validation

├── test_document_classifier.py # Classification logic

└── test_file_manager.py # File operations

Coverage requirements: >80% on src/ modules

Why Open Source? 🌍

I built this tool because I needed it for my startup. Then I realized every tech company, startup, and business has this problem - invoice and receipt organization is universally painful.

Making it open source means:

  • Community contributions: Better language support, more document types

  • Transparency: You can audit exactly what it does with your email data

  • Customization: Fork it, modify it, make it yours

  • No vendor lock-in: Your data stays local, no API subscriptions

Plus, building in public with AI assistance (Claude) showed me how fast you can ship when you combine human product vision with AI coding velocity.

License: MIT (do whatever you want with it)

Performance: Real Numbers 📊

Tested on my personal Gmail account (5+ years of emails):

| Metric | v1.0 | v2.0 |
|--------|------|------|
| Emails processed | 847 | 847 |
| Processing time | 18m 23s | 6m 12s |
| Invoices found | 142 | 156 |
| False positives | 8 | 2 |
| Scanned PDFs handled | ❌ Failed | ✅ Success |

v2.0 is 3x faster and finds 10% more documents with 75% fewer false positives.

Limitations & Future Work 🚧

Current Limitations

Language Support: Ships with English keywords. Add your language keywords in config/rules.yaml.

Exact Matching Only: Won’t match synonyms like “bill” or “payment slip” (by design - reduces false positives).

Gmail Only: Built for Gmail IMAP. Other email providers need testing (should work with any IMAP server).

Roadmap (Contributions Welcome!)

  • LLM-based classification (GPT-4/Claude API integration for 95%+ accuracy)

  • Structured metadata extraction (invoice numbers, dates, amounts)

  • CSV export (export metadata to spreadsheet)

  • Web UI (Django/Flask dashboard for non-technical users)

  • More languages (Spanish, French, German keywords)

  • Outlook support (Microsoft Graph API integration)

Contributing 🤝

PRs welcome! Areas where help is needed:

  1. Language support: Add keywords for your language in config/rules.yaml

  2. Email provider testing: Test with Outlook, ProtonMail, etc.

  3. Documentation: Improve guides, add tutorials

Development setup:

# Clone repo
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Run tests
make test

# Format code
make format

Conclusion: Build Tools You Need 🎯

This project started as something I’d been postponing for years - a task so tedious my CFO was going crazy. Using vibe coding with Claude AI, I went from idea to production-ready tool that processes thousands of emails reliably.

Key lessons learned:

  1. Sometimes the obvious solution is wrong: v1.0 seemed logical (extract PDF text), but v2.0’s email-based classification is simpler and better.

  2. IMAP is harder than it looks: Connection timeouts, rate limiting, folder name localization - there’s hidden complexity.

  3. AI-assisted development is a force multiplier: Using Claude for pair programming let me iterate incredibly fast. What would’ve taken weeks took days.

  4. Open source wins: Releasing this publicly led to better architecture decisions (I knew others would read the code).

Try it yourself:

git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper
./setup.sh
python main.py --interactive

Found it useful? ⭐ Star the repo or contribute a PR!

Got questions or ideas? Open an issue or reach out: joao.fernandes@docdigitizer.com


Built with ❤️ for document automation using vibe coding with Claude AI. Licensed under MIT.


A Not-So-Brief Story of LLMs and Their Impact on Humans

For decades, we measured AI by its ability to solve problems.

Beat Kasparov at chess. Win at Jeopardy. Classify images. Optimize routes. Powerful stuff — but always at arm's length. Impressive, but impersonal.

What changed?

People say "it understands me" when they talk to ChatGPT. That sentence is wrong, but revealing.

What they're feeling isn't understanding. It's coherence across time.

The machine remembers what was said five turns ago. It brings it forward. It relates it to something new. It doesn't reset after each sentence.

For decades, machines did exactly that: reset.

We assumed intelligence was primarily about problem-solving.

- Solve this puzzle.

- Optimize this path.

- Maximize this outcome.

That view made sense in a world dominated by engineering. But human intelligence was never just that.

Human intelligence isn't impressive because it solves problems efficiently. It's impressive because it relates problems to each other.

We connect a technical issue to an ethical concern. A personal experience to an abstract idea. A memory to a future possibility. We move fluidly across domains without instructions.

And the medium through which all of that happens is language.

Not language as grammar. Not language as syntax.

Language as the substrate of thought.

That's the shift.

AI didn't suddenly become "intelligent" in the classical sense. It became fluent. It gained the ability to participate in dialogue. To sustain conversation. To move between topics without breaking character.

And that fluency tricks us into feeling like we're being understood — even when we know, intellectually, that we're not.

I recorded a 45-minute deep dive on this for my podcast: "A Not-So-Brief Story of LLMs and Their Impact on Humans"

If you're interested in thinking through this at a conceptual level — not just hyping or dooming — the episode is here: https://open.spotify.com/episode/7APEsUJmVbEDleNWsFiHk1

Do you think fluency is enough to replace human thought in domains like leadership, strategy, or design? Or is there something else we have that can't be replicated through language alone?

Curious to hear how others are thinking about this.


Welcome to r/DocDigitizer — The Community for IDP, Automation & Document Intelligence
Brand Affiliate

Welcome to the official DocDigitizer Community — a place for developers, integrators, automation engineers, partners, and curious builders to discuss everything related to:

  • Intelligent Document Processing (IDP)

  • Extraction, classification & validation

  • DocDigitizer APIs, SDKs & connectors

  • Low-code/no-code integrations

  • Automation, RPA, BPM & AI workflows

  • Community tools, best practices & collaboration

Whether you're processing invoices, contracts, resumes, forms, or complex multi-doc streams — you’re in the right place.

🎯 What This Community Is For

This subreddit exists to:

✓ Share knowledge

Tips, tricks, examples, integrations, architectures, and lessons learned.

✓ Ask questions

Troubleshooting, use cases, errors, performance, scaling, SDK behavior, document structures, etc.

✓ Showcase tools

Wrappers, utilities, integrations, low-code flows, CLI tools, open-source components.

✓ Discuss IDP industry topics

OCR engines, automation frameworks, benchmarks, compliance, data quality, challenges.

✓ Collaborate

Co-building tools, refining connectors, improving onboarding, sharing patterns.

If you're working with DocDigitizer, IDP, or automation ecosystems, this is your space.

📌 Start Here

To help you get the most out of this community:

1. Read the community rules

They are simple, reasonable, and designed to keep the space high-quality.
👉 You can find them in the sidebar and wiki.

2. When posting a question

Please include:

  • Your integration type (API, SDK, n8n, OutSystems, Salesforce, etc.)

  • Document type (invoice, contract, form, multi-doc batch, etc.)

  • Redacted samples if needed

  • Error messages or logs (sanitized)

  • Expected vs actual behavior

This helps the community give accurate and fast answers.

3. Protect sensitive data

Remove PII, IDs, financial data, or customer documents.
Always upload only masked or synthetic data.

💬 Types of Posts You Can Share

  • 🟩 Questions / Help

  • 🟦 How-to guides / Tutorials

  • 🟧 Integrations & Automation Flows

  • 🟥 Bugs / Issues / Error Diagnoses

  • 🟪 Open-source tools & scripts

  • 🟨 Feature suggestions / Feedback

  • 🟫 Technical discussions (IDP / OCR / AI)

Use post flairs when possible — it helps everyone stay organized.

🤝 Not Official Support — But the Next Best Thing

This subreddit is community-driven.
You’ll find real-world experience, people integrating DocDigitizer in production, and plenty of practical insights.

For SLA-backed support or account-specific topics, please continue using the official channels.

🚀 Let’s Build the Best IDP Community Together

Ask questions.
Share your tools.
Show integrations.
Collaborate on ideas.
Push the boundaries of document automation.

Thanks for joining — and welcome to r/DocDigitizer!