TL;DR: I built an open-source Python tool that automatically extracts invoices and receipts from Gmail by classifying email content. v2.0 ships with a major architectural shift that’s 3x faster and way more accurate.
The Problem: My Accounting Nightmare 💸
As CEO of a tech startup, I’m comfortable with most technical challenges. But there’s one task I’d been postponing for literally years: organizing our company invoices and receipts.
Here’s what made it so painful: hundreds of invoices scattered across our company Gmail account, buried in 5+ years of email history. Every quarter, my CFO would ask me to extract them for accounting. Every quarter, I’d spend 2-3 hours manually searching “invoice”, downloading attachments one by one, renaming files, organizing by date.
After processing maybe 50 emails, I’d give up and postpone it another quarter. My CFO was going crazy.
Finally, I did what any developer-CEO would do when faced with repetitive manual labor: I built a tool to automate it. And I did it using vibe coding with Claude AI - iterating rapidly until it worked exactly how I needed.
The Journey: v1.0 vs v2.0 (Spoiler: Complete Rewrite) 🔄
v1.0: The PDF Text Extraction Trap
My first version seemed logical:
- Fetch emails with attachments
- Download each PDF
- Extract text from PDF using pdfplumber
- Run NLP classification on extracted text
- Save if it’s an invoice
The problem? This approach was painfully slow. Processing 200 emails took 15+ minutes because:
- PDF text extraction is expensive (especially for scanned documents)
- Many invoices are actually scanned images with no text layer
- I was processing attachments that clearly weren’t documents
v2.0: The Email Subject Is All You Need 💡
Then I had a realization: The email subject line usually tells you what’s in the attachment.
Think about it:
- “Invoice #2024-001 - January Services” → definitely an invoice
- “Your receipt from Amazon” → definitely a receipt
- “Meeting notes” → not a document
So I completely rebuilt the classifier:
# v2.0 Classification Flow
import re

def classify_email(subject, body):
    """
    Classify based on EMAIL content, not attachment content.
    Subject is weighted 3x for importance.
    """
    # Combine subject (3x weight) + body
    text = f"{subject} {subject} {subject} {body}"
    # EXACT keyword matching (no fuzzy logic)
    # EXACT_KEYWORDS maps each document type to its list of keywords
    for doc_type, keywords in EXACT_KEYWORDS.items():
        for keyword in keywords:
            # re.escape keeps keywords containing regex characters literal
            if re.search(rf'\b{re.escape(keyword)}\b', text, re.IGNORECASE):
                return doc_type, 1.0  # 100% confidence
    return None, 0.0
The results?
- ⚡ 3x faster: No PDF extraction during processing
- 🎯 More accurate: Email subjects are clearer than attachment content
- 📎 Handles all file types: Works with scanned images, PNGs, JPGs
- 🌍 Multi-language: Supports multiple languages (configurable keywords)
Architecture: How It Actually Works 🏗️
The Pipeline
Gmail IMAP → Email Parser → Classifier → File Manager → Report
    ↓             ↓             ↓             ↓            ↓
Fetch ALL     Extract       Analyze       Save ALL     Generate
emails        subject+      email         attachments  statistics
              body+         content       (dedupe)
              attachments   (exact keywords)
Core Components
1. GmailClient (src/gmail_client.py)
- IMAP SSL connection with keepalive (prevents 30-min timeout)
- Auto-detects “All Mail” folder in any Gmail language (works with localized Gmail interfaces)
- Connection health checks and auto-reconnect
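# Gmail connections time out after 30 minutes of inactivity
# We send NOOP every 5 minutes to keep it alive
import threading
import time

def _start_keepalive(self):
    def keepalive_thread():
        while self.keepalive_running:
            time.sleep(300)  # 5 minutes
            try:
                self.imap.noop()
            except Exception:
                self._reconnect()

    # Start the loop in a daemon thread so it never blocks program exit
    self.keepalive_running = True
    threading.Thread(target=keepalive_thread, daemon=True).start()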
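The “All Mail” auto-detection in the list above can be done without knowing the localized folder name, because Gmail marks that mailbox with the `\All` special-use flag in its IMAP LIST response. Here's a minimal sketch of the idea, not the project's actual code: `find_all_mail_folder` is a hypothetical helper, the sample lines are illustrative, and it assumes the `"/"` delimiter Gmail uses.

```python
import re

# One IMAP LIST response line looks like: (flags) "delimiter" "name"
LIST_LINE = re.compile(
    rb'\((?P<flags>[^)]*)\) "(?P<delim>[^"]*)" "?(?P<name>[^"]*)"?'
)

def find_all_mail_folder(list_lines):
    """Return the mailbox name carrying the \\All special-use flag, or None."""
    for line in list_lines:
        match = LIST_LINE.match(line)
        if match and rb"\All" in match.group("flags").split():
            # Mailbox names arrive ASCII-safe (modified UTF-7), so this decode is safe
            return match.group("name").decode("ascii")
    return None

# Illustrative response lines, shaped like what imaplib's list() returns
sample = [
    rb'(\HasNoChildren) "/" "INBOX"',
    rb'(\HasChildren \Noselect) "/" "[Gmail]"',
    rb'(\All \HasNoChildren) "/" "[Gmail]/Todos os e-mails"',
]
folder = find_all_mail_folder(sample)
print(folder)
```

With a live connection, the second element of `imap.list()` would be passed in, and the returned name (wrapped in double quotes) handed to `imap.select(...)`.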
2. DocumentClassifier (src/document_classifier.py)
- EXACT keyword matching only (no fuzzy logic, no approximations)
- Configurable keywords via YAML (supports any language)
- Subject weighted 3x in classification
# Classification keywords (v2.0) - fully configurable
EXACT_KEYWORDS = {
    'invoices': ['invoice'],  # Add more in config/rules.yaml
    'receipts': ['receipt']
}
# Match example:
# ✅ "Invoice #123" → matches (exact word "invoice")
# ✅ "Your Receipt" → matches (exact word "receipt")
# ❌ "invoicing system" → NO match ("invoicing" ≠ "invoice")
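To make the match examples above concrete, here's a small runnable sketch of whole-word matching with `\b` boundaries. It's an illustration under assumptions, not the shipped classifier: the `classify` function and `PATTERNS` table are hypothetical names. Since an exact match is all-or-nothing, this version gives the subject priority by checking it before the body instead of repeating it.

```python
import re

# Hypothetical keyword map mirroring the EXACT_KEYWORDS shown above
EXACT_KEYWORDS = {
    "invoices": ["invoice"],
    "receipts": ["receipt"],
}

# Precompile one whole-word, case-insensitive pattern per document type
PATTERNS = {
    doc_type: re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    for doc_type, keywords in EXACT_KEYWORDS.items()
}

def classify(subject, body=""):
    """Return (doc_type, confidence); the subject is checked before the body."""
    for text in (subject, body):
        for doc_type, pattern in PATTERNS.items():
            if pattern.search(text):
                return doc_type, 1.0
    return None, 0.0

for subject in ["Invoice #123", "Your Receipt", "invoicing system"]:
    print(subject, "->", classify(subject))
```

Note that `\b` also means the plural “invoices” won't match the keyword “invoice”, so plurals need their own keyword entries.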
3. FileManager (src/file_manager.py)
- SHA256 hash-based deduplication (content-based, not filename)
- Organizes by document type and date: output/invoices/2024-12/invoice_001.pdf
- Metadata tracking in JSON
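Content-based deduplication boils down to hashing the attachment bytes and skipping anything already seen, whatever the file is called. Here's a rough sketch of that idea; `save_unique` and its parameters are hypothetical names chosen for illustration, and the real FileManager may track hashes differently (for example in its JSON metadata).

```python
import hashlib
import tempfile
from pathlib import Path

def save_unique(data, doc_type, month, filename, out_dir, seen_hashes):
    """Write data to out_dir/doc_type/month/filename unless its SHA-256 was seen.

    Returns the written path, or None when the content is a duplicate.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return None
    seen_hashes.add(digest)
    target = Path(out_dir) / doc_type / month / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target

# Demo: the same bytes under a different filename are skipped
with tempfile.TemporaryDirectory() as tmp:
    seen = set()
    first = save_unique(b"%PDF-demo", "invoices", "2024-12", "invoice_001.pdf", tmp, seen)
    second = save_unique(b"%PDF-demo", "invoices", "2024-12", "renamed_copy.pdf", tmp, seen)
    first_exists = first.exists()
print(first_exists, second)
```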
4. Connection Keepalive System ⚡
Gmail IMAP has a 30-minute idle timeout. For large email accounts (1000+ emails), this is a problem.
Solution: a background thread sends a NOOP command every 5 minutes, and a health check reconnects if the connection has dropped anyway:
# Auto-reconnect if connection drops
import imaplib

def _check_connection(self):
    try:
        # noop() returns (status, data); 'OK' means the session is alive
        status = self.imap.noop()[0]
        return status == 'OK'
    except (imaplib.IMAP4.error, OSError):
        logger.warning("Connection lost, reconnecting...")
        return self._reconnect()
How to Use It: Quick Start 🚀
Setup (5 minutes)
# Clone repo
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper
# Automated setup
./setup.sh        # Unix/Mac
# or
.\setup.ps1       # Windows
# Create .env file
cp .env.example .env
# Add your Gmail credentials
Gmail Setup (required):
- Enable IMAP in Gmail Settings
- Enable 2FA on your Google account
- Generate an App Password
- Add to .env:
GMAIL_EMAIL=your-email@gmail.com
GMAIL_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx
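For readers curious how those two values end up in an IMAP login, here's a minimal standard-library sketch. It's an assumption about the flow, not the project's loader (which may well use a dotenv library): `parse_env` and `connect` are hypothetical helpers, and `connect` is defined but not called here because it needs real credentials.

```python
import imaplib

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def connect(env):
    """Open an IMAP SSL session to Gmail with the parsed credentials."""
    imap = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    imap.login(env["GMAIL_EMAIL"], env["GMAIL_APP_PASSWORD"])
    return imap

sample = (
    "# Gmail credentials\n"
    "GMAIL_EMAIL=your-email@gmail.com\n"
    "GMAIL_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx\n"
)
env = parse_env(sample)
print(sorted(env))
```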
Run Interactive Mode
python main.py --interactive
Example session:
╭─────────────────────────────────────────────╮
│  Gmail Document Scraper - Interactive Mode  │
╰─────────────────────────────────────────────╯
Gmail email: joao.fernandes@gmail.com
Gmail App Password: ****-****-****-****
Start date (YYYY-MM-DD): 2024-01-01
End date (YYYY-MM-DD): 2024-12-31
Folder [press Enter for ALL]:
🔍 Processing emails from 2024-01-01 to 2024-12-31...
📧 Found 847 emails to process
[####################################] 100%
✅ Processing complete!
📊 Results:
• Emails processed: 847
• Invoices found: 156
• Receipts found: 89
• Duplicates skipped: 12
• Total files saved: 233
📁 Files saved to: output/
📄 Report: reports/report_20240101_143022.json
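The start and end dates from the session have to be translated into IMAP SEARCH criteria, which use a `DD-Mon-YYYY` format and treat `BEFORE` as exclusive. Here's a sketch of how that conversion might look; `build_search` and `imap_date` are hypothetical helpers, and I'm assuming the tool means the end date to be inclusive.

```python
from datetime import date, timedelta

# IMAP dates always use English month abbreviations, whatever the locale
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def imap_date(d):
    """Format a date as DD-Mon-YYYY for IMAP SEARCH."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"

def build_search(start_iso, end_iso):
    """Build IMAP SEARCH criteria covering start..end inclusive."""
    start = date.fromisoformat(start_iso)
    end = date.fromisoformat(end_iso)
    if end < start:
        raise ValueError("end date precedes start date")
    # BEFORE is exclusive, so add one day to include the end date itself
    day_after_end = end + timedelta(days=1)
    return f"(SINCE {imap_date(start)} BEFORE {imap_date(day_after_end)})"

criteria = build_search("2024-01-01", "2024-12-31")
print(criteria)  # (SINCE 01-Jan-2024 BEFORE 01-Jan-2025)
```

The resulting string can be passed to imaplib as `imap.search(None, criteria)`.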
Resume Interrupted Runs
# Gmail rate limiting? Connection dropped?
# Just resume from checkpoint
python main.py --resume
The tool saves progress in reports/.checkpoint.json and continues exactly where it left off.
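A checkpoint like this can be as simple as a JSON file listing the message UIDs already handled. The sketch below shows one way to do it; the file layout (`processed_uids`) and the helper names are assumptions for illustration, not the tool's real checkpoint format.

```python
import json
import os
import tempfile
from pathlib import Path

def save_checkpoint(path, processed_uids):
    """Write the processed UIDs as JSON via a temp file plus atomic rename."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"processed_uids": sorted(processed_uids)}))
    # A crash mid-write leaves the previous checkpoint intact
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the set of processed UIDs, or an empty set on a fresh run."""
    path = Path(path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text())["processed_uids"])

def pending(all_uids, path):
    """Filter out UIDs that a previous run already processed."""
    done = load_checkpoint(path)
    return [uid for uid in all_uids if uid not in done]

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = Path(tmp_dir) / "reports" / ".checkpoint.json"
    save_checkpoint(ckpt, {"101", "102"})
    remaining = pending(["101", "102", "103"], ckpt)
print(remaining)  # ['103']
```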
Docker Mode
# Build
docker-compose build
# Run interactive
docker-compose run --rm gmail-scraper --interactive
# Resume
docker-compose run --rm gmail-scraper --resume
The Tech Stack 🛠️
Why these choices?
- Python 3.9+: Mature email libraries (imaplib), great NLP ecosystem
- spaCy: NLP model for entity recognition (optional, not required in v2.0)
- pdfplumber: Best PDF text extraction library (for v1.0 compatibility)
- Rich: Beautiful terminal UI with progress bars
- Click: Clean CLI argument parsing
- Docker: Zero-config deployment
Built with vibe coding using Claude AI - rapid iteration and pair programming with AI to get from idea to production-ready tool in days, not months.
Development tools:
- pytest: Test coverage >80%
- black: Code formatting (line-length: 100)
- flake8: Linting
- pre-commit: Git hooks for code quality
Configuration: Rules Are YAML 📝
config/rules.yaml:
invoices:
  display_name: "Invoices"
  keywords:
    - invoice  # Add keywords for your language here
  patterns: []  # Disabled in v2.0
  entities: []  # Disabled in v2.0
receipts:
  display_name: "Receipts"
  keywords:
    - receipt  # Add keywords for your language here
  patterns: []  # Disabled in v2.0
  entities: []  # Disabled in v2.0
Want to add more document types? Just add entries to rules.yaml. Want stricter matching? Adjust confidence_threshold in config/config.yaml.
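To show how rules like these could feed the classifier, here's a sketch that turns the parsed config into a keyword map. The `RULES` dict stands in for what a YAML loader such as PyYAML's `yaml.safe_load` would return for this file; `build_keyword_map` is a hypothetical helper, and the extra “Factura” keyword is only there to illustrate adding another language.

```python
# Stand-in for the parsed config/rules.yaml (extra keywords are illustrative)
RULES = {
    "invoices": {
        "display_name": "Invoices",
        "keywords": ["invoice", "Factura", "invoice"],
        "patterns": [],
        "entities": [],
    },
    "receipts": {
        "display_name": "Receipts",
        "keywords": ["receipt"],
        "patterns": [],
        "entities": [],
    },
}

def build_keyword_map(rules):
    """Map each document type to its lowercased, de-duplicated keywords."""
    keyword_map = {}
    for doc_type, rule in rules.items():
        cleaned = []
        for keyword in rule.get("keywords") or []:
            normalized = str(keyword).strip().lower()
            if normalized and normalized not in cleaned:
                cleaned.append(normalized)
        if not cleaned:
            # A type with no keywords could never match, so fail early
            raise ValueError(f"document type {doc_type!r} has no keywords")
        keyword_map[doc_type] = cleaned
    return keyword_map

keyword_map = build_keyword_map(RULES)
print(keyword_map)
```

Failing fast on an empty keyword list is a design choice in this sketch: a silent no-op rule is harder to debug than a startup error.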
Testing: Quick & Comprehensive 🧪
# Run all tests with coverage
pytest tests/ -v --cov=src --cov-report=html
# Quick integration test (10 emails only)
python test_quick.py
# Verify installation
python test_installation.py
Test structure:
tests/
├── test_config_loader.py        # Config validation
├── test_document_classifier.py  # Classification logic
└── test_file_manager.py         # File operations
Coverage requirements: >80% on src/ modules
Why Open Source? 🌍
I built this tool because I needed it for my startup. Then I realized every tech company, startup, and business has this problem - invoice and receipt organization is universally painful.
Making it open source means:
- ✅ Community contributions: Better language support, more document types
- ✅ Transparency: You can audit exactly what it does with your email data
- ✅ Customization: Fork it, modify it, make it yours
- ✅ No vendor lock-in: Your data stays local, no API subscriptions
Plus, building in public with AI assistance (Claude) showed me how fast you can ship when you combine human product vision with AI coding velocity.
License: MIT (do whatever you want with it)
Performance: Real Numbers 📊
Tested on my personal Gmail account (5+ years of emails):
| Metric | v1.0 | v2.0 |
|--------|------|------|
| Emails processed | 847 | 847 |
| Processing time | 18m 23s | 6m 12s |
| Invoices found | 142 | 156 |
| False positives | 8 | 2 |
| Scanned PDFs handled | ❌ Failed | ✅ Success |
v2.0 is about 3x faster (18m 23s vs 6m 12s), finds roughly 10% more invoices (156 vs 142), and produces 75% fewer false positives (2 vs 8).
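Those headline figures come straight from the table; here's the arithmetic spelled out as a quick check, using only the numbers reported above.

```python
def seconds(minutes, secs):
    """Convert a minutes-and-seconds duration to seconds."""
    return minutes * 60 + secs

v1_time = seconds(18, 23)  # 1103 s
v2_time = seconds(6, 12)   # 372 s

speedup = v1_time / v2_time
more_invoices = (156 - 142) / 142
fewer_false_positives = (8 - 2) / 8

print(f"{speedup:.2f}x faster")                              # 2.97x faster
print(f"{more_invoices:.1%} more invoices")                  # 9.9% more invoices
print(f"{fewer_false_positives:.0%} fewer false positives")  # 75% fewer false positives
```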
Limitations & Future Work 🚧
Current Limitations
Language Support: Ships with English keywords. Add your language keywords in config/rules.yaml.
Exact Matching Only: Won’t match synonyms like “bill” or “payment slip” (by design - reduces false positives).
Gmail Only: Built for Gmail IMAP. Other email providers need testing (should work with any IMAP server).
Roadmap (Contributions Welcome!)
- LLM-based classification (GPT-4/Claude API integration for 95%+ accuracy)
- Structured metadata extraction (invoice numbers, dates, amounts)
- CSV export (export metadata to spreadsheet)
- Web UI (Django/Flask dashboard for non-technical users)
- More languages (Spanish, French, German keywords)
- Outlook support (Microsoft Graph API integration)
Contributing 🤝
PRs welcome! Areas where help is needed:
- Language support: Add keywords for your language in config/rules.yaml
- Email provider testing: Test with Outlook, ProtonMail, etc.
- Documentation: Improve guides, add tutorials
Development setup:
# Clone repo
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper
# Install dev dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
# Run tests
make test
# Format code
make format
Conclusion: Build Tools You Need 🎯
This project started as something I’d been postponing for years - a task so tedious my CFO was going crazy. Using vibe coding with Claude AI, I went from idea to production-ready tool that processes thousands of emails reliably.
Key lessons learned:
- Sometimes the obvious solution is wrong: v1.0 seemed logical (extract PDF text), but v2.0’s email-based classification is simpler and better.
- IMAP is harder than it looks: Connection timeouts, rate limiting, folder name localization - there’s hidden complexity.
- AI-assisted development is a force multiplier: Using Claude for pair programming let me iterate incredibly fast. What would’ve taken weeks took days.
- Open source wins: Releasing this publicly led to better architecture decisions (I knew others would read the code).
Try it yourself:
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper
./setup.sh
python main.py --interactive
Found it useful? ⭐ Star the repo or contribute a PR!
Got questions or ideas? Open an issue or reach out.
Links:
- 📦 GitHub:
- 📖 Documentation: Full docs in the README
- 🐛 Issues: Report bugs or request features
- 💬 Discussions: Share your use cases
Built with ❤️ for document automation using vibe coding with Claude AI. Licensed under MIT.