How I Built an AI-Powered Gmail Invoice Extractor (And Why v2.0 Changed Everything)

TL;DR: Built an open-source Python tool that automatically extracts invoices and receipts from Gmail using email content classification. v2.0 ships with a major architectural shift that’s 3x faster and way more accurate. GitHub repo here.

The Problem: My Accounting Nightmare 💸

As CEO of a tech startup, I’m comfortable with most technical challenges. But there’s one task I’d been postponing for literally years: organizing our company invoices and receipts.

Here’s what made it so painful: hundreds of invoices scattered across our company Gmail account, buried in 5+ years of email history. Every quarter, my CFO would ask me to extract them for accounting. Every quarter, I’d spend 2-3 hours manually searching “invoice”, downloading attachments one by one, renaming files, organizing by date.

After processing maybe 50 emails, I’d give up and postpone it another quarter. My CFO was going crazy.

Finally, I did what any developer-CEO would do when faced with repetitive manual labor: I built a tool to automate it. And I did it using vibe coding with Claude AI - iterating rapidly until it worked exactly how I needed.

The Journey: v1.0 vs v2.0 (Spoiler: Complete Rewrite) 🔄

v1.0: The PDF Text Extraction Trap

My first version seemed logical:

  1. Fetch emails with attachments

  2. Download each PDF

  3. Extract text from PDF using pdfplumber

  4. Run NLP classification on extracted text

  5. Save if it’s an invoice

The problem? This approach was painfully slow. Processing 200 emails took 15+ minutes because:

  • PDF text extraction is expensive (especially for scanned documents)

  • Many invoices are actually scanned images with no text layer

  • I was processing attachments that clearly weren’t documents

v2.0: The Email Subject Is All You Need 💡

Then I had a realization: The email subject line usually tells you what’s in the attachment.

Think about it:

  • “Invoice #2024-001 - January Services” → definitely an invoice

  • “Your receipt from Amazon” → definitely a receipt

  • “Meeting notes” → not a document

So I completely rebuilt the classifier:

    # v2.0 Classification Flow
    def classify_email(subject, body):
        """
        Classify based on EMAIL content, not attachment content.
        Subject is weighted 3x for importance.
        """
        # Combine subject (3x weight) + body
        text = f"{subject} {subject} {subject} {body}"

        # EXACT keyword matching (no fuzzy logic)
        for doc_type, keywords in EXACT_KEYWORDS.items():
            for keyword in keywords:
                if re.search(rf'\b{keyword}\b', text, re.IGNORECASE):
                    return doc_type, 1.0  # 100% confidence

        return None, 0.0

The results?

  • 3x faster: No PDF extraction during processing

  • 🎯 More accurate: Email subjects are clearer than attachment content

  • 📎 Handles all file types: Works with scanned images, PNGs, JPGs

  • 🌍 Multi-language: Supports multiple languages (configurable keywords)

Architecture: How It Actually Works 🏗️

The Pipeline

    Gmail IMAP  →  Email Parser  →  Classifier  →  File Manager  →  Report
        ↓              ↓               ↓               ↓              ↓
    Fetch ALL      Extract         Analyze         Save ALL       Generate
    emails         subject +       email           attachments    statistics
                   body +          content         (dedupe)
                   attachments     (exact
                                   keywords)
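In code, the pipeline is a straight-line composition of those five stages. A schematic sketch — the stage objects and method names here are illustrative, not the actual module interfaces:

```python
def run_pipeline(client, classifier, files, reporter, since, until):
    """Illustrative orchestration of the five pipeline stages."""
    # Stage 1: fetch ALL emails in the date range
    for email in client.fetch(since, until):
        # Stage 2+3: classify on subject + body (exact keywords)
        doc_type, confidence = classifier.classify(email.subject, email.body)
        if doc_type is not None:
            # Stage 4: save ALL attachments, deduplicating by content hash
            for attachment in email.attachments:
                files.save(doc_type, attachment)
    # Stage 5: generate the run statistics
    return reporter.summary()
```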

Core Components

1. GmailClient (src/gmail_client.py)

  • IMAP SSL connection with keepalive (prevents 30-min timeout)

  • Auto-detects “All Mail” folder in any Gmail language (works with localized Gmail interfaces)

  • Connection health checks and auto-reconnect

    # Gmail connections time out after 30 minutes.
    # We send NOOP every 5 minutes to keep the connection alive.
    def _start_keepalive(self):
        def keepalive_thread():
            while self.keepalive_running:
                time.sleep(300)  # 5 minutes
                try:
                    self.imap.noop()
                except Exception:
                    self._reconnect()

        # Daemon thread: runs in the background, never blocks processing
        threading.Thread(target=keepalive_thread, daemon=True).start()

2. DocumentClassifier (src/document_classifier.py)

  • EXACT keyword matching only (no fuzzy logic, no approximations)

  • Configurable keywords via YAML (supports any language)

  • Subject weighted 3x in classification

    # Classification keywords (v2.0) - fully configurable
    EXACT_KEYWORDS = {
        'invoices': ['invoice'],  # Add more in config/rules.yaml
        'receipts': ['receipt'],
    }

    # Match examples:
    # ✅ "Invoice #123"      → matches (exact word "invoice")
    # ✅ "Your Receipt"      → matches (exact word "receipt")
    # ❌ "invoicing system"  → NO match ("invoicing" ≠ "invoice")
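You can sanity-check the word-boundary behavior with a self-contained snippet (it redefines a minimal classifier locally, mirroring the logic above):

```python
import re

EXACT_KEYWORDS = {'invoices': ['invoice'], 'receipts': ['receipt']}

def classify(text):
    # Return the first document type whose keyword appears as a whole word
    for doc_type, keywords in EXACT_KEYWORDS.items():
        for kw in keywords:
            if re.search(rf'\b{kw}\b', text, re.IGNORECASE):
                return doc_type
    return None

print(classify("Invoice #123"))      # invoices
print(classify("Your Receipt"))      # receipts
print(classify("invoicing system"))  # None — no word boundary after "invoice"
```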

3. FileManager (src/file_manager.py)

  • SHA256 hash-based deduplication (content-based, not filename)

  • Organizes by document type and date: output/invoices/2024-12/invoice_001.pdf

  • Metadata tracking in JSON
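The content-based dedup boils down to hashing the attachment bytes before writing. A simplified sketch of the idea (not the actual file_manager.py code):

```python
import hashlib

seen_hashes = set()

def content_hash(data: bytes) -> str:
    # Hash the raw bytes, so identical files dedupe even if renamed
    return hashlib.sha256(data).hexdigest()

def save_if_new(data: bytes) -> bool:
    h = content_hash(data)
    if h in seen_hashes:
        return False  # duplicate content, skip
    seen_hashes.add(h)
    # ... write file under output/<type>/<YYYY-MM>/ and record metadata ...
    return True
```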

4. Connection Keepalive System

Gmail IMAP has a 30-minute idle timeout. For large email accounts (1000+ emails), this is a problem.

Solution: Background thread that sends NOOP command every 5 minutes:

    # Auto-reconnect if the connection drops
    def _check_connection(self):
        try:
            status = self.imap.noop()[0]
            return status == 'OK'
        except Exception:
            logger.warning("Connection lost, reconnecting...")
            return self._reconnect()

How to Use It: Quick Start 🚀

Setup (5 minutes)

# Clone repo
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Automated setup
./setup.sh   # Unix/Mac
# or
.\setup.ps1  # Windows

# Create .env file
cp .env.example .env
# Add your Gmail credentials

Gmail Setup (required):

  1. Enable IMAP in Gmail Settings

  2. Enable 2FA on your Google account

  3. Generate App Password: https://myaccount.google.com/apppasswords

  4. Add to .env:

GMAIL_EMAIL=your-email@gmail.com

GMAIL_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx
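Under the hood, the App Password is used as an ordinary IMAP password over SSL. Here is a minimal connectivity sketch using only the standard library — note that stripping separator characters before login is my assumption about how the displayed xxxx-xxxx format maps to the raw 16-character password:

```python
import imaplib
import os

def normalize_app_password(raw: str) -> str:
    # Assumption: Google displays the password in groups; IMAP wants
    # the bare 16 characters, so strip dashes/spaces defensively.
    return raw.replace("-", "").replace(" ", "")

def check_gmail_login() -> bool:
    # App Password works like a regular password over IMAP SSL
    imap = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    imap.login(os.environ["GMAIL_EMAIL"],
               normalize_app_password(os.environ["GMAIL_APP_PASSWORD"]))
    imap.logout()
    return True
```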

Run Interactive Mode

python main.py --interactive

Example session:

╭─────────────────────────────────────────────╮

│ Gmail Document Scraper - Interactive Mode │

╰─────────────────────────────────────────────╯

  

Gmail email: joao.fernandes@gmail.com

Gmail App Password: ****-****-****-****

Start date (YYYY-MM-DD): 2024-01-01

End date (YYYY-MM-DD): 2024-12-31

Folder [press Enter for ALL]:

  

🔍 Processing emails from 2024-01-01 to 2024-12-31...

📧 Found 847 emails to process

  

[####################################] 100%

  

✅ Processing complete!

📊 Results:

• Emails processed: 847

• Invoices found: 156

• Receipts found: 89

• Duplicates skipped: 12

• Total files saved: 233

  

📁 Files saved to: output/

📄 Report: reports/report_20240101_143022.json

Resume Interrupted Runs

# Gmail rate limiting? Connection dropped?

# Just resume from checkpoint

python main.py --resume

The tool saves progress in reports/.checkpoint.json and continues exactly where it left off.
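The checkpoint pattern itself is simple: persist the last processed position plus running stats after each batch. A minimal sketch (the real checkpoint file may store more fields):

```python
import json
from pathlib import Path
from typing import Optional

CHECKPOINT = Path("reports/.checkpoint.json")

def save_checkpoint(last_uid: str, stats: dict) -> None:
    # Persist progress so --resume can pick up exactly here
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps({"last_uid": last_uid, "stats": stats}))

def load_checkpoint() -> Optional[dict]:
    # Returns None on a fresh run (no checkpoint file yet)
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return None
```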

Docker Mode

# Build

docker-compose  build

  

# Run interactive

docker-compose  run  --rm  gmail-scraper  --interactive

  

# Resume

docker-compose  run  --rm  gmail-scraper  --resume

The Tech Stack 🛠️

Why these choices?

  • Python 3.9+: Mature email libraries (imaplib), great NLP ecosystem

  • spaCy: NLP model for entity recognition (optional, not required in v2.0)

  • pdfplumber: Best PDF text extraction library (for v1.0 compatibility)

  • Rich: Beautiful terminal UI with progress bars

  • Click: Clean CLI argument parsing

  • Docker: Zero-config deployment

Built with vibe coding using Claude AI - rapid iteration and pair programming with AI to get from idea to production-ready tool in days, not months.

Development tools:

  • pytest: Test coverage >80%

  • black: Code formatting (line-length: 100)

  • flake8: Linting

  • pre-commit: Git hooks for code quality

Configuration: Rules Are YAML 📝

config/rules.yaml:

invoices:
  display_name: "Invoices"
  keywords:
    - invoice
    # Add keywords for your language here
  patterns: []   # Disabled in v2.0
  entities: []   # Disabled in v2.0

receipts:
  display_name: "Receipts"
  keywords:
    - receipt
    # Add keywords for your language here
  patterns: []   # Disabled in v2.0
  entities: []   # Disabled in v2.0

Want to add more document types? Just add entries to rules.yaml. Want stricter matching? Adjust confidence_threshold in config/config.yaml.
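For example, a hypothetical contracts category (keyword list purely illustrative) would be a parallel block in the same file:

```yaml
contracts:
  display_name: "Contracts"
  keywords:
    - contract
    - agreement
  patterns: []   # Disabled in v2.0
  entities: []   # Disabled in v2.0
```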

Testing: Quick & Comprehensive 🧪

# Run all tests with coverage

pytest  tests/  -v  --cov=src  --cov-report=html

  

# Quick integration test (10 emails only)

python  test_quick.py

  

# Verify installation

python  test_installation.py

Test structure:

tests/

├── test_config_loader.py # Config validation

├── test_document_classifier.py # Classification logic

└── test_file_manager.py # File operations

Coverage requirements: >80% on src/ modules

Why Open Source? 🌍

I built this tool because I needed it for my startup. Then I realized every tech company, startup, and business has this problem - invoice and receipt organization is universally painful.

Making it open source means:

  • Community contributions: Better language support, more document types

  • Transparency: You can audit exactly what it does with your email data

  • Customization: Fork it, modify it, make it yours

  • No vendor lock-in: Your data stays local, no API subscriptions

Plus, building in public with AI assistance (Claude) showed me how fast you can ship when you combine human product vision with AI coding velocity.

License: MIT (do whatever you want with it)

Performance: Real Numbers 📊

Tested on my personal Gmail account (5+ years of emails):

| Metric | v1.0 | v2.0 |
|--------|------|------|
| Emails processed | 847 | 847 |
| Processing time | 18m 23s | 6m 12s |
| Invoices found | 142 | 156 |
| False positives | 8 | 2 |
| Scanned PDFs handled | ❌ Failed | ✅ Success |

v2.0 is 3x faster and finds 10% more documents with 75% fewer false positives.

Limitations & Future Work 🚧

Current Limitations

Language Support: Ships with English keywords. Add your language keywords in config/rules.yaml.

Exact Matching Only: Won’t match synonyms like “bill” or “payment slip” (by design - reduces false positives).

Gmail Only: Built for Gmail IMAP. Other email providers need testing (should work with any IMAP server).

Roadmap (Contributions Welcome!)

  • LLM-based classification (GPT-4/Claude API integration for 95%+ accuracy)

  • Structured metadata extraction (invoice numbers, dates, amounts)

  • CSV export (export metadata to spreadsheet)

  • Web UI (Django/Flask dashboard for non-technical users)

  • More languages (Spanish, French, German keywords)

  • Outlook support (Microsoft Graph API integration)

Contributing 🤝

PRs welcome! Areas where help is needed:

  1. Language support: Add keywords for your language in config/rules.yaml

  2. Email provider testing: Test with Outlook, ProtonMail, etc.

  3. Documentation: Improve guides, add tutorials

Development setup:

# Clone repo
git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper

# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Run tests
make test

# Format code
make format

Conclusion: Build Tools You Need 🎯

This project started as something I’d been postponing for years - a task so tedious my CFO was going crazy. Using vibe coding with Claude AI, I went from idea to production-ready tool that processes thousands of emails reliably.

Key lessons learned:

  1. Sometimes the obvious solution is wrong: v1.0 seemed logical (extract PDF text), but v2.0’s email-based classification is simpler and better.

  2. IMAP is harder than it looks: Connection timeouts, rate limiting, folder name localization - there’s hidden complexity.

  3. AI-assisted development is a force multiplier: Using Claude for pair programming let me iterate incredibly fast. What would’ve taken weeks took days.

  4. Open source wins: Releasing this publicly led to better architecture decisions (I knew others would read the code).

Try it yourself:

git clone https://github.com/yourusername/gmail-doc-scrapper.git
cd gmail-doc-scrapper
./setup.sh
python main.py --interactive

Found it useful? ⭐ Star the repo or contribute a PR!

Got questions or ideas? Open an issue or reach out: joao.fernandes@docdigitizer.com


Built with ❤️ for document automation using vibe coding with Claude AI. Licensed under MIT.


A Not-So-Brief Story of LLMs and Their Impact on Humans

For decades, we measured AI by its ability to solve problems.

Beat Kasparov at chess. Win at Jeopardy. Classify images. Optimize routes. Powerful stuff — but always at arm's length. Impressive, but impersonal.

What changed?

People say "it understands me" when they talk to ChatGPT. That sentence is wrong, but revealing.

What they're feeling isn't understanding. It's coherence across time.

The machine remembers what was said five turns ago. It brings it forward. It relates it to something new. It doesn't reset after each sentence.

For decades, machines did exactly that: reset.

We assumed intelligence was primarily about problem-solving.

- Solve this puzzle.

- Optimize this path.

- Maximize this outcome.

That view made sense in a world dominated by engineering. But human intelligence was never just that.

Human intelligence isn't impressive because it solves problems efficiently. It's impressive because it relates problems to each other.

We connect a technical issue to an ethical concern. A personal experience to an abstract idea. A memory to a future possibility. We move fluidly across domains without instructions.

And the medium through which all of that happens is language.

Not language as grammar. Not language as syntax.

Language as the substrate of thought.

That's the shift.

AI didn't suddenly become "intelligent" in the classical sense. It became fluent. It gained the ability to participate in dialogue. To sustain conversation. To move between topics without breaking character.

And that fluency tricks us into feeling like we're being understood — even when we know, intellectually, that we're not.

I recorded a 45-minute deep dive on this for my podcast: "A Not-So-Brief Story of LLMs and Their Impact on Humans"

If you're interested in thinking through this at a conceptual level — not just hyping or dooming — the episode is here: https://open.spotify.com/episode/7APEsUJmVbEDleNWsFiHk1

Do you think fluency is enough to replace human thought in domains like leadership, strategy, or design? Or is there something else we have that can't be replicated through language alone?

Curious to hear how others are thinking about this.


Welcome to r/DocDigitizer — The Community for IDP, Automation & Document Intelligence
Brand Affiliate

Welcome to the official DocDigitizer Community — a place for developers, integrators, automation engineers, partners, and curious builders to discuss everything related to:

  • Intelligent Document Processing (IDP)

  • Extraction, classification & validation

  • DocDigitizer APIs, SDKs & connectors

  • Low-code/no-code integrations

  • Automation, RPA, BPM & AI workflows

  • Community tools, best practices & collaboration

Whether you're processing invoices, contracts, resumes, forms, or complex multi-doc streams — you’re in the right place.

🎯 What This Community Is For

This subreddit exists to:

✓ Share knowledge

Tips, tricks, examples, integrations, architectures, and lessons learned.

✓ Ask questions

Troubleshooting, use cases, errors, performance, scaling, SDK behavior, document structures, etc.

✓ Showcase tools

Wrappers, utilities, integrations, low-code flows, CLI tools, open-source components.

✓ Discuss IDP industry topics

OCR engines, automation frameworks, benchmarks, compliance, data quality, challenges.

✓ Collaborate

Co-building tools, refining connectors, improving onboarding, sharing patterns.

If you're working with DocDigitizer, IDP, or automation ecosystems, this is your space.

📌 Start Here

To help you get the most out of this community:

1. Read the community rules

They are simple, reasonable, and designed to keep the space high-quality.
👉 You can find them in the sidebar and wiki.

2. When posting a question

Please include:

  • Your integration type (API, SDK, n8n, OutSystems, Salesforce, etc.)

  • Document type (invoice, contract, form, multi-doc batch, etc.)

  • Redacted samples if needed

  • Error messages or logs (sanitized)

  • Expected vs actual behavior

This helps the community give accurate and fast answers.

3. Protect sensitive data

Remove PII, IDs, financial data, or customer documents.
Always upload only masked or synthetic data.

💬 Types of Posts You Can Share

  • 🟩 Questions / Help

  • 🟦 How-to guides / Tutorials

  • 🟧 Integrations & Automation Flows

  • 🟥 Bugs / Issues / Error Diagnoses

  • 🟪 Open-source tools & scripts

  • 🟨 Feature suggestions / Feedback

  • 🟫 Technical discussions (IDP / OCR / AI)

Use post flairs when possible — it helps everyone stay organized.

🤝 Not Official Support — But the Next Best Thing

This subreddit is community-driven.
You’ll find real-world experience, people integrating DocDigitizer in production, and plenty of practical insights.

For SLA-backed support or account-specific topics, please continue using the official channels.

🚀 Let’s Build the Best IDP Community Together

Ask questions.
Share your tools.
Show integrations.
Collaborate on ideas.
Push the boundaries of document automation.

Thanks for joining — and welcome to r/DocDigitizer!