Catching Madoff

Inspiration

Payments companies lose money every day, not because fraud is undetectable, but because reviewing flagged transactions at scale is slow, opaque, and exhausting. Existing tools flag transactions and show a score, but they rarely explain why. A human reviewer staring at "score: 0.87" with no context has to make a consequential decision blind.

We wanted to build the tool we'd actually want to use on a Monday morning: one that catches fraud and tells you exactly why it's suspicious, in plain language.

What We Built

Catching Madoff is a full-stack fraud detection and review platform built in 24 hours for the Valpay challenge. It ingests 1,000 credit card transactions, scores every one of them, and surfaces a human-readable review queue where an analyst can triage flagged transactions with full context. No SQL, no guesswork.

The system detects four fraud patterns:

Card testing — rapid micro-transactions on a single card in a tight window, a stolen card being tested before a big purchase
Amount spike — a transaction far above a card's normal spend baseline (we model each card's median and flag deviations above a learned threshold)
Merchant burst — a single merchant hit by multiple different cards in a short window, the signature of a compromised merchant
New device + spike — a large transaction on a device never seen for that card, a classic account takeover signal
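
As an illustration of the amount spike pattern, here is a minimal pandas sketch that computes each card's median spend and flags transactions far above it. The column names (card_id, amount) and the fixed ratio_threshold of 10 are assumptions made for this example; the write-up describes a learned threshold, which is not reproduced here.

```python
import pandas as pd


def flag_amount_spikes(tx: pd.DataFrame, ratio_threshold: float = 10.0) -> pd.DataFrame:
    """Add each card's median spend, the spend ratio, and a boolean spike flag."""
    out = tx.copy()
    # Per-card baseline: the median amount over that card's transactions.
    out["card_median"] = out.groupby("card_id")["amount"].transform("median")
    out["spend_ratio"] = out["amount"] / out["card_median"]
    # ratio_threshold is a hypothetical fixed cutoff, not a learned one.
    out["amount_spike"] = out["spend_ratio"] > ratio_threshold
    return out


# Toy data: card A has one outlier, card B spends consistently.
tx = pd.DataFrame({
    "card_id": ["A", "A", "A", "A", "B", "B", "B"],
    "amount": [20.0, 30.0, 25.0, 1275.0, 100.0, 110.0, 90.0],
})
flagged = flag_amount_spikes(tx)
```

Using the median rather than the mean keeps the baseline from being dragged upward by the very outlier it is supposed to expose.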

Every flag comes with a human-readable explanation built from the signals that fired, weighted by their log-odds contribution to the final score. The score itself is a calibrated probability:

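$$\text{score} = \sigma\left(\log\frac{0.07}{0.93} + \sum_i w_i\right)$$

where $ w_i $ is the log-odds weight of each signal that fired and $ \sigma $ is the sigmoid function. The constant term is the log-odds of a 7% baseline, so a transaction with no signals fired scores 0.07.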
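
The formula above can be sketched in a few lines of Python. The signal names and weight values below are hypothetical placeholders, since the actual weights are not listed here; only the 0.07 baseline comes from the formula.

```python
import math

PRIOR_FRAUD_RATE = 0.07  # baseline taken from the formula above

# Hypothetical log-odds weights; the real values are not given in this write-up.
SIGNAL_WEIGHTS = {
    "amount_spike": 2.0,
    "new_device": 1.5,
    "gift_card": 1.0,
}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fraud_score(fired_signals: list[str]) -> float:
    """Prior log-odds plus the weights of the signals that fired, through a sigmoid."""
    prior_log_odds = math.log(PRIOR_FRAUD_RATE / (1.0 - PRIOR_FRAUD_RATE))
    return sigmoid(prior_log_odds + sum(SIGNAL_WEIGHTS[s] for s in fired_signals))


def ranked_reasons(fired_signals: list[str]) -> list[str]:
    """Order fired signals by log-odds contribution, largest first."""
    return sorted(fired_signals, key=lambda s: SIGNAL_WEIGHTS[s], reverse=True)


baseline = fraud_score([])
risky = fraud_score(["amount_spike", "new_device"])
```

Because the weights add in log-odds space, each signal's weight is also its contribution to the explanation, which is what makes the ranking in ranked_reasons meaningful.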

What Makes It Different

Explainability first. Every flagged transaction shows exactly which signals fired and how much each contributed, not just a number. A reviewer sees "$1,275 -> 42× this card's median spend · Gift-card purchase · First time this card uses this device" and can make a confident decision in seconds.

The cost-aware threshold slider. Reviewers can tune the tradeoff between false positives and missed fraud in real time. Moving toward Miss cost flags more transactions; toward FP cost flags fewer. The queue updates live.
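
One standard way to turn two costs into a flagging threshold, assuming the score is a calibrated probability, is the expected-cost rule: flag when p × miss_cost exceeds (1 − p) × fp_cost, which is the same as p > fp_cost / (fp_cost + miss_cost). The sketch below uses that rule; whether the slider is wired exactly this way is an assumption, and the function names are illustrative.

```python
def flag_threshold(fp_cost: float, miss_cost: float) -> float:
    """Score above which flagging has lower expected cost than letting it pass."""
    if fp_cost <= 0 or miss_cost <= 0:
        raise ValueError("costs must be positive")
    return fp_cost / (fp_cost + miss_cost)


def review_queue(scores: dict[str, float], fp_cost: float, miss_cost: float) -> list[str]:
    """Return the ids of transactions whose score exceeds the cost-derived threshold."""
    threshold = flag_threshold(fp_cost, miss_cost)
    return [tx_id for tx_id, p in scores.items() if p > threshold]


# Hypothetical scores for three transactions.
scores = {"t1": 0.87, "t2": 0.30, "t3": 0.10}

balanced = review_queue(scores, fp_cost=1, miss_cost=1)    # threshold 0.5
miss_heavy = review_queue(scores, fp_cost=1, miss_cost=4)  # threshold 0.2
fp_heavy = review_queue(scores, fp_cost=9, miss_cost=1)    # threshold 0.9
```

Raising the miss cost lowers the threshold and grows the queue, while raising the FP cost shrinks it, matching the slider behavior described above.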

Coordinated attack detection. Merchant burst transactions are linked visually: the reviewer sees "QuickPay Online was hit by 7 different cards in 70 minutes" and can approve or dismiss the entire cluster in one action. This cross-card signal is invisible to per-card detectors.
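
A merchant burst can be found by counting distinct cards per merchant inside a sliding time window. The following is a minimal pure-Python sketch; the 90-minute window, the five-card minimum, and the tuple layout (merchant, card_id, timestamp) are assumptions chosen for the example, not the project's actual parameters.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_merchant_bursts(transactions, window=timedelta(minutes=90), min_cards=5):
    """Map each bursting merchant to the set of cards involved in a burst.

    A burst is any time window of length `window` in which at least
    `min_cards` distinct cards transact at the same merchant.
    """
    by_merchant = defaultdict(list)
    for merchant, card_id, ts in transactions:
        by_merchant[merchant].append((ts, card_id))

    bursts = {}
    for merchant, events in by_merchant.items():
        events.sort()  # chronological order
        start = 0
        for end in range(len(events)):
            # Advance the left edge until the window spans at most `window`.
            while events[end][0] - events[start][0] > window:
                start += 1
            cards = {card for _, card in events[start:end + 1]}
            if len(cards) >= min_cards:
                bursts.setdefault(merchant, set()).update(cards)
    return bursts


# Toy data: seven cards hit one merchant ten minutes apart; another merchant is quiet.
t0 = datetime(2026, 5, 30, 3, 0)
transactions = [
    ("QuickPay Online", f"card{i}", t0 + timedelta(minutes=10 * i)) for i in range(7)
]
transactions += [
    ("Corner Cafe", "card0", t0),
    ("Corner Cafe", "card1", t0 + timedelta(hours=5)),
]
bursts = find_merchant_bursts(transactions)
```

Returning the full set of cards per merchant is what allows the cluster to be approved or dismissed as one unit.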

Gemini-powered context layer. For merchant burst patterns, Google Gemini silently assesses whether the burst is plausible given the merchant type and time of day. A Tim Hortons receiving $400 transactions at 3 AM is highly implausible; Amazon at 3 AM is normal. This contextual signal adjusts the effective score behind the scenes, making the queue smarter without adding UI complexity.

Session learning. Every reviewer decision feeds back into the detection thresholds in real time — the system recalibrates per-pattern confidence based on what the reviewer confirms and dismisses.

How We Built It

We split the work cleanly: one teammate owned the Python detection pipeline (pandas, NumPy, FastAPI), the other owned the React frontend (Vite, Tailwind, Zustand). We agreed on a strict JSON contract in hour one and worked in parallel from hour two onward.

The detection pipeline builds per-card baselines from the full transaction history, then scores each transaction by summing signal weights through a sigmoid. The frontend consumes the scored JSON, never touches detection logic, and handles all review state in memory with Zustand.

Challenges

The AliExpress/Spotify trap. 60 transactions in the dataset are cross-border but entirely legitimate (AliExpress ships from China, Spotify bills from Sweden). Naively flagging merchant_country ≠ cardholder_country would have tanked our precision with 60 false positives. We learned to treat geographic mismatch as a combining signal, never a standalone flag.
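
The "combining signal" idea can be expressed as a small filter: geographic mismatch only counts when at least one other signal fired alongside it. This is a sketch under that reading; the signal name geo_mismatch and the function name are hypothetical.

```python
from __future__ import annotations

GEO_MISMATCH = "geo_mismatch"  # hypothetical signal name


def effective_signals(fired: set[str]) -> set[str]:
    """Drop geo_mismatch when it is the only signal that fired."""
    others = fired - {GEO_MISMATCH}
    if not others:
        # Cross-border alone (e.g. a legitimate subscription) is not a flag.
        return set()
    return set(fired)


alone = effective_signals({"geo_mismatch"})
combined = effective_signals({"geo_mismatch", "amount_spike"})
unrelated = effective_signals({"amount_spike"})
```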

Unrelated git histories. We initialized our repos independently and hit "fatal: refusing to merge unrelated histories" mid-hackathon. Since our code lived in /frontend and /backend with no file overlap, passing --allow-unrelated-histories merged cleanly, but it cost us 20 minutes we didn't have.

Windows encoding. The scoring pipeline crashed on Windows with a UnicodeEncodeError on the sigma character (σ) in our reason strings. One-line fix: pass encoding="utf-8" to pathlib's Path.write_text. Filed as a cross-platform reproducibility lesson.
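
For reference, a minimal sketch of the fix. Without an explicit encoding, Path.write_text falls back to the locale's preferred encoding, which on Windows is often cp1252 and cannot represent σ. The reason string and file name here are made up for the example.

```python
import tempfile
from pathlib import Path

# Hypothetical reason string containing a non-ASCII character.
reason = "Amount is 3.2σ above this card's usual spend"

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "reasons.txt"
    # Explicit UTF-8 makes the write behave the same on every platform.
    out_path.write_text(reason, encoding="utf-8")
    round_trip = out_path.read_text(encoding="utf-8")
```

Reading the file back with the same explicit encoding matters just as much, since the default on the reading side is locale-dependent too.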

What We Learned

Building explainability into a detection system from the start is fundamentally different from bolting it on after. When every signal has a human-readable label from day one, the UI writes itself. When it doesn't, you're reverse-engineering your own model.

We also learned that 24 hours is enough time to build something you'd genuinely use — if you make ruthless decisions about what not to build.

What's Next

Persistent reviewer profiles and cross-session learning via MongoDB
A/B testing different threshold strategies against a labeled ground truth
Extending the Gemini context layer to card testing (time-of-day plausibility per card's normal behavior)
A merchant risk registry that accumulates across sessions

Built With

fastapi
gemini
mongodb
node.js
numpy
pandas
python
react
tailwind
vite
vultr
zustand

Updates

Mohamed-Ltf Loutfi started this project — May 30, 2026 11:52 PM EDT
