Inspiration
Every day, payments companies lose money not because fraud is undetectable — but because reviewing flagged transactions at scale is slow, opaque, and exhausting. Existing tools flag transactions and show a score. They rarely explain why. A human reviewer staring at "score: 0.87" with no context has to make a consequential decision blind.
We wanted to build the tool we'd actually want to use on a Monday morning: one that catches fraud and tells you exactly why it's suspicious, in plain language.
What We Built
Catching Madoff is a full-stack fraud detection and review platform built in 24 hours for the Valpay challenge. It ingests 1,000 credit card transactions, scores every one of them, and surfaces a human-readable review queue where an analyst can triage flagged transactions with full context. No SQL, no guesswork.
The system detects four fraud patterns:
- Card testing — rapid micro-transactions on a single card in a tight window, a stolen card being tested before a big purchase
- Amount spike — a transaction far above a card's normal spend baseline (we model each card's median and flag deviations above a learned threshold)
- Merchant burst — a single merchant hit by multiple different cards in a short window, the signature of a compromised merchant
- New device + spike — a large transaction on a device never seen for that card, a classic account takeover signal
Every flag comes with a human-readable explanation built from the signals that fired, weighted by their log-odds contribution to the final score. The score itself is a calibrated probability:
$$score = \sigma\left(\log\frac{0.07}{0.93} + \sum_i w_i\right)$$
where \( w_i \) are the log-odds weights of each signal and \( \sigma \) is the sigmoid function.
What Makes It Different
Explainability first. Every flagged transaction shows exactly which signals fired and how much each contributed, not just a number. A reviewer sees "$1,275 -> 42× this card's median spend · Gift-card purchase · First time this card uses this device" and can make a confident decision in seconds.
The cost-aware threshold slider. Reviewers can tune the tradeoff between false positives and missed fraud in real time. Moving toward Miss cost flags more transactions; toward FP cost flags fewer. The queue updates live.
Coordinated attack detection. Merchant burst transactions are linked visually the reviewer sees "QuickPay Online was hit by 7 different cards in 70 minutes" and can approve or dismiss the entire cluster in one action. This cross-card signal is invisible to per-card detectors.
Gemini-powered context layer. For merchant burst patterns, Google Gemini silently assesses whether the burst is plausible given the merchant type and time of day. A Tim Hortons receiving $400 transactions at 3 AM is structurally impossible; Amazon at 3 AM is normal. This contextual signal adjusts the effective score invisibly, making the queue smarter without adding UI complexity.
Session learning. Every reviewer decision feeds back into the detection thresholds in real time — the system recalibrates per-pattern confidence based on what the reviewer confirms and dismisses.
How We Built It
We split the work cleanly: one teammate owned the Python detection pipeline (pandas, NumPy, FastAPI), the other owned the React frontend (Vite, Tailwind, Zustand). We agreed on a strict JSON contract on hour one and worked in parallel from hour 2 onward.
The detection pipeline builds per-card baselines from the full transaction history, then scores each transaction by summing signal weights through a sigmoid. The frontend consumes the scored JSON, never touches detection logic, and handles all review state in memory with Zustand.
Challenges
The AliExpress/Spotify trap. 60 transactions in the dataset are cross-border but entirely legitimate (AliExpress ships from China, Spotify bills from Sweden). Naively flagging merchant_country ≠ cardholder_country would have tanked our precision with 60 false positives. We learned to treat geographic mismatch as a combining signal, never a standalone flag.
Unrelated git histories. We initialized our repos independently and hit fatal: refusing to merge unrelated histories mid-hackathon. Since our code lived in /frontend and /backend with no file overlap, --allow-unrelated-histories merged cleanly — but it cost us 20 minutes we didn't have.
Windows encoding. The scoring pipeline crashed on Windows with a UnicodeEncodeError on the sigma character (σ) in our reason strings. One-line fix: encoding="utf-8" on pathlib.write_text. Filed as a cross-platform reproducibility lesson.
What We Learned
Building explainability into a detection system from the start is fundamentally different from bolting it on after. When every signal has a human-readable label from day one, the UI writes itself. When it doesn't, you're reverse-engineering your own model.
We also learned that 24 hours is enough time to build something you'd genuinely use — if you make ruthless decisions about what not to build.
What's Next
- Persistent reviewer profiles and cross-session learning via MongoDB
- A/B testing different threshold strategies against a labeled ground truth
- Extending the Gemini context layer to card testing (time-of-day plausibility per card's normal behavior)
- A merchant risk registry that accumulates across sessions
Log in or sign up for Devpost to join the conversation.