Inspiration

Glassdoor is rumor. Levels.fyi is crowdsourced. But there's a third source almost nobody uses: the US Department of Labor. Every company that sponsors an H-1B or green card must legally disclose the actual salary it pays, by job title and worksite, in PERM and LCA filings. That's millions of federally-verified salary records, public, but buried in giant, bot-blocked spreadsheets no human reads. We wanted to turn that public-but-inaccessible data into something queryable, not just by people, but by the AI agents that will increasingly negotiate, recruit, and advise on our behalf.

What it does

WAGE.md ingests DOL PERM + H-1B LCA disclosures, normalizes job titles to a canonical role/level taxonomy, and publishes a public, cited salary index. It's driven by three autonomous agents (Gemini function calling):

  • Scout :- keeps the corpus fresh: decides which source to poll next and flags anomalies.
  • Investigator :- when an event like a mass layoff appears, it verifies the event against the data, decides if it's material, and republishes the affected company's brief to cited.md on its own.
  • Counselor :- takes your job offer and writes a fully-cited negotiation report: looks up your exact role/level/location, broadens the search when the sample is thin, compares peers, checks layoff context, and recommends a specific ask, every number traced to a real DOL case number.

Ask about a company we haven't loaded and the Counselor fetches it on the fly, it resolves the legal employer name (e.g. "Facebook" → "Meta Inc"), pulls its filings, and answers. A self-extending corpus.

How we built it

  • Nimble fetches the Akamai-bot-blocked DOL and California WARN files (returned base64-encoded), which we decode, filter, and load.
  • ClickHouse Cloud stores it, a filings table plus an aggregating materialized view for sub-100ms percentile queries: ~19,000 real filings across 20+ companies, plus 1,428 CA WARN layoff notices.
  • Gemini 2.5 normalizes raw job titles into role families (SWE/PM/ML/Data/…) and levels L1–L7 using company-specific ladder knowledge.
  • The agents share one generic run_agent() loop (Gemini function calling, a hard iteration cap), specialized into Scout/Investigator/Counselor by their tools and prompts.
  • Senso publishes each company brief to a permanent, public cited.md URL.
  • Lapdog (via ddtrace) captures every iteration, tool call, model call, and query as a deliberately-named span, the trace tree is the visible proof of agency.
  • Next.js 14 + Tailwind front end, with a FastAPI middleware between the app and ClickHouse/agents.

Challenges we ran into

  • The data fights back. DOL and EDD are Akamai bot-blocked, plain requests 403 even with browser headers. Nimble was the key that got us in.
  • The files are enormous, so we filter to the companies in scope and made the corpus self-extend on demand.
  • ClickHouse is append-only and its materialized view only fires on insert, so we reload via a staging-table swap that never risks the live data.
  • Keeping the index honest. The whole thesis is "every number cites a case," so we deliberately refused to publish any fabricated salary numbers, even for demos.
  • A pile of real papercuts: Gemini rate-limit spikes during batch normalization, a Next.js route literally named index crashing the renderer, and per-company publishing so each brief gets its own cited.md URL.

Accomplishments we're proud of

  • Three genuinely autonomous agents, end-to-end, on real federal data, not a pipeline.
  • Live, public cited.md briefs where every claim links to a DOL case number.
  • A self-extending corpus, ask about any company and the agent goes and gets it.
  • Five sponsor tools each doing real, load-bearing work.

What we learned

"Agentic, not a pipeline" is a real constraint: the moment you hard-code a sequence, you've lost the point. Letting Gemini choose its tools, and tracing those choices in Lapdog, is what makes the autonomy believable. And the hard part of "verified data" isn't the model; it's the plumbing: bot-blocked sources, giant files, append-only stores, and the discipline to never invent a number.

What's next

  • A continuous autopilot (Scout → Investigator on a schedule).
  • More WARN states (WA, NY) and more quarters for deeper history.
  • An x402-metered API so other agents can query WAGE.md programmatically.

Built With

Share this project:

Updates