Labelbox

Software Development

San Francisco, California 36,104 followers

The data factory for leading AI teams

About us

Labelbox is the data factory for leading AI labs and AI-powered enterprises. Innovate faster using Labelbox’s on-demand expert labeling services and unified software to deliver high-quality, frontier data with control and speed.

Website
https://labelbox.com/
Industry
Software Development
Company size
51-200 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2018

Locations

Employees at Labelbox

Updates

  • This week, we had the pleasure of hosting 50+ researchers and builders from leading AI companies to meet, talk and socialize (MTS 😎) at Labelbox HQ. Huge thanks to Dwarkesh Patel, Sholto Douglas (Anthropic), Mo Bavarian (OpenAI), and Melvin Johnson (DeepMind) for leading our fireside chat on scaling RL and the pursuit of AGI.

  • 🏆 Forbes’ 2026 list of America’s Best Startup Employers is out, and we’re proud to see Labelbox on the list. We’re committed to enabling the next generation of AI by powering the data and evaluation for the world’s most advanced teams. Recognition like this reflects the people building that mission every day. See the full list: https://bit.ly/4u8CumB

  • Voice agents are evolving from rigid turn-based designs toward continuous, natural conversation, enabling streaming comprehension and generation at the same time. However, most existing benchmarks are either turn-based or latency-focused and do not directly test whether models can maintain reasoning when users interrupt or update objectives mid-utterance. We introduce EchoChain 🔊, a novel benchmark for evaluating reasoning under pressure in full-duplex dialogue. Key findings:
    - Full-duplex models often fail to properly integrate interruption information, sometimes ignoring the interruption entirely.
    - A major weakness in today’s most advanced models is staying consistent when new input arrives while they are still responding.
    - In many cases, a model performs well when it can respond uninterrupted but struggles once it is interrupted mid-response.
    Check out the full analysis in our blog post, and stay tuned for the arXiv paper, which will be released in the coming days. https://lnkd.in/g3QkNZdb

  • Model safety is often judged by refusal rates on AI safety benchmarks. But what if our evaluations are flagging overtly negative or sensitive language rather than detecting genuine adversarial behavior? In our latest research, we show that when this language is removed, frontier models previously labeled as safe frequently fail, exposing a gap between how model safety is evaluated using benchmarks and how adversarial behavior occurs in the real world. Key findings:
    - AI safety benchmarks are over-reliant on explicit triggering language, provoking model refusals unrealistically.
    - Removing these cues significantly degrades safety performance, challenging prior assumptions about the robustness of safety evaluations.
    - We found evidence that both internal safety evaluations and safety alignment techniques use similar language patterns, further questioning the robustness of safety evaluations.
    - Our novel “intent laundering” framework serves as a strong diagnostic and red-teaming tool, exposing where model safety succeeds and where it fails.
    Read the full blog post for the complete analysis. https://lnkd.in/g84dywcR

  • Today, Dario (CEO of Anthropic) x Dwarkesh unpacked where AI is headed, from exponential scaling to what he calls a “country of geniuses in a data center”. A few key takeaways:
    - RL is about generalization, not specialization: Like early pretraining, the goal isn’t mastering one task, but building rich environments and broad data so models generalize across domains.
    - 1–3 years to a “country of geniuses”: Dario estimates ~50/50 odds that AI systems collectively match the output of an entire nation of top experts in a few years. Not a single superintelligence, but millions of genius-level systems in parallel.
    - Context as the next unlock: With context windows in the tens of millions of tokens, models could absorb months of workflow in one pass. The goal: steerable, human-aligned systems, as opposed to unchecked autonomous actors.
    - Software engineering goes end to end: Models are moving from writing code to executing full engineering cycles: setup, debugging, iteration. Bottlenecks now shift from syntax to judgment.
    - Diffusion will lag capability, briefly: Enterprise adoption slows even with rapid growth, but AI can onboard itself via docs, Slack threads, and codebases. By compressing the adoption curve, trillions in AI-driven revenue by 2030 becomes realistic.
    Excited to be featured in this conversation, showcasing how we help leading AI teams build high-fidelity RL environments and tighten the iteration loop so models learn from the most informative experiences.

  • We’re excited to share that we’ve acquired Upcraft to bring AI agents to the heart of how we scale human expertise for frontier AI. Upcraft’s AI-powered automation strengthens Alignerr by helping us recruit, engage, and empower a global network of domain experts who train and evaluate the world’s most advanced models. As leading AI teams invest billions into post-training and reinforcement learning, expert-generated data has become the true bottleneck for injecting models with the taste and judgment that only deep human expertise can provide. A big welcome to Greg Caplan and the Upcraft team, and we look forward to building together. https://lnkd.in/g4rjRNeA

  • Elon x Dwarkesh x John Collison from Stripe just went live. Their almost three-hour chat (over some Guinness 🍻) dives into what actually limits the next phase of AI and how Elon plans to break through. A few takeaways from this must-watch episode:
    - Space as the next data center: Solar power in orbit is roughly five times more effective than on Earth. Within thirty to thirty-six months, Musk believes space could become the most economically viable location for AI compute, with Starship launching massive power and compute capacity into orbit.
    - Humanoid robots as the economic unlock: Optimus could be the ultimate productivity multiplier, potentially expanding the global economy by orders of magnitude. The hardest problem is hands. The endgame is robots that eventually build robots.
    - Power as the next bottleneck: Electricity production outside China is flat while compute demand is exploding. Musk says the true scaling wall for AI on Earth is utilities, not just models.
    - Debuggability as a safety requirement: Tools that show where a model’s reasoning went wrong, trace the origin of errors, or detect potential deception will be essential as AI grows more capable.
    - Efficiency as an existential issue: Interest on national debt now exceeds the military budget. Musk argues that massive productivity gains from AI and robotics are not optional. They are existential.
    We’re excited to be featured in the conversation, helping leading AI teams scale high-quality robotics and reinforcement learning data so their models learn from the right experiences and reach their full potential.

  • A few research takeaways from NeurIPS 2025, pointing toward a 2026 focused on rigorous evaluation and continually learning AI systems:
    - Evaluation moves to the core: Data contamination, shortcut learning, and unfaithful benchmarks increasingly blur the line between genuine capability gains and test-data overfitting. Designing tasks that faithfully target underlying capabilities is now a first-order research problem and opportunity.
    - Agents everywhere: The field is moving beyond static foundation models toward interactive agents, with reinforcement learning re-emerging as infrastructure for continual, experience-driven improvement at inference time.
    - From inflection to consolidation: Expect benchmarks that deliberately surface failure modes, alongside agentic systems that learn across multi-turn interaction in complex, dynamic environments.
    At Labelbox, these themes directly shape our work. We’re building high-signal, contamination-resistant datasets and capability-focused evaluations to more faithfully measure the performance of frontier AI systems and uncover their failure modes. https://lnkd.in/gsjcn_dK

  • Our Labelbox holiday party this week at the beautifully designed Hedge Coffee was full of great vibes and even greater people. As the team took turns on the turntables with espresso martinis in hand, we celebrated everything we’ve built together this year, while getting energized for a big year ahead.

