OCR and the Bitter Lesson
Recently released open-source OCR models are starting to replace expert-based OCR systems. This post walks through an exercise in evaluating specialist and generalist OCR agents.
Notes on Building AI Systems
Speculative decoding speeds up LLM generation by letting a system propose several “draft” tokens at once, then having the target model verify them in a single forward pass. The usual question is: where do we get good drafts cheaply? In this post, we explore queue speculation (QueueSpec): draft tokens come from a smaller model that runs while a request is queuing, so verification can start immediately once the request is serviced. At Doubleword we combine speculative decoding techniques like this with other throughput-specific optimizations to deliver cheaper inference at scale, at the cost of some end-to-end latency. If you want to get started with some free credits, sign up here: Doubleword Platform
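The draft-then-verify loop described above can be sketched in a few lines. This is a toy greedy-acceptance version, not the QueueSpec implementation: `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models, each mapping a token sequence to its predicted next token, and the verify phase queries the target per position where a real system would score all draft positions in one forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k draft tokens, then accept the longest prefix the target agrees with."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    drafts = []
    seq = list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        drafts.append(t)
        seq.append(t)

    # Verify phase: accept draft tokens while the target model agrees.
    accepted = []
    seq = list(prefix)
    for t in drafts:
        if target_next(seq) == t:
            accepted.append(t)
            seq.append(t)
        else:
            # First disagreement: emit the target's own token and stop.
            accepted.append(target_next(seq))
            break
    else:
        # Every draft was accepted: the target still contributes one token.
        accepted.append(target_next(seq))
    return accepted
```

With a good draft model most proposals are accepted, so one target forward pass yields several tokens; with a poor one the loop degrades gracefully to emitting a single target token per step.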
Today we’re reducing the price of our highest-intelligence model, Qwen3-235B-A22B-Instruct.
Building a content discovery system using parallel primitives and BST-based ranking with LLM comparisons
A lock-free binary search tree optimized for expensive async comparisons, with a threaded linked list for O(1) sorted iteration
High-throughput inference of LLMs using JIT weight offloading to optimize the KV cache.
Applying parallel primitives to search and rank 2.4 million arXiv papers using LLM judgments
Exploring coordination patterns from parallel computing for multi-agent LLM systems
Researchers face an impossible task in staying up to date within their field. In AI and Machine Learning alone, arXiv publishes 50-100 new papers daily. Multiply that across computer science, physics, biology, and other domains, and you're looking at hundreds of potentially relevant papers flooding in every single day.
The initial wave of Generative AI adoption focused on augmenting human work: chatbots that help developers write cleaner code, assistants that polish our emails, and tools that speed up content creation. These productivity enhancements have proven immensely valuable, as almost every individual has a version of ChatGPT open to assist them during their day. But they represent just the beginning of what's possible with AI.
This episode explores how speculative decoding becomes increasingly valuable in high-throughput, batched inference scenarios, particularly with sparse MoE architectures.
This technical guide explores model parallelism, a critical technique for deploying large language models that exceed single GPU memory capacity.