ZenMux's automatic routing feature selects the most cost-effective and high-performing AI models based on your query.
LLaDA2.0-flash-CAP is an enhanced version of LLaDA2.0-flash, which significantly improves inference efficiency by incorporating Confidence-Aware Parallelism (CAP) training technology. Based on a 100B total parameter Mixture of Experts (MoE) diffusion architecture, this model achieves faster parallel decoding speeds while maintaining exceptional performance across various benchmark tests.
Pricing: The model's official pricing is (GLM-4.7: Input $0.28 ~ $0.57, Cached Input $0.057~$0.11, Output $1.14~$2.27). Our platform is currently running a limited-time promotion, during which you will receive a discount on the official price. GLM-4.7 is Zhipu’s latest flagship model. Tailored for agentic coding scenarios, GLM-4.7 strengthens coding capabilities, long-horizon task planning, and tool collaboration, and delivers leading performance among open-source models on the latest leaderboards of multiple public benchmarks. Its general capabilities have also improved, with responses that are more concise and natural and writing that feels more immersive. When executing complex agent tasks and invoking tools, it follows instructions more reliably, while the visual quality of artifacts and agentic coding front ends—as well as long-horizon task completion efficiency—are further enhanced.
MiniMax-M2.1 is a lightweight, state-of-the-art large language model optimized for coding, agentic workflows, and modern application development. With only 10 billion activated parameters, it delivers a major jump in real-world capability while maintaining exceptional latency, scalability, and cost efficiency.
Compared to its predecessor, M2.1 delivers cleaner, more concise outputs and faster perceived response times. It shows leading multilingual coding performance across major systems and application languages, achieving 49.4% on Multi-SWE-Bench and 72.5% on SWE-Bench Multilingual, and serves as a versatile agent “brain” for IDEs, coding tools, and general-purpose assistance.
To avoid degrading this model's performance, MiniMax highly recommends preserving reasoning between turns.
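A minimal sketch of what preserving reasoning can look like in practice, assuming an OpenAI-compatible chat endpoint; the base URL, model slug, and the exact reasoning fields the provider returns are all assumptions here:

```python
from openai import OpenAI

# Hedged sketch: keep the assistant's reasoning in the conversation history.
# Base URL, API key handling, and model slug are placeholders.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

history = [{"role": "user", "content": "Fix the failing test in tests/test_ci.py."}]
reply = client.chat.completions.create(model="minimax-m2.1", messages=history)

# Append the full assistant message (including any reasoning fields the
# provider returns) rather than only its text content, so later turns
# still see the earlier reasoning.
history.append(reply.choices[0].message.model_dump(exclude_none=True))
history.append({"role": "user", "content": "Now add a regression test."})
reply = client.chat.completions.create(model="minimax-m2.1", messages=history)
print(reply.choices[0].message.content)
```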
An all-new model purpose-built and optimized for multimodal agent scenarios, offering stronger agent capabilities, upgraded multimodal understanding, and more flexible context management.
Gemini 3 Flash Preview is a low-latency model in the Gemini 3 family, optimized for fast, high-throughput inference. It retains the core multimodal and reasoning capabilities of Gemini 3 while prioritizing responsiveness and execution efficiency. Built on the same architecture as Gemini 3 Pro, Gemini 3 Flash Preview supports native multimodal inputs—including text, images, and audio—and incorporates the improved reasoning and long-context handling introduced in the Gemini 3 generation. It is designed for real-time and scalable workloads where latency and cost efficiency are primary considerations.
Note: This model is free to use in Studio Chat, but it comes with rate limits and may be temporarily unavailable. For stable, high-availability use, please choose a standard-priced model. Gemini 3 Flash Preview is a low-latency model in the Gemini 3 family, optimized for fast, high-throughput inference. It retains the core multimodal and reasoning capabilities of Gemini 3 while prioritizing responsiveness and execution efficiency.
Built on the same architecture as Gemini 3 Pro, Gemini 3 Flash Preview supports native multimodal inputs—including text, images, and audio—and incorporates the improved reasoning and long-context handling introduced in the Gemini 3 generation. It is designed for real-time and scalable workloads where latency and cost efficiency are primary considerations.
Limited time free. MiMo-V2-Flash is an open-source foundation language model developed by Xiaomi. It is a Mixture-of-Experts model with 309B total parameters and 15B active parameters, adopting a hybrid attention architecture. MiMo-V2-Flash supports a hybrid-thinking toggle and a 256K context window, and excels at reasoning, coding, and agent scenarios. On SWE-bench Verified and SWE-bench Multilingual, MiMo-V2-Flash ranks as the #1 open-source model globally, delivering performance comparable to Claude Sonnet 4.5 while costing only about 3.5% as much.
MiMo-V2-Flash is an open-source foundation language model developed by Xiaomi. It is a Mixture-of-Experts model with 309B total parameters and 15B active parameters, adopting a hybrid attention architecture. MiMo-V2-Flash supports a hybrid-thinking toggle and a 256K context window, and excels at reasoning, coding, and agent scenarios. On SWE-bench Verified and SWE-bench Multilingual, MiMo-V2-Flash ranks as the #1 open-source model globally, delivering performance comparable to Claude Sonnet 4.5 while costing only about 3.5% as much.
GPT-5.2 Pro is OpenAI’s most advanced model, offering major improvements in agentic coding and long context performance over GPT-5 Pro. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy in high-stakes use cases. It supports test-time routing features and advanced prompt understanding, including user-specified intent like "think hard about this." Improvements include reductions in hallucination, sycophancy, and better performance in coding, writing, and health-related tasks.
GPT-5.2 is the newest frontier-grade model in the GPT-5 family, outperforming GPT-5.1 in agentic capabilities and long-context handling. It uses adaptive reasoning to dynamically allocate compute—responding quickly to straightforward questions while applying deeper analysis to more complex tasks. Designed for broad task coverage, GPT-5.2 delivers consistent improvements across math, coding, science, and tool-calling workloads, producing more coherent long-form responses and more reliable tool use.
GPT-5.2 Chat (AKA Instant) is the fast, lightweight member of the 5.2 family, optimized for low-latency chat while retaining strong general intelligence. It uses adaptive reasoning to selectively “think” on harder queries, improving accuracy on math, coding, and multi-step tasks without slowing down typical conversations. The model is warmer and more conversational by default, with better instruction following and more stable short-form reasoning. GPT-5.2 Chat is designed for high-throughput, interactive workloads where responsiveness and consistency matter more than deep deliberation.
GLM-4.6V represents a significant evolution of the GLM series into the multimodal domain. It features a 128k-token training context window and sets a new state-of-the-art in visual understanding accuracy for its parameter scale. It is the first model to natively integrate tool-calling capabilities into its visual architecture, bridging the gap from visual perception to executable actions. This makes it a unified technical foundation for multimodal agents in real-world business scenarios. Pricing: The model's official pricing is (GLM-4.6V: Input $0.3, Cached Input $0.05, Output $0.9). Our platform is currently running a limited-time promotion, during which you will receive a discount on the official price.
GLM-4.6V-Flash is the free version of GLM-4.6V, representing a significant iteration in the GLM series for multimodal capabilities. It supports toggling reasoning modes, with a training-time context window expanded to 128k tokens. Achieving state-of-the-art (SOTA) visual understanding accuracy at its parameter scale, it is the first visual model to natively integrate Function Call capability into its architecture. This establishes a seamless pipeline from "visual perception" to "executable actions", offering a unified technical foundation for multimodal agents in real-world business scenarios.
GLM-4.6V-FlashX is the paid version, offering higher capacity and stability, and representing a significant iteration in the GLM series for multimodal capabilities. It supports toggling reasoning modes, with a training-time context window expanded to 128k tokens. Achieving state-of-the-art (SOTA) visual understanding accuracy at its parameter scale, it is the first visual model to natively integrate Function Call capability into its architecture. This establishes a seamless pipeline from "visual perception" to "executable actions", offering a unified technical foundation for multimodal agents in real-world business scenarios.
DeepSeek-V3.2 is a reasoning-first large language model released by DeepSeek as the official successor to V3.2-Exp. It is designed with a focus on agentic capabilities and integrated reasoning for tool-use scenarios.
If the request to the deepseek-reasoner model includes the tools parameter, the request will actually be processed using the deepseek-chat model.
The model introduces a new large-scale agent training data synthesis method covering over 1,800 environments and 85,000+ complex instructions. DeepSeek-V3.2 is the first model from DeepSeek to integrate thinking directly into tool-use, supporting both thinking and non-thinking modes during tool interactions.
According to DeepSeek's benchmarks, the model delivers performance comparable to GPT-5 level while balancing inference quality and output length. It is available via App, Web, and API, making it suitable for general-purpose daily use as well as complex agentic workflows.
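To make the tools-parameter note above concrete, here is a minimal sketch assuming an OpenAI-compatible DeepSeek endpoint; the tool definition and key handling are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

# A hypothetical tool definition, for demonstration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Because this request includes `tools`, it is served by deepseek-chat
# (non-thinking) even though deepseek-reasoner is the requested slug.
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message)
```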
Mistral Large 3 is a state-of-the-art, open-weight, general-purpose multimodal model with a granular Mixture-of-Experts architecture. It features 41B active parameters and 675B total parameters.
DeepSeek-V3.2 (Non-thinking Mode) is DeepSeek's latest production model, currently served under the deepseek-chat model slug, which is automatically updated as new versions are released.
DeepSeek-V3.2 is a reasoning-first large language model released by DeepSeek as the official successor to V3.2-Exp. It is designed with a focus on agentic capabilities and integrated reasoning for tool-use scenarios.
If the request to the deepseek-reasoner model includes the tools parameter, the request will actually be processed using the deepseek-chat model.
The model introduces a new large-scale agent training data synthesis method covering over 1,800 environments and 85,000+ complex instructions. DeepSeek-V3.2 is the first model from DeepSeek to integrate thinking directly into tool-use, supporting both thinking and non-thinking modes during tool interactions.
According to DeepSeek's benchmarks, the model delivers performance comparable to GPT-5 level while balancing inference quality and output length. It is available via App, Web, and API, making it suitable for general-purpose daily use as well as complex agentic workflows.
DeepSeek-V3.2 (Thinking Mode) is DeepSeek's latest production model, currently served under the deepseek-reasoner model slug, which is automatically updated as new versions are released.
DeepSeek-V3.2 is a reasoning-first large language model released by DeepSeek as the official successor to V3.2-Exp. It is designed with a focus on agentic capabilities and integrated reasoning for tool-use scenarios.
If the request to the deepseek-reasoner model includes the tools parameter, the request will actually be processed using the deepseek-chat model.
The model introduces a new large-scale agent training data synthesis method covering over 1,800 environments and 85,000+ complex instructions. DeepSeek-V3.2 is the first model from DeepSeek to integrate thinking directly into tool-use, supporting both thinking and non-thinking modes during tool interactions.
According to DeepSeek's benchmarks, the model delivers performance comparable to GPT-5 level while balancing inference quality and output length. It is available via App, Web, and API, making it suitable for general-purpose daily use as well as complex agentic workflows.
Claude Opus 4.5 is Anthropic's latest frontier reasoning model, purpose-built for complex software engineering, agentic workflows, and long-horizon computer use. It delivers strong multimodal capabilities, competitive performance on real-world coding and reasoning benchmarks, and improved robustness against prompt injection attacks. The model is designed to operate efficiently across varied effort levels, allowing developers to balance speed, depth, and token usage based on their specific task requirements—you can fine-tune token efficiency through the OpenRouter Verbosity parameter, which offers low, medium, and high settings. Beyond that, Opus 4.5 supports advanced tool use, extended context management, and coordinated multi-agent setups, making it ideal for autonomous research, debugging, multi-step planning, and spreadsheet or browser manipulation. Compared to previous Opus generations, it brings substantial improvements in structured reasoning, execution reliability, and alignment, while reducing token overhead and delivering more consistent performance on long-running tasks.
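As a sketch of the Verbosity setting mentioned above, assuming an OpenRouter-style chat-completions payload; the model slug and the exact field placement are assumptions:

```python
import requests

payload = {
    "model": "anthropic/claude-opus-4.5",  # illustrative slug
    "messages": [{"role": "user", "content": "Summarize this design doc."}],
    "verbosity": "low",  # assumed field; one of "low", "medium", "high"
}
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```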
Nano Banana Pro is an AI image generation model that is a significant upgrade from its predecessor, built on Google's Gemini 3 Pro. It promises to move beyond simple pattern matching to a more reasoning-driven system with improved physics understanding, text rendering, and image consistency. Key features include faster processing, native 2K resolution, and the ability to edit existing images with greater control, aiming to produce more reliable and professional-grade results.
Grok 4.1 Fast is xAI's best agentic tool-calling model, shining in real-world use cases like customer support and deep research. It has a 2M-token context window.
Reasoning can be enabled or disabled using the reasoning-enabled parameter in the API, as in the sketch below.
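A minimal sketch of that toggle, assuming an OpenAI-compatible endpoint; the base URL, model slug, and the shape of the reasoning field are assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="grok-4.1-fast",  # illustrative slug
    messages=[{"role": "user", "content": "Triage this support ticket: ..."}],
    # Assumed field shape; set enabled=False for the non-reasoning flavor.
    extra_body={"reasoning": {"enabled": True}},
)
print(response.choices[0].message.content)
```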
Grok 4.1 Fast Non Reasoning is the non-reasoning variant of xAI's best agentic tool-calling model, which shines in real-world use cases like customer support and deep research. It has a 2M-token context window.
Gemini 3 Pro is the next generation in the Gemini series, a suite of highly capable, natively multimodal reasoning models. Gemini 3 Pro is now Google's most advanced model for complex tasks; it can comprehend vast datasets and tackle challenging problems drawing on different information sources, including text, audio, images, video, and entire code repositories.
GPT-5.1 is the latest frontier-grade model in the GPT-5 series, offering stronger general-purpose reasoning, improved instruction adherence, and a more natural conversational style compared to GPT-5. It uses adaptive reasoning to allocate computation dynamically, responding quickly to simple queries while spending more depth on complex tasks. The model produces clearer, more grounded explanations with reduced jargon, making it easier to follow even on technical or multi-step problems.
Built for broad task coverage, GPT-5.1 delivers consistent gains across math, coding, and structured analysis workloads, with more coherent long-form answers and improved tool-use reliability. It also features refined conversational alignment, enabling warmer, more intuitive responses without compromising precision. GPT-5.1 serves as the primary full-capability successor to GPT-5.
GPT-5.1 Chat (AKA Instant) is the fast, lightweight member of the 5.1 family, optimized for low-latency chat while retaining strong general intelligence. It uses adaptive reasoning to selectively “think” on harder queries, improving accuracy on math, coding, and multi-step tasks without slowing down typical conversations. The model is warmer and more conversational by default, with better instruction following and more stable short-form reasoning. GPT-5.1 Chat is designed for high-throughput, interactive workloads where responsiveness and consistency matter more than deep deliberation.
GPT-5.1-Codex is a specialized version of GPT-5.1 optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review. Compared to GPT-5.1, Codex is more steerable, adheres closely to developer instructions, and produces cleaner, higher-quality code outputs.
Codex integrates into developer environments including the CLI, IDE extensions, GitHub, and cloud tasks. It adapts reasoning effort dynamically—providing fast responses for small tasks while sustaining extended multi-hour runs for large projects. The model is trained to perform structured code reviews, catching critical flaws by reasoning over dependencies and validating behavior against tests. It also supports multimodal inputs such as images or screenshots for UI development and integrates tool use for search, dependency installation, and environment setup. Codex is intended specifically for agentic coding applications.
GPT-5.1-Codex-Mini is a smaller and faster version of GPT-5.1-Codex.
The new-generation Wenxin model, Wenxin 5.0, is a natively multimodal large model. It adopts a native unified multimodal modeling approach to jointly model text, images, audio, and video, providing comprehensive multimodal capabilities. Wenxin 5.0's core abilities have been comprehensively upgraded, and it performs excellently on benchmark datasets, with particularly strong results in multimodal understanding, instruction following, creative writing, factuality, agent planning, and tool use.
Doubao-Seed-Code has been deeply optimized for Agentic Programming tasks, delivering exceptional performance across multiple authoritative benchmarks — including Terminal Bench, SWE-Bench-Verified-Openhands, and Multi-SWE-Bench-Flash-Openhands — outperforming domestic counterparts and supporting a context window of up to 256k tokens.
A thinking model with general agentic and reasoning capabilities, specializing in deep reasoning tasks.
Context length: 256k. A high-speed version of kimi-k2-thinking, suitable for scenarios requiring both deep reasoning and extremely fast responses.
Please note: This is a toggle thinking model. To enable "thinking" mode, you need to set "enable_thinking=True" (see the sketch below).
This is a preview version of the Max model in the Tongyi Qianwen 3 series, achieving an effective integration of thinking and non-thinking modes. In thinking mode, there is a significant enhancement in capabilities such as intelligent agent programming, common-sense reasoning, and reasoning across mathematics, science, and general domains.
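A minimal sketch of the enable_thinking toggle from the note above, assuming an OpenAI-compatible endpoint; the base URL and model slug are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="qwen3-max-preview",  # illustrative slug
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # The parameter named in the note above, passed as an extra body field.
    extra_body={"enable_thinking": True},
)
print(response.choices[0].message.content)
```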
Ming-flash-omni Preview is a multimodal AI model that supports text, speech, image, and video inputs while generating text, speech, and image outputs. Built on a sparse 100-billion-parameter Mixture-of-Experts architecture with 6 billion active parameters per token, it achieves state-of-the-art speech recognition across 12 ContextASR benchmarks and delivers significant improvements for 15 Chinese dialects. The model introduces high-fidelity text rendering in image generation, enhanced scene consistency, and superior identity preservation during editing. Its innovative generative segmentation capability unifies segmentation and editing into a semantics-preserving framework, achieving a 0.90 score on GenEval for fine-grained spatial control. A Dual-Balanced Routing Mechanism ensures stable cross-modal training through auxiliary load balancing loss and modality-level router bias updates. Compared to Ming-lite-omni v1.5, Ming-flash-omni Preview offers substantial advancements in architecture efficiency, editing precision, and speech understanding, establishing itself as a highly competitive solution among leading multimodal models.
MiniMax-M2 is a compact, high-efficiency large language model optimized for end-to-end coding and agentic workflows. With 10 billion activated parameters (230 billion total), it delivers near-frontier intelligence across general reasoning, tool use, and multi-step task execution while maintaining low latency and deployment efficiency. The model excels in code generation, multi-file editing, compile-run-fix loops, and test-validated repair, showing strong results on SWE-Bench Verified, Multi-SWE-Bench, and Terminal-Bench. It also performs competitively in agentic evaluations such as BrowseComp and GAIA, effectively handling long-horizon planning, retrieval, and recovery from execution errors. Benchmarked by Artificial Analysis, MiniMax-M2 ranks among the top open-source models for composite intelligence, spanning mathematics, science, and instruction-following. Its small activation footprint enables fast inference, high concurrency, and improved unit economics, making it well-suited for large-scale agents, developer assistants, and reasoning-driven applications that require responsiveness and cost efficiency.
Limited time free. KAT-Coder-Pro V1 possesses advanced intelligent-agent capabilities such as multi-tool parallel invocation, enabling it to complete complex tasks autonomously with fewer interactions. It features stronger code comprehension and logical reasoning, delivering top-tier performance for AI coding.
Claude Haiku 4.5 is Anthropic’s fastest and most efficient model, delivering near-frontier intelligence at a fraction of the cost and latency of larger Claude models. Matching Claude Sonnet 4’s performance across reasoning, coding, and computer-use tasks, Haiku 4.5 brings frontier-level capability to real-time and high-volume applications.
It introduces extended thinking to the Haiku line, enabling controllable reasoning depth, summarized or interleaved thought output, and tool-assisted workflows with full support for coding, bash, web search, and computer-use tools. Scoring >73% on SWE-bench Verified, Haiku 4.5 ranks among the world's best coding models while maintaining exceptional responsiveness for sub-agents, parallelized execution, and scaled deployment.
Ring-1T is a trillion-parameter sparse mixture-of-experts (MoE) thinking model developed by inclusionAI. It adopts the Ling 2.0 architecture and is trained on the Ling-1T-base foundation model, which contains 1 trillion total parameters with 50 billion activated parameters, supporting a context window of up to 128K tokens. Building upon the preview version released at the end of September, Ring-1T has undergone continued scaling with large-scale verifiable reward reinforcement learning (RLVR) training, further unlocking the natural language reasoning capabilities of the trillion-parameter foundation model.
Ling-1T is a trillion-parameter sparse mixture-of-experts (MoE) model developed by inclusionAI, optimized for efficient and scalable reasoning. Featuring approximately 50 billion active parameters per token, it is pre-trained on over 20 trillion reasoning-dense tokens, supports a 128K context length, and utilizes an Evolutionary Chain-of-Thought (Evo-CoT) process to enhance its reasoning depth. The model achieves state-of-the-art performance across complex benchmarks, demonstrating strong capabilities in code generation, software development, and advanced mathematics. In addition to its core reasoning skills, Ling-1T possesses specialized abilities in front-end code generation—combining semantic understanding with visual aesthetics—and exhibits emergent agentic capabilities, such as proficient tool use with minimal instruction tuning. Its primary use cases span software engineering, professional mathematics, complex logical reasoning, and agent-based workflows that demand a balance of high performance and efficiency.
Gemini 2.5 Flash Image, a.k.a. "Nano Banana," is now generally available. It is a state-of-the-art image generation model with contextual understanding, capable of image generation, edits, and multi-turn conversations.
GPT-5 Pro is OpenAI’s most advanced model, offering major improvements in reasoning, code quality, and user experience. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy in high-stakes use cases. It supports test-time routing features and advanced prompt understanding, including user-specified intent like "think hard about this." Improvements include reductions in hallucination, sycophancy, and better performance in coding, writing, and health-related tasks.
GLM-4.6 is a flagship model from Zhipu with 355B total parameters and 32B active parameters. The model's context window has been expanded from 128K to 200K, enabling it to handle longer code and agent tasks. In programming capabilities, GLM-4.6's performance is comparable to Claude Sonnet 4 on public benchmarks and real-world programming tasks. The model supports tool calling during inference and features optimized search and tool-use performance within agent frameworks. Furthermore, enhancements have been made to its writing style, readability, role-playing, and cross-lingual task processing abilities.
Pricing: The model's official pricing is (GLM-4.6: Input $0.6, Cached Input $0.11, Output $2.2). Our platform is currently running a limited-time promotion, during which you will receive a discount on the official price.
Claude Sonnet 4.5 is Anthropic’s most advanced Sonnet model to date, optimized for real-world agents and coding workflows. It delivers state-of-the-art performance on coding benchmarks such as SWE-bench Verified, with improvements across system design, code security, and specification adherence. The model is designed for extended autonomous operation, maintaining task continuity across sessions and providing fact-based progress tracking.
Sonnet 4.5 also introduces stronger agentic capabilities, including improved tool orchestration, speculative parallel execution, and more efficient context and memory management. With enhanced context tracking and awareness of token usage across tool calls, it is particularly well-suited for multi-context and long-running workflows. Use cases span software engineering, cybersecurity, financial analysis, research agents, and other domains requiring sustained reasoning and tool use.
DeepSeek-V3.2-Exp is an experimental model version, serving as an intermediate step toward the next-generation architecture. Built on the foundation of V3.1-Terminus, it introduces the DeepSeek Sparse Attention (DSA) mechanism—a sparse attention mechanism designed to explore and validate the optimization of training and inference efficiency in long-context scenarios. This experimental version represents the team's continuous research on more efficient Transformer architectures, with a specific focus on improving computational efficiency when processing long text sequences. For the first time, DSA enables fine-grained sparse attention, which significantly enhances the efficiency of long-context training and inference while maintaining almost unchanged model output quality.
GPT-5-Codex is a specialized version of GPT-5 optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review. Compared to GPT-5, Codex is more steerable, adheres closely to developer instructions, and produces cleaner, higher-quality code outputs. Reasoning effort can be adjusted with the reasoning.effort parameter, as in the sketch below; see the docs for details.
Codex integrates into developer environments including the CLI, IDE extensions, GitHub, and cloud tasks. It adapts reasoning effort dynamically—providing fast responses for small tasks while sustaining extended multi-hour runs for large projects. The model is trained to perform structured code reviews, catching critical flaws by reasoning over dependencies and validating behavior against tests. It also supports multimodal inputs such as images or screenshots for UI development and integrates tool use for search, dependency installation, and environment setup. Codex is intended specifically for agentic coding applications.
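A sketch of adjusting reasoning effort, assuming the OpenAI Python SDK's Responses API; treat the slug and effort levels as assumptions rather than a verified contract:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

response = client.responses.create(
    model="gpt-5-codex",  # illustrative slug
    reasoning={"effort": "high"},  # deeper deliberation for a large refactor
    input="Refactor this module to remove the circular import: ...",
)
print(response.output_text)
```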
Qwen3-Max is an updated release built on the Qwen3 series, offering major improvements in reasoning, instruction following, multilingual support, and long-tail knowledge coverage compared to the January 2025 version. It delivers higher accuracy in math, coding, logic, and science tasks, follows complex instructions in Chinese and English more reliably, reduces hallucinations, and produces higher-quality responses for open-ended Q&A, writing, and conversation. The model supports over 100 languages with stronger translation and commonsense reasoning, and is optimized for retrieval-augmented generation (RAG) and tool calling, though it does not include a dedicated “thinking” mode.
The Qwen3 series VL models effectively integrate thinking and non-thinking modes, achieving world-leading performance in visual agent capabilities on public benchmark datasets such as OS World. This version features comprehensive upgrades in areas like visual coding, spatial perception, and multimodal reasoning, significantly enhancing visual perception and recognition abilities and supporting the understanding of ultra-long videos.
Grok 4 Fast is xAI's latest multimodal model with SOTA cost-efficiency and a 2M-token context window. It comes in two flavors: non-reasoning and reasoning. Read more about the model in xAI's news post. Reasoning can be enabled using the reasoning-enabled parameter in the API. Learn more in our docs.
Prompts and completions may be used by xAI or OpenRouter to improve future models.
Grok 4 Fast is xAI's latest multimodal model with SOTA cost-efficiency and a 2M-token context window. It comes in two flavors: non-reasoning and reasoning. Read more about the model in xAI's news post. Reasoning can be enabled using the reasoning-enabled parameter in the API. Learn more in our docs.
Prompts and completions may be used by xAI or OpenRouter to improve future models.
Ling-flash-2.0 is an open-source Mixture-of-Experts (MoE) language model developed under the Ling 2.0 architecture. It features 100 billion total parameters, with 6.1 billion activated during inference (4.8B non-embedding).
Trained on over 20 trillion tokens and refined with supervised fine-tuning and multi-stage reinforcement learning, the model demonstrates strong performance against dense models up to 40B parameters. It excels in complex reasoning, code generation, and frontend development.
The Wenxin Large Model X1.1 delivers significantly enhanced performance in question answering, tool invocation, agent capabilities, instruction following, logical reasoning, mathematical problem-solving, and coding tasks, with markedly improved factual accuracy. Its context window has been extended to 64K tokens, enabling longer inputs and dialogue histories, while maintaining response speed and improving the coherence of long-chain reasoning.
Kimi K2 0905 is the September update of Kimi K2 0711. It is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It supports long-context inference up to 256k tokens, extended from the previous 128k.
This update improves agentic coding with higher accuracy and better generalization across scaffolds, and enhances frontend coding with more aesthetic and functional outputs for web, 3D, and related tasks. Kimi K2 is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. It excels across coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) benchmarks. The model is trained with a novel stack incorporating the MuonClip optimizer for stable large-scale MoE training.
Ling-mini-2.0 is an open-source Mixture-of-Experts (MoE) large language model designed to balance strong task performance with high inference efficiency. It has 16B total parameters, with approximately 1.4B activated per token (about 789M non-embedding). Trained on over 20T tokens and refined via multi-stage supervised fine-tuning and reinforcement learning, it is reported to deliver strong results in complex reasoning and instruction following while keeping computational costs low. According to the upstream release, it reaches top-tier performance among sub-10B dense LLMs and in some cases matches or surpasses larger MoE models.
Ring-mini-2.0 is a Mixture-of-Experts (MoE) model oriented toward high-throughput inference and extensively optimized on the Ling 2.0 architecture. It uses 16B total parameters with approximately 1.4B activated per token and is reported to deliver comprehensive reasoning performance comparable to sub-10B dense LLMs. The model shows strong results on logical reasoning, code generation, and mathematical tasks, supports 128K context windows, and reports generation speeds of 300+ tokens per second.
Grok Code Fast 1 is a speedy and economical reasoning model that excels at agentic coding. With reasoning traces visible in the response, developers can steer Grok Code for high-quality workflows.
DeepSeek-V3.1 is a large hybrid reasoning model (671B parameters, 37B active) that supports both thinking and non-thinking modes via prompt templates. It extends the DeepSeek-V3 base with a two-phase long-context training process, reaching up to 128K tokens, and uses FP8 microscaling for efficient inference.
The model improves tool use, code generation, and reasoning efficiency, achieving performance comparable to DeepSeek-R1 on difficult benchmarks while responding more quickly. It supports structured tool calling, code agents, and search agents, making it suitable for research, coding, and agentic workflows.
It succeeds the DeepSeek V3-0324 model and performs well on a variety of tasks.
Doubao 1.6 is a visual deep-reasoning model featuring stronger general multimodal understanding and reasoning capabilities.
GPT-5 is OpenAI’s most advanced model, offering major improvements in reasoning, code quality, and user experience. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy in high-stakes use cases. It supports test-time routing features and advanced prompt understanding, including user-specified intent like "think hard about this." Improvements include reductions in hallucination, sycophancy, and better performance in coding, writing, and health-related tasks.
GPT-5 Chat is designed for advanced, natural, multimodal, and context-aware conversations for enterprise applications.
GPT-5 Mini is a compact version of GPT-5, designed to handle lighter-weight reasoning tasks. It provides the same instruction-following and safety-tuning benefits as GPT-5, but with reduced latency and cost. GPT-5 Mini is the successor to OpenAI's o4-mini model.
GPT-5-Nano is the smallest and fastest variant in the GPT-5 system, optimized for developer tools, rapid interactions, and ultra-low latency environments. While limited in reasoning depth compared to its larger counterparts, it retains key instruction-following and safety features. It is the successor to GPT-4.1-nano and offers a lightweight option for cost-sensitive or real-time applications.
Claude Opus 4.1 is an updated version of Anthropic’s flagship model, offering improved performance in coding, reasoning, and agentic tasks. It achieves 74.5% on SWE-bench Verified and shows notable gains in multi-file code refactoring, debugging precision, and detail-oriented reasoning. The model supports extended thinking up to 64K tokens and is optimized for tasks involving research, data analysis, and tool-assisted reasoning.
Step-3 is a brand-new multimodal reasoning model that can process both image and text inputs and produce text responses. It is capable of deep thinking and can autonomously carry out a reasoning process: before generating the final output, it completes a "thinking" phase (for example, presenting reasoning information via a reasoning field), which improves the accuracy of the final result and the depth of reasoning. When calling the model, developers do not need to preset many system prompts (sys_prompt), as the model can automatically leverage its built-in deep-thinking capability.
GLM-4.5 is our latest flagship foundation model, purpose-built for agent-based applications. It leverages a Mixture-of-Experts (MoE) architecture and supports a context length of up to 128k tokens. GLM-4.5 delivers significantly enhanced capabilities in reasoning, code generation, and agent alignment. It supports a hybrid inference mode with two options, a "thinking mode" designed for complex reasoning and tool use, and a "non-thinking mode" optimized for instant responses. Users can control the reasoning behaviour with the reasoning enabled boolean.
GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a "thinking mode" for advanced reasoning and tool use, and a "non-thinking mode" for real-time interaction. Users can control the reasoning behaviour with the reasoning enabled boolean. Learn more in our docs
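For either GLM-4.5 model, a sketch of the reasoning-enabled boolean, assuming an OpenAI-compatible endpoint; the base URL, slug, and field shape are assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="glm-4.5-air",  # illustrative slug
    messages=[{"role": "user", "content": "Plan a multi-step scraping agent."}],
    # Assumed field shape: False selects the instant, non-thinking mode.
    extra_body={"reasoning": {"enabled": False}},
)
print(response.choices[0].message.content)
```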
Powered by Qwen3, this is a powerful Coding Agent that excels in tool calling and environment interaction to achieve autonomous programming. It combines outstanding coding proficiency with versatile general-purpose abilities.
Gemini 2.5 Flash-Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for ultra-low latency and cost efficiency. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models. By default, "thinking" (i.e. multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the Reasoning API parameter to selectively trade off cost for intelligence.
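A sketch of opting Flash-Lite into thinking via the Reasoning parameter described above; the endpoint, slug, and field shape are assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",  # illustrative slug
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9? Explain."}],
    # Assumed field shape: trade a little latency for deeper reasoning.
    extra_body={"reasoning": {"enabled": True}},
)
print(response.choices[0].message.content)
```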
Qwen3-235B-A22B-Instruct-2507 is a multilingual, instruction-tuned mixture-of-experts language model based on the Qwen3-235B architecture, with 22B active parameters per forward pass. It is optimized for general-purpose text generation, including instruction following, logical reasoning, math, code, and tool usage. The model supports a native 262K context length and does not implement "thinking mode" (<think> blocks).
Compared to its base variant, this version delivers significant gains in knowledge coverage, long-context reasoning, coding benchmarks, and alignment with open-ended tasks. It is particularly strong on multilingual understanding, math reasoning (e.g., AIME, HMMT), and alignment evaluations like Arena-Hard and WritingBench.
Qwen3-235B-A22B-Thinking-2507 is a high-performance, open-weight Mixture-of-Experts (MoE) language model optimized for complex reasoning tasks. It activates 22B of its 235B parameters per forward pass and natively supports up to 262,144 tokens of context. This "thinking-only" variant enhances structured logical reasoning, mathematics, science, and long-form generation, showing strong benchmark performance across AIME, SuperGPQA, LiveCodeBench, and MMLU-Redux. It enforces a special reasoning mode (</think>) and is designed for high-token outputs (up to 81,920 tokens) in challenging domains.
The model is instruction-tuned and excels at step-by-step reasoning, tool use, agentic workflows, and multilingual tasks. This release represents the most capable open-source variant in the Qwen3-235B series, surpassing many closed models in structured reasoning use cases.
Qwen3-Coder-480B-A35B-Instruct is a Mixture-of-Experts (MoE) code generation model developed by the Qwen team. It is optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning over repositories. The model features 480 billion total parameters, with 35 billion active per forward pass (8 out of 160 experts).
Pricing for the Alibaba endpoints varies by context length: once a request exceeds 128k input tokens, the higher pricing is used.
Kimi K2 Instruct is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks. It supports long-context inference up to 128K tokens and is designed with a novel training stack that includes the MuonClip optimizer for stable large-scale MoE training.
Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not exposed, reasoning cannot be disabled, and the reasoning effort cannot be specified. Pricing increases once the total tokens in a given request exceed 128k. See more details in the xAI docs.
Gemini 2.5 Flash is Google's state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks. It includes built-in "thinking" capabilities, enabling it to provide responses with greater accuracy and nuanced context handling.
Additionally, Gemini 2.5 Flash is configurable through the "max tokens for reasoning" parameter, as described in the documentation.
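A sketch of capping that reasoning budget, per the "max tokens for reasoning" setting mentioned above; the field name and endpoint are assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="google/gemini-2.5-flash",  # illustrative slug
    messages=[{"role": "user", "content": "Derive the quadratic formula."}],
    # Assumed field shape: bound the hidden "thinking" spend per request.
    extra_body={"reasoning": {"max_tokens": 1024}},
)
print(response.choices[0].message.content)
```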
Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.
May 28th update to the original DeepSeek R1. Performance is on par with OpenAI o1, but the model is fully open-source, with fully open reasoning tokens. It is 671B parameters in size, with 37B active in an inference pass.
Claude Opus 4 was benchmarked as the world's best coding model at the time of release, bringing sustained performance on complex, long-running tasks and agent workflows. It sets new benchmarks in software engineering, achieving leading results on SWE-bench (72.5%) and Terminal-bench (43.2%). Opus 4 supports extended, agentic workflows, handling thousands of task steps continuously for hours without degradation.
Claude Sonnet 4 significantly enhances the capabilities of its predecessor, Sonnet 3.7, excelling in both coding and reasoning tasks with improved precision and controllability. Achieving state-of-the-art performance on SWE-bench (72.7%), Sonnet 4 balances capability and computational efficiency, making it suitable for a broad range of applications from routine coding tasks to complex software development projects. Key enhancements include improved autonomous codebase navigation, reduced error rates in agent-driven workflows, and increased reliability in following intricate instructions. Sonnet 4 is optimized for practical everyday use, providing advanced reasoning capabilities while maintaining efficiency and responsiveness in diverse internal and external scenarios.
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is the second largest in the family of Gemma 3 models after Gemma 3 27B.
Qwen3-14B is a dense 14.8B parameter causal language model from the Qwen3 series, designed for both complex reasoning and efficient dialogue. It supports seamless switching between a "thinking" mode for tasks like math, programming, and logical inference, and a "non-thinking" mode for general-purpose conversation. The model is fine-tuned for instruction-following, agent tool use, creative writing, and multilingual tasks across 100+ languages and dialects. It natively handles 32K token contexts and can extend to 131K tokens using YaRN-based scaling.
OpenAI o4-mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities. It supports tool use and demonstrates competitive reasoning and coding performance across benchmarks like AIME (99.5% with Python) and SWE-bench, outperforming its predecessor o3-mini and even approaching o3 in some domains.
Despite its smaller size, o4-mini exhibits high accuracy in STEM tasks, visual problem solving (e.g., MathVista, MMMU), and code editing. It is especially well-suited for high-throughput scenarios where latency or cost is critical. Thanks to its efficient architecture and refined reinforcement learning training, o4-mini can chain tools, generate structured outputs, and solve multi-step tasks with minimal delay—often in under a minute.
GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding (54.6% SWE-bench Verified), instruction compliance (87.4% IFEval), and multimodal understanding benchmarks. It is tuned for precise code diffs, agent reliability, and high recall in large document contexts, making it ideal for agents, IDE tooling, and enterprise knowledge retrieval.
GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider’s polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.
For tasks that demand low latency, GPT‑4.1 nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding – even higher than GPT‑4o mini. It’s ideal for tasks like classification or autocompletion.
Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout uses 16 experts per forward pass and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.
Gemini 2.0 Flash Lite offers a significantly faster time to first token (TTFT) compared to Gemini Flash 1.5, while maintaining quality on par with larger models like Gemini Pro 1.5, all at extremely economical token prices.
Claude 3.7 Sonnet is an advanced large language model with improved reasoning, coding, and problem-solving capabilities. It introduces a hybrid reasoning approach, allowing users to choose between rapid responses and extended, step-by-step processing for complex tasks. The model demonstrates notable improvements in coding, particularly in front-end development and full-stack updates, and excels in agentic workflows, where it can autonomously navigate multi-step processes.
Claude 3.7 Sonnet maintains performance parity with its predecessor in standard mode while offering an extended reasoning mode for enhanced accuracy in math, coding, and instruction-following tasks.
Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to Gemini Flash 1.5, while maintaining quality on par with larger models like Gemini Pro 1.5. It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction-tuned generative model with 70B parameters (text in/text out). The Llama 3.3 instruction-tuned, text-only model is optimized for multilingual dialogue use cases and outperforms many available open-source and closed chat models on common industry benchmarks.
Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Claude 3.5 Haiku offers enhanced capabilities in speed, coding accuracy, and tool use. Engineered to excel in real-time applications, it delivers quick response times that are essential for dynamic tasks such as chat interactions and immediate coding suggestions.
This makes it highly suitable for environments that demand both speed and precision, such as software development, customer service bots, and data management systems.
Retiring soon; requests to this model now fall back to Sonnet 3.7.
New Claude 3.5 Sonnet delivers better-than-Opus capabilities and faster-than-Sonnet speeds at the same Sonnet prices.
GPT-4o mini is OpenAI's newest model after GPT-4 Omni, supporting both text and image inputs with text outputs.
As their most advanced small model, it is many multiples more affordable than other recent frontier models, and more than 60% cheaper than GPT-3.5 Turbo. It maintains SOTA intelligence, while being significantly more cost-effective.
GPT-4o mini achieves an 82% score on MMLU and presently ranks higher than GPT-4 on common chat-preference leaderboards.
GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbowhile being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.
For benchmarking against other models, it was briefly called "im-also-a-good-gpt2-chatbot".