Input Modalities

Text
Image
File
Audio
Video

Output Modalities

Text
Image
File
Audio
Video

Context Length

4K / 64K / 1M

Series

GPT
Claude
Gemini

Providers

OpenAI
Anthropic
Tbox

Supported Parameters

max_completion_tokens
temperature
top_p

Supported Protocol

OpenAI Chat Completions
OpenAI Responses
Anthropic Messages
Google VertexAI

Reasoning

No Reasoning
Toggleable Reasoning
Always-On Reasoning

Models

94 models, sorted by newest

ZenMux's automatic routing feature selects the most cost-effective and high-performing AI models based on your query.
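All models in the catalog are exposed through the OpenAI-compatible protocols listed in the sidebar, so a card's slug can be dropped straight into a Chat Completions-style request. A minimal sketch; the base URL and the helper function are our assumptions, not confirmed by this page — check ZenMux's API documentation for the actual endpoint:

```python
import json

# Assumed OpenAI-compatible endpoint -- verify against the platform's API docs.
BASE_URL = "https://zenmux.ai/api/v1/chat/completions"  # assumption

def chat_payload(model: str, user_msg: str, **params) -> str:
    """Build an OpenAI Chat Completions-style request body for a catalog slug."""
    body = {
        "model": model,  # any slug from the cards below, e.g. "z-ai/glm-4.7"
        "messages": [{"role": "user", "content": user_msg}],
        **params,  # e.g. max_completion_tokens, temperature, top_p
    }
    return json.dumps(body)

payload = chat_payload("z-ai/glm-4.7", "Hello!", temperature=0.7)
```

Which of the supported parameters actually apply varies per model; the Supported Parameters filter above reflects that.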

inclusionAI: LLaDA2-flash-CAP

inclusionai/llada2.0-flash-cap
11.17M tokens

LLaDA2.0-flash-CAP is an enhanced version of LLaDA2.0-flash, which significantly improves inference efficiency by incorporating Confidence-Aware Parallelism (CAP) training technology. Based on a 100B total parameter Mixture of Experts (MoE) diffusion architecture, this model achieves faster parallel decoding speeds while maintaining exceptional performance across various benchmark tests.

Input type
Output Type
Input$0.28/M tokens
Output$2.85/M tokens
Context32.00K
Max Output32.00K
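All prices on these cards are quoted per million (M) tokens, so a request's cost is its input and output token counts scaled by the two listed rates. A quick worked sketch using this card's rates:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD, with prices quoted per million tokens as on each card."""
    return input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price

# LLaDA2.0-flash-CAP: $0.28/M input, $2.85/M output
cost = request_cost(10_000, 2_000, 0.28, 2.85)  # ≈ $0.0085
```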
Z.AI: GLM 4.7

z-ai/glm-4.7
159.77M tokens

Pricing: the model's official pricing is GLM-4.7: Input $0.28 - $0.57, Cached Input $0.057 - $0.11, Output $1.14 - $2.27 (per M tokens). Our platform is currently running a limited-time promotion, during which you will receive a discount on the official price.

GLM-4.7 is Zhipu's latest flagship model. Tailored for agentic coding scenarios, it strengthens coding capabilities, long-horizon task planning, and tool collaboration, and delivers leading performance among open-source models on the latest leaderboards of multiple public benchmarks. Its general capabilities have also improved: responses are more concise and natural, and writing feels more immersive. When executing complex agent tasks and invoking tools, it follows instructions more reliably, and the visual quality of artifacts and agentic coding front ends, as well as long-horizon task completion efficiency, is further enhanced.

Input type
Output Type
Input$0.28 - 0.57/M tokens
Output$1.14 - 2.27/M tokens
Context200.00K
Max Output128.00K
MiniMax: MiniMax M2.1

minimax/minimax-m2.1
26.33M tokens

MiniMax-M2.1 is a lightweight, state-of-the-art large language model optimized for coding, agentic workflows, and modern application development. With only 10 billion activated parameters, it delivers a major jump in real-world capability while maintaining exceptional latency, scalability, and cost efficiency.

Compared to its predecessor, M2.1 delivers cleaner, more concise outputs and faster perceived response times. It shows leading multilingual coding performance across major systems and application languages, achieving 49.4% on Multi-SWE-Bench and 72.5% on SWE-Bench Multilingual, and serves as a versatile agent “brain” for IDEs, coding tools, and general-purpose assistance.

To avoid degrading this model's performance, MiniMax highly recommends preserving reasoning between turns.
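Concretely, that recommendation means sending the assistant's reasoning back with the conversation history rather than stripping it. A sketch, assuming OpenAI-style messages where reasoning comes back in a `reasoning_content` field (the exact field name is an assumption; check MiniMax's API docs):

```python
history = [{"role": "user", "content": "Refactor this function."}]

# Hypothetical previous assistant turn; field names are assumptions.
assistant_turn = {
    "role": "assistant",
    "content": "Here is the refactored version...",
    "reasoning_content": "The function mixes I/O and logic, so split it...",
}

# Append the turn WITH its reasoning so the next request preserves it,
# as MiniMax recommends, instead of dropping the reasoning field.
history.append(assistant_turn)
history.append({"role": "user", "content": "Now add tests."})
```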

Input type
Output Type
Input$0.30/M tokens
Output$1.20/M tokens
Context204.80K
Max Output131.07K
VolcanoEngine: Doubao-Seed-1.8

volcengine/doubao-seed-1.8
866.47M tokens

An all-new model purpose-built and optimized for multimodal agent scenarios, with stronger agent capabilities, upgraded multimodal understanding, and more flexible context management.

Input type
Output Type
Input$0.11 - 0.34/M tokens
Output$0.28 - 3.41/M tokens
Context256.00K
Max Output32.00K
Google: Gemini 3 Flash Preview

google/gemini-3-flash-preview
2606.77M tokens

Gemini 3 Flash Preview is a low-latency model in the Gemini 3 family, optimized for fast, high-throughput inference. It retains the core multimodal and reasoning capabilities of Gemini 3 while prioritizing responsiveness and execution efficiency. Built on the same architecture as Gemini 3 Pro, Gemini 3 Flash Preview supports native multimodal inputs—including text, images, and audio—and incorporates the improved reasoning and long-context handling introduced in the Gemini 3 generation. It is designed for real-time and scalable workloads where latency and cost efficiency are primary considerations.

Input type
Output Type
Input$0.50/M tokens
Output$3.00/M tokens
Context1.05M
Max Output65.53K
Google: Gemini 3 Flash Preview Free

google/gemini-3-flash-preview-free
Free
1398.96M tokens

Note: This model is free to use in Studio Chat, but it comes with rate limits and may be temporarily unavailable. For stable, high-availability use, please choose a standard-priced model. Gemini 3 Flash Preview is a low-latency model in the Gemini 3 family, optimized for fast, high-throughput inference. It retains the core multimodal and reasoning capabilities of Gemini 3 while prioritizing responsiveness and execution efficiency.

Built on the same architecture as Gemini 3 Pro, Gemini 3 Flash Preview supports native multimodal inputs—including text, images, and audio—and incorporates the improved reasoning and long-context handling introduced in the Gemini 3 generation. It is designed for real-time and scalable workloads where latency and cost efficiency are primary considerations.

Input type
Output Type
Input $0.50 → $0/M tokens
Output $3.00 → $0/M tokens
Context1.05M
Max Output65.53K
Xiaomi: MiMo-V2-Flash

xiaomi/mimo-v2-flash
2161.83M tokens

Limited time free. MiMo-V2-Flash is an open-source foundation language model developed by Xiaomi. It is a Mixture-of-Experts model with 309B total parameters and 15B active parameters, adopting a hybrid attention architecture. MiMo-V2-Flash supports a hybrid-thinking toggle and a 256K context window, and excels at reasoning, coding, and agent scenarios. On SWE-bench Verified and SWE-bench Multilingual, MiMo-V2-Flash ranks as the #1 open-source model globally, delivering performance comparable to Claude Sonnet 4.5 while costing only about 3.5% as much.

Input type
Output Type
Input$0.00/M tokens
Output$0.00/M tokens
Context262.14K
Max Output262.14K
Xiaomi: MiMo-V2-Flash Free

xiaomi/mimo-v2-flash-free
Free
91.38M tokens

MiMo-V2-Flash is an open-source foundation language model developed by Xiaomi. It is a Mixture-of-Experts model with 309B total parameters and 15B active parameters, adopting a hybrid attention architecture. MiMo-V2-Flash supports a hybrid-thinking toggle and a 256K context window, and excels at reasoning, coding, and agent scenarios. On SWE-bench Verified and SWE-bench Multilingual, MiMo-V2-Flash ranks as the #1 open-source model globally, delivering performance comparable to Claude Sonnet 4.5 while costing only about 3.5% as much.

Input type
Output Type
Input $0/M tokens
Output $0/M tokens
Context262.14K
Max Output262.14K
OpenAI: GPT-5.2 Pro

openai/gpt-5.2-pro
80.14M tokens

GPT-5.2 Pro is OpenAI’s most advanced model, offering major improvements in agentic coding and long context performance over GPT-5 Pro. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy in high-stakes use cases. It supports test-time routing features and advanced prompt understanding, including user-specified intent like "think hard about this." Improvements include reductions in hallucination, sycophancy, and better performance in coding, writing, and health-related tasks.

Input type
Output Type
Input$21.00/M tokens
Output$168.00/M tokens
Context400.00K
Max Output128.00K
OpenAI: GPT-5.2

openai/gpt-5.2
753.06M tokens

GPT-5.2 is the newest frontier-grade model in the GPT-5 family, outperforming GPT-5.1 in agentic capabilities and long-context handling. It uses adaptive reasoning to dynamically allocate compute—responding quickly to straightforward questions while applying deeper analysis to more complex tasks. Designed for broad task coverage, GPT-5.2 delivers consistent improvements across math, coding, science, and tool-calling workloads, producing more coherent long-form responses and more reliable tool use.

Input type
Output Type
Input$1.75/M tokens
Output$14.00/M tokens
Context400.00K
Max Output128.00K
OpenAI: GPT-5.2 Chat

openai/gpt-5.2-chat
72.14M tokens

GPT-5.2 Chat (AKA Instant) is the fast, lightweight member of the 5.2 family, optimized for low-latency chat while retaining strong general intelligence. It uses adaptive reasoning to selectively “think” on harder queries, improving accuracy on math, coding, and multi-step tasks without slowing down typical conversations. The model is warmer and more conversational by default, with better instruction following and more stable short-form reasoning. GPT-5.2 Chat is designed for high-throughput, interactive workloads where responsiveness and consistency matter more than deep deliberation.

Input type
Output Type
Input$1.75/M tokens
Output$14.00/M tokens
Context128.00K
Max Output16.38K
Z.AI: GLM 4.6V

z-ai/glm-4.6v
11.60M tokens

GLM-4.6V represents a significant evolution of the GLM series into the multimodal domain. It features a 128k-token training context window and sets a new state of the art in visual understanding accuracy for its parameter scale. It is the first model to natively integrate tool-calling capabilities into its visual architecture, bridging the gap from visual perception to executable actions and making it a unified technical foundation for multimodal agents in real-world business scenarios. Pricing: the model's official pricing is GLM-4.6V: Input $0.3, Cached Input $0.05, Output $0.9 (per M tokens). Our platform is currently running a limited-time promotion, during which you will receive a discount on the official price.

Input type
Output Type
Input$0.14 - 0.28/M tokens
Output$0.42 - 0.85/M tokens
Context200.00K
Max Output128.00K
Z.AI: GLM 4.6V Flash (Free)

z-ai/glm-4.6v-flash-free
Free
108.86M tokens

GLM-4.6V-Flash is the free version of GLM-4.6V, representing a significant iteration in the GLM series for multimodal capabilities. It supports toggling reasoning modes, with a training-time context window expanded to 128k tokens. Achieving state-of-the-art (SOTA) visual understanding accuracy at its parameter scale, it is the first visual model to natively integrate function-call capability into its architecture. This establishes a seamless pipeline from visual perception to executable actions, offering a unified technical foundation for multimodal agents in real-world business scenarios.

Input type
Output Type
Input $0.02 - 0.04 → $0/M tokens
Output $0.21 - 0.43 → $0/M tokens
Context200.00K
Max Output128.00K
Z.AI: GLM 4.6V FlashX

z-ai/glm-4.6v-flash
Free
286.34M tokens

GLM-4.6V-FlashX is the paid version, offering higher capacity and stability, and representing a significant iteration in the GLM series for multimodal capabilities. It supports toggling reasoning modes, with a training-time context window expanded to 128k tokens. Achieving state-of-the-art (SOTA) visual understanding accuracy at its parameter scale, it is the first visual model to natively integrate function-call capability into its architecture. This establishes a seamless pipeline from visual perception to executable actions, offering a unified technical foundation for multimodal agents in real-world business scenarios.

Input type
Output Type
Input $0.02 - 0.04 → $0/M tokens
Output $0.21 - 0.43 → $0/M tokens
Context200.00K
Max Output128.00K
DeepSeek: DeepSeek V3.2

deepseek/deepseek-v3.2
14.39M tokens

DeepSeek-V3.2 is a reasoning-first large language model released by DeepSeek as the official successor to V3.2-Exp. It is designed with a focus on agentic capabilities and integrated reasoning for tool-use scenarios.

If the request to the deepseek-reasoner model includes the tools parameter, the request will actually be processed using the deepseek-chat model.

The model introduces a new large-scale agent training data synthesis method covering over 1,800 environments and 85,000+ complex instructions. DeepSeek-V3.2 is the first model from DeepSeek to integrate thinking directly into tool-use, supporting both thinking and non-thinking modes during tool interactions.

According to DeepSeek's benchmarks, the model delivers performance comparable to GPT-5 level while balancing inference quality and output length. It is available via App, Web, and API, making it suitable for general-purpose daily use as well as complex agentic workflows.
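The tools caveat above can be mirrored client-side when you need to know which model will actually serve a request. A sketch of the documented fallback (the helper name is ours):

```python
def effective_model(model: str, request: dict) -> str:
    """Per DeepSeek's note: a deepseek-reasoner request that carries a
    `tools` parameter is actually served by deepseek-chat."""
    if model == "deepseek-reasoner" and "tools" in request:
        return "deepseek-chat"
    return model

# A tool-calling request silently falls back to the chat model.
served = effective_model("deepseek-reasoner", {"tools": [{"type": "function"}]})
```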

Input type
Output Type
Input$0.28/M tokens
Output$0.43/M tokens
Context128.00K
Max Output8.00K
Mistral: Mistral Large 3

mistralai/mistral-large-2512
57.93M tokens

Mistral Large 3 is a state-of-the-art, open-weight, general-purpose multimodal model with a granular Mixture-of-Experts architecture. It features 41B active parameters and 675B total parameters.

Input type
Output Type
Input$0.50/M tokens
Output$1.50/M tokens
Context256.00K
Max Output256.00K
DeepSeek: DeepSeek-V3.2 (Non-thinking Mode)

1232.01M tokens

DeepSeek-V3.2 (Non-thinking Mode) is DeepSeek's latest production model, currently served under the deepseek-chat model slug, which is automatically updated as new versions are released.

DeepSeek-V3.2 is a reasoning-first large language model released by DeepSeek as the official successor to V3.2-Exp. It is designed with a focus on agentic capabilities and integrated reasoning for tool-use scenarios.

If the request to the deepseek-reasoner model includes the tools parameter, the request will actually be processed using the deepseek-chat model.

The model introduces a new large-scale agent training data synthesis method covering over 1,800 environments and 85,000+ complex instructions. DeepSeek-V3.2 is the first model from DeepSeek to integrate thinking directly into tool-use, supporting both thinking and non-thinking modes during tool interactions.

According to DeepSeek's benchmarks, the model delivers performance comparable to GPT-5 level while balancing inference quality and output length. It is available via App, Web, and API, making it suitable for general-purpose daily use as well as complex agentic workflows.

Input type
Output Type
Input$0.28/M tokens
Output$0.42/M tokens
Context128.00K
Max Output8.00K
DeepSeek: DeepSeek-V3.2 (Thinking Mode)

deepseek/deepseek-reasoner
626.45M tokens

DeepSeek-V3.2 (Thinking Mode) is DeepSeek's latest production model, currently served under the deepseek-reasoner model slug, which is automatically updated as new versions are released.

DeepSeek-V3.2 is a reasoning-first large language model released by DeepSeek as the official successor to V3.2-Exp. It is designed with a focus on agentic capabilities and integrated reasoning for tool-use scenarios.

If the request to the deepseek-reasoner model includes the tools parameter, the request will actually be processed using the deepseek-chat model.

The model introduces a new large-scale agent training data synthesis method covering over 1,800 environments and 85,000+ complex instructions. DeepSeek-V3.2 is the first model from DeepSeek to integrate thinking directly into tool-use, supporting both thinking and non-thinking modes during tool interactions.

According to DeepSeek's benchmarks, the model delivers performance comparable to GPT-5 level while balancing inference quality and output length. It is available via App, Web, and API, making it suitable for general-purpose daily use as well as complex agentic workflows.

Input type
Output Type
Input$0.28/M tokens
Output$0.42/M tokens
Context128.00K
Max Output64.00K
Anthropic: Claude Opus 4.5

anthropic/claude-opus-4.5
3072.89M tokens

Claude Opus 4.5 is Anthropic's latest frontier reasoning model, purpose-built for complex software engineering, agentic workflows, and long-horizon computer use. It delivers strong multimodal capabilities, competitive performance on real-world coding and reasoning benchmarks, and improved robustness against prompt injection attacks. The model is designed to operate efficiently across varied effort levels, allowing developers to balance speed, depth, and token usage based on their specific task requirements; token efficiency can be tuned through a verbosity parameter with low, medium, and high settings. Beyond that, Opus 4.5 supports advanced tool use, extended context management, and coordinated multi-agent setups, making it ideal for autonomous research, debugging, multi-step planning, and spreadsheet or browser manipulation. Compared to previous Opus generations, it brings substantial improvements in structured reasoning, execution reliability, and alignment, while reducing token overhead and delivering more consistent performance on long-running tasks.

Input type
Output Type
Input$5.00/M tokens
Output$25.00/M tokens
Context200.00K
Max Output32.00K
Google: Gemini 3 Pro Image (Nano Banana Pro)

google/gemini-3-pro-image-preview
758.20M tokens

Nano Banana Pro is an AI image generation model that is a significant upgrade from its predecessor, built on Google's Gemini 3 Pro. It promises to move beyond simple pattern matching to a more reasoning-driven system with improved physics understanding, text rendering, and image consistency. Key features include faster processing, native 2K resolution, and the ability to edit existing images with greater control, aiming to produce more reliable and professional-grade results.

Input type
Output Type
Input$2.00 - 4.00/M tokens
Output$12.00 - 18.00/M tokens
Context65.54K
Max Output32.77K
xAI: Grok 4.1 Fast

x-ai/grok-4.1-fast
1110.32M tokens

Grok 4.1 Fast is xAI's best agentic tool-calling model, shining in real-world use cases like customer support and deep research. It has a 2M-token context window.

Reasoning can be enabled or disabled via the reasoning parameter in the API.
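A request-body sketch of that toggle; the exact parameter name and shape are assumptions to verify against xAI's API reference:

```python
# Hypothetical request bodies; the "reasoning" field shape is an assumption.
reasoning_on = {
    "model": "x-ai/grok-4.1-fast",
    "messages": [{"role": "user", "content": "Summarize this support ticket."}],
    "reasoning": {"enabled": True},   # deeper answers, higher latency
}
# Same request with reasoning disabled for low-latency tool calling.
reasoning_off = {**reasoning_on, "reasoning": {"enabled": False}}
```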

Input type
Output Type
Input$0.20 - 0.40/M tokens
Output$0.50 - 1.00/M tokens
Context2.00M
Max Output30.00K
xAI: Grok 4.1 Fast Non Reasoning

x-ai/grok-4.1-fast-non-reasoning
301.86M tokens

Grok 4.1 Fast Non Reasoning is xAI's best agentic tool-calling model, shining in real-world use cases like customer support and deep research. It has a 2M-token context window.

Input type
Output Type
Input$0.20 - 0.40/M tokens
Output$0.50 - 1.00/M tokens
Context2.00M
Max Output30.00K
Google: Gemini 3 Pro Preview

google/gemini-3-pro-preview
5018.90M tokens

Gemini 3 Pro is the next generation in the Gemini series, a suite of highly capable, natively multimodal reasoning models. It is now Google's most advanced model for complex tasks, able to comprehend vast datasets and challenging problems drawn from different information sources, including text, audio, images, video, and entire code repositories.

Input type
Output Type
Input$2.00 - 4.00/M tokens
Output$12.00 - 18.00/M tokens
Context1.05M
Max Output65.53K
OpenAI: GPT-5.1

openai/gpt-5.1
558.01M tokens

GPT-5.1 is the latest frontier-grade model in the GPT-5 series, offering stronger general-purpose reasoning, improved instruction adherence, and a more natural conversational style compared to GPT-5. It uses adaptive reasoning to allocate computation dynamically, responding quickly to simple queries while spending more depth on complex tasks. The model produces clearer, more grounded explanations with reduced jargon, making it easier to follow even on technical or multi-step problems.

Built for broad task coverage, GPT-5.1 delivers consistent gains across math, coding, and structured analysis workloads, with more coherent long-form answers and improved tool-use reliability. It also features refined conversational alignment, enabling warmer, more intuitive responses without compromising precision. GPT-5.1 serves as the primary full-capability successor to GPT-5.

Input type
Output Type
Input$1.25/M tokens
Output$10.00/M tokens
Context400.00K
Max Output128.00K
OpenAI: GPT-5.1 Chat

openai/gpt-5.1-chat
99.40M tokens

GPT-5.1 Chat (AKA Instant) is the fast, lightweight member of the 5.1 family, optimized for low-latency chat while retaining strong general intelligence. It uses adaptive reasoning to selectively “think” on harder queries, improving accuracy on math, coding, and multi-step tasks without slowing down typical conversations. The model is warmer and more conversational by default, with better instruction following and more stable short-form reasoning. GPT-5.1 Chat is designed for high-throughput, interactive workloads where responsiveness and consistency matter more than deep deliberation.

Input type
Output Type
Input$1.25/M tokens
Output$10.00/M tokens
Context128.00K
Max Output16.38K
OpenAI: GPT-5.1-Codex

openai/gpt-5.1-codex
362.31M tokens

GPT-5.1-Codex is a specialized version of GPT-5.1 optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review. Compared to GPT-5.1, Codex is more steerable, adheres closely to developer instructions, and produces cleaner, higher-quality code outputs.

Codex integrates into developer environments including the CLI, IDE extensions, GitHub, and cloud tasks. It adapts reasoning effort dynamically—providing fast responses for small tasks while sustaining extended multi-hour runs for large projects. The model is trained to perform structured code reviews, catching critical flaws by reasoning over dependencies and validating behavior against tests. It also supports multimodal inputs such as images or screenshots for UI development and integrates tool use for search, dependency installation, and environment setup. Codex is intended specifically for agentic coding applications.

Input type
Output Type
Input$1.25/M tokens
Output$10.00/M tokens
Context400.00K
Max Output128.00K
OpenAI: GPT-5.1-Codex-Mini

openai/gpt-5.1-codex-mini
264.18M tokens

GPT-5.1-Codex-Mini is a smaller and faster version of GPT-5.1-Codex.

Input type
Output Type
Input$0.25/M tokens
Output$2.00/M tokens
Context400.00K
Max Output100.00K
Baidu: ERNIE-5.0-Thinking-Preview

baidu/ernie-5.0-thinking-preview
481.41M tokens

The new-generation Wenxin model, Wenxin 5.0, is a natively multimodal large model. It adopts a native unified multimodal modeling approach to jointly model text, images, audio, and video, providing comprehensive multimodal capabilities. Wenxin 5.0's core abilities have been comprehensively upgraded, and it performs strongly on benchmark datasets, with particularly good results in multimodal understanding, instruction following, creative writing, factuality, agent planning, and tool use.

Input type
Output Type
Input$0.84 - 1.41/M tokens
Output$3.37 - 5.62/M tokens
Context128.00K
Max Output64.00K
VolcanoEngine: Doubao-Seed-Code

volcengine/doubao-seed-code
1214.83M tokens

Doubao-Seed-Code has been deeply optimized for Agentic Programming tasks, delivering exceptional performance across multiple authoritative benchmarks — including Terminal Bench, SWE-Bench-Verified-Openhands, and Multi-SWE-Bench-Flash-Openhands — outperforming domestic counterparts and supporting a context window of up to 256k tokens.

Input type
Output Type
Input$0.17 - 0.39/M tokens
Output$1.12 - 2.25/M tokens
Context256.00K
Max Output32.00K
MoonshotAI: Kimi K2 Thinking

moonshotai/kimi-k2-thinking
341.07M tokens

A thinking model with general agentic and reasoning capabilities, specializing in deep reasoning tasks.

Input type
Output Type
Input$0.60/M tokens
Output$2.50/M tokens
Context262.14K
Max Output262.14K
MoonshotAI: Kimi K2 Thinking Turbo

moonshotai/kimi-k2-thinking-turbo
52.84M tokens

A high-speed version of kimi-k2-thinking with a 256K context length, suitable for scenarios requiring both deep reasoning and extremely fast responses.

Input type
Output Type
Input$1.15/M tokens
Output$8.00/M tokens
Context262.14K
Max Output262.14K
Qwen: Qwen3 Max Thinking Preview

qwen/qwen3-max-preview
20.42M tokens

Please note: this model's thinking mode is toggleable. To enable "thinking" mode, set "enable_thinking=True".

This is a preview version of the Max model in the Tongyi Qianwen 3 series, achieving an effective integration of thinking and non-thinking modes. In thinking mode, there is a significant enhancement in capabilities such as intelligent agent programming, common-sense reasoning, and reasoning across mathematics, science, and general domains.
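Following the note above, a request-body sketch with thinking enabled; where the flag travels (top level vs. an SDK's extra body) depends on your client, so treat the placement here as an assumption:

```python
request = {
    "model": "qwen/qwen3-max-preview",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # Off by default for this toggle-thinking model; must be set explicitly.
    "enable_thinking": True,
}
```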

Input type
Output Type
Input$1.20 - 3.00/M tokens
Output$6.00 - 15.00/M tokens
Context262.14K
Max Output65.54K
inclusionAI: Ming-flash-omni Preview

inclusionai/ming-flash-omni-preview
7.41M tokens

Ming-flash-omni Preview is a multimodal AI model that supports text, speech, image, and video inputs while generating text, speech, and image outputs. Built on a sparse 100-billion-parameter Mixture-of-Experts architecture with 6 billion active parameters per token, it achieves state-of-the-art speech recognition across 12 ContextASR benchmarks and delivers significant improvements for 15 Chinese dialects. The model introduces high-fidelity text rendering in image generation, enhanced scene consistency, and superior identity preservation during editing. Its innovative generative segmentation capability unifies segmentation and editing into a semantics-preserving framework, achieving a 0.90 score on GenEval for fine-grained spatial control. A Dual-Balanced Routing Mechanism ensures stable cross-modal training through auxiliary load balancing loss and modality-level router bias updates. Compared to Ming-lite-omni v1.5, Ming-flash-omni Preview offers substantial advancements in architecture efficiency, editing precision, and speech understanding, establishing itself as a highly competitive solution among leading multimodal models.

Input type
Output Type
Input$0.80/M tokens
Output$1.80/M tokens
Context64.00K
Max Output32.00K
MiniMax: MiniMax M2

minimax/minimax-m2
421.21M tokens

MiniMax-M2 is a compact, high-efficiency large language model optimized for end-to-end coding and agentic workflows. With 10 billion activated parameters (230 billion total), it delivers near-frontier intelligence across general reasoning, tool use, and multi-step task execution while maintaining low latency and deployment efficiency. The model excels in code generation, multi-file editing, compile-run-fix loops, and test-validated repair, showing strong results on SWE-Bench Verified, Multi-SWE-Bench, and Terminal-Bench. It also performs competitively in agentic evaluations such as BrowseComp and GAIA, effectively handling long-horizon planning, retrieval, and recovery from execution errors. Benchmarked by Artificial Analysis, MiniMax-M2 ranks among the top open-source models for composite intelligence, spanning mathematics, science, and instruction-following. Its small activation footprint enables fast inference, high concurrency, and improved unit economics, making it well-suited for large-scale agents, developer assistants, and reasoning-driven applications that require responsiveness and cost efficiency.

Input type
Output Type
Input$0.30/M tokens
Output$1.20/M tokens
Context204.80K
Max Output128.00K
KwaiKAT: KAT-Coder-Pro-V1

kuaishou/kat-coder-pro-v1
1359.41M tokens

Limited time free. KAT-Coder-Pro V1 possesses advanced agent capabilities such as multi-tool parallel invocation, enabling autonomous completion of complex tasks with fewer interactions. It features stronger code comprehension and logical reasoning, delivering top-tier performance for AI coding.

Input type
Output Type
Input$0.00/M tokens
Output$0.00/M tokens
Context256.00K
Max Output32.00K
KwaiKAT: KAT-Coder-Pro-V1 Free

kuaishou/kat-coder-pro-v1-free
Free
15.09M tokens

Limited time free. KAT-Coder-Pro V1 possesses advanced agent capabilities such as multi-tool parallel invocation, enabling autonomous completion of complex tasks with fewer interactions. It features stronger code comprehension and logical reasoning, delivering top-tier performance for AI coding.

Input type
Output Type
Input $0/M tokens
Output $0/M tokens
Context256.00K
Max Output32.00K
Anthropic: Claude Haiku 4.5

anthropic/claude-haiku-4.5
2147.35M tokens

Claude Haiku 4.5 is Anthropic’s fastest and most efficient model, delivering near-frontier intelligence at a fraction of the cost and latency of larger Claude models. Matching Claude Sonnet 4’s performance across reasoning, coding, and computer-use tasks, Haiku 4.5 brings frontier-level capability to real-time and high-volume applications.

It introduces extended thinking to the Haiku line, enabling controllable reasoning depth, summarized or interleaved thought output, and tool-assisted workflows with full support for coding, bash, web search, and computer-use tools. Scoring over 73% on SWE-bench Verified, Haiku 4.5 ranks among the world’s best coding models while maintaining exceptional responsiveness for sub-agents, parallelized execution, and scaled deployment.

Input type
Output Type
Input$1.00/M tokens
Output$5.00/M tokens
Context200.00K
Max Output64.00K
inclusionAI: Ring-1T

inclusionai/ring-1t
881.41M tokens

Ring-1T is a trillion-parameter sparse mixture-of-experts (MoE) thinking model developed by inclusionAI. It adopts the Ling 2.0 architecture and is trained on the Ling-1T-base foundation model, which contains 1 trillion total parameters with 50 billion activated parameters, supporting a context window of up to 128K tokens. Building upon the preview version released at the end of September, Ring-1T has undergone continued scaling with large-scale verifiable reward reinforcement learning (RLVR) training, further unlocking the natural language reasoning capabilities of the trillion-parameter foundation model.

Input type
Output Type
Input$0.56/M tokens
Output$2.24/M tokens
Context128.00K
Max Output32.00K
inclusionAI: Ling-1T

inclusionai/ling-1t
693.40M tokens

Ling-1T is a trillion-parameter sparse mixture-of-experts (MoE) model developed by inclusionAI, optimized for efficient and scalable reasoning. Featuring approximately 50 billion active parameters per token, it is pre-trained on over 20 trillion reasoning-dense tokens, supports a 128K context length, and utilizes an Evolutionary Chain-of-Thought (Evo-CoT) process to enhance its reasoning depth. The model achieves state-of-the-art performance across complex benchmarks, demonstrating strong capabilities in code generation, software development, and advanced mathematics. In addition to its core reasoning skills, Ling-1T possesses specialized abilities in front-end code generation—combining semantic understanding with visual aesthetics—and exhibits emergent agentic capabilities, such as proficient tool use with minimal instruction tuning. Its primary use cases span software engineering, professional mathematics, complex logical reasoning, and agent-based workflows that demand a balance of high performance and efficiency.

Input type
Output Type
Input$0.56/M tokens
Output$2.24/M tokens
Context128.00K
Max Output32.00K
Google: Gemini 2.5 Flash Image (Nano Banana)

13.55M tokens

Gemini 2.5 Flash Image, a.k.a. "Nano Banana," is now generally available. It is a state-of-the-art image generation model with contextual understanding, capable of image generation, edits, and multi-turn conversations.

Input type
Output Type
Input$0.30/M tokens
Output$2.50/M tokens
Context32.77K
Max Output8.19K
OpenAI: GPT-5 Pro

openai/gpt-5-pro
24.25M tokens

GPT-5 Pro is OpenAI’s most advanced model, offering major improvements in reasoning, code quality, and user experience. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy in high-stakes use cases. It supports test-time routing features and advanced prompt understanding, including user-specified intent like "think hard about this." Improvements include reductions in hallucination, sycophancy, and better performance in coding, writing, and health-related tasks.

Input type
Output Type
Input$15.00/M tokens
Output$120.00/M tokens
Context400.00K
Max Output128.00K
Z.AI: GLM 4.6

z-ai/glm-4.6
425.00M tokens

GLM-4.6 is a flagship model from Z.AI (Zhipu) with 355B total parameters and 32B active parameters. The model's context window has been expanded from 128K to 200K, enabling it to handle longer code and agent tasks. In programming capability, GLM-4.6 performs comparably to Claude Sonnet 4 on public benchmarks and real-world programming tasks. The model supports tool calling during inference and features optimized search and tool-use performance within agent frameworks. It also brings enhancements to writing style, readability, role-playing, and cross-lingual task processing.

Pricing: The model's official pricing is (GLM-4.6: Input $0.6, Cached Input $0.11, Output $2.2). Our platform is currently running a limited-time promotion, during which you will receive a discount on the official price.

Input type
Output Type
Input$0.35 - 0.56/M tokens
Output$1.54 - 2.25/M tokens
Context200.00K
Max Output128.00K
Anthropic: Claude Sonnet 4.5

anthropic/claude-sonnet-4.5
14485.89M tokens

Claude Sonnet 4.5 is Anthropic’s most advanced Sonnet model to date, optimized for real-world agents and coding workflows. It delivers state-of-the-art performance on coding benchmarks such as SWE-bench Verified, with improvements across system design, code security, and specification adherence. The model is designed for extended autonomous operation, maintaining task continuity across sessions and providing fact-based progress tracking.

Sonnet 4.5 also introduces stronger agentic capabilities, including improved tool orchestration, speculative parallel execution, and more efficient context and memory management. With enhanced context tracking and awareness of token usage across tool calls, it is particularly well-suited for multi-context and long-running workflows. Use cases span software engineering, cybersecurity, financial analysis, research agents, and other domains requiring sustained reasoning and tool use.

Input type
Output Type
Input$3.00 - 6.00/M tokens
Output$15.00 - 22.50/M tokens
Context200.00K
Max Output64.00K
DeepSeek: DeepSeek-V3.2-Exp

deepseek/deepseek-v3.2-exp
32.87M tokens

DeepSeek-V3.2-Exp is an experimental model version, serving as an intermediate step toward the next-generation architecture. Built on the foundation of V3.1-Terminus, it introduces the DeepSeek Sparse Attention (DSA) mechanism—a sparse attention mechanism designed to explore and validate the optimization of training and inference efficiency in long-context scenarios. This experimental version represents the team's continuous research on more efficient Transformer architectures, with a specific focus on improving computational efficiency when processing long text sequences. For the first time, DSA enables fine-grained sparse attention, which significantly enhances the efficiency of long-context training and inference while maintaining almost unchanged model output quality.

Input type
Output Type
Input$0.22/M tokens
Output$0.33/M tokens
Context163.84K
Max Output65.54K
OpenAI: GPT-5 Codex

openai/gpt-5-codex
215.83M tokens

GPT-5-Codex is a specialized version of GPT-5 optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review. Compared to GPT-5, Codex is more steerable, adheres closely to developer instructions, and produces cleaner, higher-quality code outputs. Reasoning effort can be adjusted with the reasoning.effort parameter.

Codex integrates into developer environments including the CLI, IDE extensions, GitHub, and cloud tasks. It adapts reasoning effort dynamically—providing fast responses for small tasks while sustaining extended multi-hour runs for large projects. The model is trained to perform structured code reviews, catching critical flaws by reasoning over dependencies and validating behavior against tests. It also supports multimodal inputs such as images or screenshots for UI development and integrates tool use for search, dependency installation, and environment setup. Codex is intended specifically for agentic coding applications.
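As a sketch of how the reasoning.effort knob might be set, here is a Responses-style request body builder. The endpoint field names and their placement are assumptions based on the OpenAI Responses API shape; verify against the provider's current reference before use.

```python
# Hypothetical payload builder for a Responses-style request that sets
# reasoning effort. Field names are assumptions; check the live docs.
def build_codex_request(prompt: str, effort: str = "medium") -> dict:
    """Build a request payload with an adjustable reasoning.effort."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unsupported effort level: {effort}")
    return {
        "model": "openai/gpt-5-codex",
        "input": prompt,
        "reasoning": {"effort": effort},
    }

payload = build_codex_request("Refactor this function.", effort="high")
print(payload["reasoning"]["effort"])  # high
```

Higher effort trades latency for deeper multi-step work; the validation step simply guards against typos before the request is sent.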

Input type
Output Type
Input$1.25/M tokens
Output$10.00/M tokens
Context400.00K
Max Output128.00K
Qwen: Qwen3-Max

qwen/qwen3-max
250.88M tokens

Qwen3-Max is an updated release built on the Qwen3 series, offering major improvements in reasoning, instruction following, multilingual support, and long-tail knowledge coverage compared to the January 2025 version. It delivers higher accuracy in math, coding, logic, and science tasks, follows complex instructions in Chinese and English more reliably, reduces hallucinations, and produces higher-quality responses for open-ended Q&A, writing, and conversation. The model supports over 100 languages with stronger translation and commonsense reasoning, and is optimized for retrieval-augmented generation (RAG) and tool calling, though it does not include a dedicated “thinking” mode.

Input type
Output Type
Input$1.20 - 3.00/M tokens
Output$6.00 - 15.00/M tokens
Context256.00K
Max Output32.00K
Qwen: Qwen3-VL-Plus

qwen/qwen3-vl-plus
45.53M tokens

The Qwen3 series VL models effectively integrates thinking and non-thinking modes, achieving world-leading performance in visual agent capabilities on public benchmark datasets such as OS World. This version features comprehensive upgrades in areas like visual coding, spatial perception, and multimodal reasoning, significantly enhancing visual perception and recognition abilities, and supporting the understanding of ultra-long videos.

Input type
Output Type
Input$0.20 - 0.60/M tokens
Output$1.60 - 4.80/M tokens
Context262.14K
Max Output32.77K
xAI: Grok 4 Fast

x-ai/grok-4-fast
1090.81M tokens

Grok 4 Fast is xAI's latest multimodal model with SOTA cost-efficiency and a 2M-token context window. It comes in two flavors: non-reasoning and reasoning; reasoning can be enabled using the reasoning enabled parameter in the API.
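A minimal sketch of toggling the reasoning flavor per request, assuming a Chat Completions-style body with a top-level reasoning field as the description suggests (the exact field name should be verified against the API reference):

```python
# Hypothetical Chat Completions-style payload builder; the top-level
# "reasoning" field and its "enabled" flag follow the description above
# and should be checked against the live API reference.
def build_grok_request(prompt: str, reasoning: bool = False) -> dict:
    body = {
        "model": "x-ai/grok-4-fast",
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning:
        body["reasoning"] = {"enabled": True}  # opt in to the reasoning flavor
    return body

print("reasoning" in build_grok_request("hi"))  # False
print(build_grok_request("hi", reasoning=True)["reasoning"])
```

Omitting the field keeps the cheaper non-reasoning behavior by default, so existing callers are unaffected when the flag is introduced.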

Prompts and completions may be used by xAI or OpenRouter to improve future models.

Input type
Output Type
Input$0.20 - 0.40/M tokens
Output$0.50 - 1.00/M tokens
Context2.00M
Max Output30.00K
xAI: Grok 4 Fast Non-Reasoning

x-ai/grok-4-fast-non-reasoning
1016.89M tokens

Grok 4 Fast is xAI's latest multimodal model with SOTA cost-efficiency and a 2M-token context window. It comes in two flavors: non-reasoning and reasoning; reasoning can be enabled using the reasoning enabled parameter in the API.

Prompts and completions may be used by xAI or OpenRouter to improve future models.

Input type
Output Type
Input$0.20 - 0.40/M tokens
Output$0.50 - 1.00/M tokens
Context2.00M
Max Output30.00K
inclusionAI: Ling-flash-2.0

inclusionai/ling-flash-2.0
116.14M tokens

Ling-flash-2.0 is an open-source Mixture-of-Experts (MoE) language model developed under the Ling 2.0 architecture. It features 100 billion total parameters, with 6.1 billion activated during inference (4.8B non-embedding).

Trained on over 20 trillion tokens and refined with supervised fine-tuning and multi-stage reinforcement learning, the model demonstrates strong performance against dense models up to 40B parameters. It excels in complex reasoning, code generation, and frontend development.

Input type
Output Type
Input$0.28/M tokens
Output$2.80/M tokens
Context128.00K
Max Output32.00K
inclusionAI: Ring-flash-2.0

inclusionai/ring-flash-2.0
198.62M tokens

Ring-flash-2.0 is a reasoning ("thinking") model built on the Ling 2.0 MoE architecture, sharing its base with Ling-flash-2.0. Trained on over 20 trillion tokens and refined with supervised fine-tuning and multi-stage reinforcement learning, the model demonstrates strong performance against dense models up to 40B parameters. It excels in complex reasoning, code generation, and frontend development.

Input type
Output Type
Input$0.28/M tokens
Output$2.80/M tokens
Context128.00K
Max Output32.00K
Baidu: ERNIE-X1.1-Preview

baidu/ernie-x1.1-preview
821.28K tokens

The Wenxin Large Model X1.1 delivers significantly enhanced performance in question answering, tool invocation, agent capabilities, instruction following, logical reasoning, mathematical problem-solving, and coding tasks, with markedly improved factual accuracy. Its context window has been extended to 64K tokens, enabling longer inputs and dialogue histories, while maintaining response speed and improving the coherence of long-chain reasoning.

Input type
Output Type
Input$0.14/M tokens
Output$0.56/M tokens
Context65.54K
Max Output65.54K
MoonshotAI: Kimi K2 0905

moonshotai/kimi-k2-0905
84.75M tokens

Kimi K2 0905 is the September update of Kimi K2 0711. It is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It supports long-context inference up to 256k tokens, extended from the previous 128k.

This update improves agentic coding with higher accuracy and better generalization across scaffolds, and enhances frontend coding with more aesthetic and functional outputs for web, 3D, and related tasks. Kimi K2 is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. It excels across coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) benchmarks. The model is trained with a novel stack incorporating the MuonClip optimizer for stable large-scale MoE training.

Input type
Output Type
Input$0.60/M tokens
Output$2.50/M tokens
Context262.10K
Max Output262.10K
inclusionAI: Ling-mini-2.0

inclusionai/ling-mini-2.0
117.95M tokens

Ling-mini-2.0 is an open-source Mixture-of-Experts (MoE) large language model designed to balance strong task performance with high inference efficiency. It has 16B total parameters, with approximately 1.4B activated per token (about 789M non-embedding). Trained on over 20T tokens and refined via multi-stage supervised fine-tuning and reinforcement learning, it is reported to deliver strong results in complex reasoning and instruction following while keeping computational costs low. According to the upstream release, it reaches top-tier performance among sub-10B dense LLMs and in some cases matches or surpasses larger MoE models.

Input type
Output Type
Input$0.07/M tokens
Output$0.28/M tokens
Context128.00K
Max Output32.00K
inclusionAI: Ring-mini-2.0

inclusionai/ring-mini-2.0
257.60M tokens

Ring-mini-2.0 is a Mixture-of-Experts (MoE) model oriented toward high-throughput inference and extensively optimized on the Ling 2.0 architecture. It uses 16B total parameters with approximately 1.4B activated per token and is reported to deliver comprehensive reasoning performance comparable to sub-10B dense LLMs. The model shows strong results on logical reasoning, code generation, and mathematical tasks, supports 128K context windows, and reports generation speeds of 300+ tokens per second.

Input type
Output Type
Input$0.07/M tokens
Output$0.70/M tokens
Context128.00K
Max Output32.00K
xAI: Grok Code Fast 1

x-ai/grok-code-fast-1
619.50M tokens

Grok Code Fast 1 is a speedy and economical reasoning model that excels at agentic coding. With reasoning traces visible in the response, developers can steer Grok Code toward high-quality workflows.

Input type
Output Type
Input$0.20/M tokens
Output$1.50/M tokens
Context256.00K
Max Output10.00K
DeepSeek: DeepSeek V3.1

deepseek/deepseek-chat-v3.1
1286.17M tokens

DeepSeek-V3.1 is a large hybrid reasoning model (671B parameters, 37B active) that supports both thinking and non-thinking modes via prompt templates. It extends the DeepSeek-V3 base with a two-phase long-context training process, reaching up to 128K tokens, and uses FP8 microscaling for efficient inference.

The model improves tool use, code generation, and reasoning efficiency, achieving performance comparable to DeepSeek-R1 on difficult benchmarks while responding more quickly. It supports structured tool calling, code agents, and search agents, making it suitable for research, coding, and agentic workflows.

It succeeds the DeepSeek V3-0324 model and performs well on a variety of tasks.

Input type
Output Type
Input$0.28/M tokens
Output$1.11/M tokens
Context128.00K
Max Output65.54K
VolcanoEngine: Doubao-Seed-1.6-vision

volcengine/doubao-seed-1-6-vision
55.08M tokens

Doubao-Seed-1.6-vision is the Doubao 1.6 visual deep-reasoning model, featuring stronger general multimodal understanding and reasoning capabilities.

Input type
Output Type
Input$0.11 - 0.34/M tokens
Output$1.13 - 3.39/M tokens
Context256.00K
Max Output32.00K
OpenAI: GPT-5

openai/gpt-5
1180.51M tokens

GPT-5 is OpenAI’s most advanced model, offering major improvements in reasoning, code quality, and user experience. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy in high-stakes use cases. It supports test-time routing features and advanced prompt understanding, including user-specified intent like "think hard about this." Improvements include reductions in hallucination, sycophancy, and better performance in coding, writing, and health-related tasks.

Input type
Output Type
Input$1.25/M tokens
Output$10.00/M tokens
Context400.00K
Max Output128.00K
OpenAI: GPT-5 Chat

openai/gpt-5-chat
61.00M tokens

GPT-5 Chat is designed for advanced, natural, multimodal, and context-aware conversations for enterprise applications.

Input type
Output Type
Input$1.25/M tokens
Output$10.00/M tokens
Context128.00K
Max Output16.38K
OpenAI: GPT-5 Mini

openai/gpt-5-mini
412.40M tokens

GPT-5 Mini is a compact version of GPT-5, designed to handle lighter-weight reasoning tasks. It provides the same instruction-following and safety-tuning benefits as GPT-5, but with reduced latency and cost. GPT-5 Mini is the successor to OpenAI's o4-mini model.

Input type
Output Type
Input$0.25/M tokens
Output$2.00/M tokens
Context400.00K
Max Output128.00K
OpenAI: GPT-5 Nano

openai/gpt-5-nano
275.67M tokens

GPT-5-Nano is the smallest and fastest variant in the GPT-5 system, optimized for developer tools, rapid interactions, and ultra-low latency environments. While limited in reasoning depth compared to its larger counterparts, it retains key instruction-following and safety features. It is the successor to GPT-4.1-nano and offers a lightweight option for cost-sensitive or real-time applications.

Input type
Output Type
Input$0.05/M tokens
Output$0.40/M tokens
Context400.00K
Max Output128.00K
Anthropic: Claude Opus 4.1

anthropic/claude-opus-4.1
77.90M tokens

Claude Opus 4.1 is an updated version of Anthropic’s flagship model, offering improved performance in coding, reasoning, and agentic tasks. It achieves 74.5% on SWE-bench Verified and shows notable gains in multi-file code refactoring, debugging precision, and detail-oriented reasoning. The model supports extended thinking up to 64K tokens and is optimized for tasks involving research, data analysis, and tool-assisted reasoning.

Input type
Output Type
Input$15.00/M tokens
Output$75.00/M tokens
Context200.00K
Max Output32.00K
StepFun: Step-3

stepfun/step-3
599.79K tokens

Step-3 is a brand-new multimodal reasoning model that accepts both image and text inputs and produces text responses. It is capable of deep thinking and can autonomously carry out a reasoning process: before generating the final output, it completes a "thinking" phase (for example, surfacing its reasoning via a reasoning field), which improves both the accuracy of the final result and the depth of reasoning. When calling the model, developers do not need to supply elaborate system prompts (sys_prompt), as the model automatically leverages its built-in deep-thinking capability.
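Client code typically reads the thinking phase and the final answer separately. A minimal sketch, assuming a Chat Completions-style response where the reasoning text sits alongside the message content (the exact field location should be confirmed against the provider's docs):

```python
# Minimal helper for separating the "thinking" phase from the final
# answer in a Step-3-style response dict. The location of the
# "reasoning" field is an assumption based on the description above.
def split_step3_response(resp: dict) -> tuple[str, str]:
    msg = resp["choices"][0]["message"]
    return msg.get("reasoning", ""), msg["content"]

# Fabricated demo response for illustration only.
demo = {"choices": [{"message": {"reasoning": "First, compare units...",
                                 "content": "The answer is 42."}}]}
print(split_step3_response(demo))
```

Using .get for the reasoning field keeps the helper safe for responses that omit the thinking phase.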

Input type
Output Type
Input$0.21 - 0.57/M tokens
Output$0.57 - 1.42/M tokens
Context65.54K
Max Output65.54K
Z.AI: GLM 4.5

z-ai/glm-4.5
52.14M tokens

GLM-4.5 is our latest flagship foundation model, purpose-built for agent-based applications. It leverages a Mixture-of-Experts (MoE) architecture and supports a context length of up to 128k tokens. GLM-4.5 delivers significantly enhanced capabilities in reasoning, code generation, and agent alignment. It supports a hybrid inference mode with two options, a "thinking mode" designed for complex reasoning and tool use, and a "non-thinking mode" optimized for instant responses. Users can control the reasoning behaviour with the reasoning enabled boolean.

Input type
Output Type
Input$0.35 - 0.56/M tokens
Output$1.54 - 2.25/M tokens
Context128.00K
Max Output96.00K
Z.AI: GLM 4.5 Air

z-ai/glm-4.5-air
121.95M tokens

GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a "thinking mode" for advanced reasoning and tool use, and a "non-thinking mode" for real-time interaction. Users can control the reasoning behaviour with the reasoning enabled boolean.

Input type
Output Type
Input$0.11 - 0.17/M tokens
Output$0.56 - 1.12/M tokens
Context128.00K
Max Output96.00K
Qwen: Qwen3-Coder-Plus

qwen/qwen3-coder-plus
281.86M tokens

Powered by Qwen3, this is a powerful Coding Agent that excels in tool calling and environment interaction to achieve autonomous programming. It combines outstanding coding proficiency with versatile general-purpose abilities.

Input type
Output Type
Input$1.00 - 6.00/M tokens
Output$5.00 - 60.00/M tokens
Context1000.00K
Max Output65.54K
Google: Gemini 2.5 Flash Lite

google/gemini-2.5-flash-lite
1330.76M tokens

Gemini 2.5 Flash-Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for ultra-low latency and cost efficiency. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models. By default, "thinking" (i.e. multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the Reasoning API parameter to selectively trade off cost for intelligence.
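A sketch of a Gemini-style generation config that opts in to thinking. The thinkingConfig/thinkingBudget field names follow Gemini API conventions but should be checked against current documentation; a budget of 0 keeps the default speed-first behavior.

```python
# Hypothetical generation-config builder; field names follow Gemini API
# conventions but are assumptions relative to this listing.
def flash_lite_config(enable_thinking: bool, budget_tokens: int = 1024) -> dict:
    return {
        "model": "google/gemini-2.5-flash-lite",
        "generationConfig": {
            "thinkingConfig": {
                # 0 disables thinking (the default for Flash-Lite);
                # a positive budget trades cost/latency for reasoning depth.
                "thinkingBudget": budget_tokens if enable_thinking else 0
            }
        },
    }

print(flash_lite_config(False)["generationConfig"]["thinkingConfig"])  # {'thinkingBudget': 0}
```

This mirrors the description's cost/intelligence trade-off: callers pay for reasoning tokens only on the requests that need them.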

Input type
Output Type
Input$0.10/M tokens
Output$0.40/M tokens
Context1.05M
Max Output65.53K
Qwen: Qwen3 235B A22B Instruct 2507

qwen/qwen3-235b-a22b-2507
21.14M tokens

Qwen3-235B-A22B-Instruct-2507 is a multilingual, instruction-tuned mixture-of-experts language model based on the Qwen3-235B architecture, with 22B active parameters per forward pass. It is optimized for general-purpose text generation, including instruction following, logical reasoning, math, code, and tool usage. The model supports a native 262K context length and does not implement "thinking mode" (<think> blocks).

Compared to its base variant, this version delivers significant gains in knowledge coverage, long-context reasoning, coding benchmarks, and alignment with open-ended tasks. It is particularly strong on multilingual understanding, math reasoning (e.g., AIME, HMMT), and alignment evaluations like Arena-Hard and WritingBench.

Input type
Output Type
Input$0.28/M tokens
Output$1.11/M tokens
Context256.00K
Max Output128.00K
Qwen: Qwen3 235B A22B Thinking 2507

qwen/qwen3-235b-a22b-thinking-2507
8.37M tokens

Qwen3-235B-A22B-Thinking-2507 is a high-performance, open-weight Mixture-of-Experts (MoE) language model optimized for complex reasoning tasks. It activates 22B of its 235B parameters per forward pass and natively supports up to 262,144 tokens of context. This "thinking-only" variant enhances structured logical reasoning, mathematics, science, and long-form generation, showing strong benchmark performance across AIME, SuperGPQA, LiveCodeBench, and MMLU-Redux. It always operates in reasoning mode (emitting a closing </think> tag at the end of its thinking phase) and is designed for high-token outputs (up to 81,920 tokens) in challenging domains.

The model is instruction-tuned and excels at step-by-step reasoning, tool use, agentic workflows, and multilingual tasks. This release represents the most capable open-source variant in the Qwen3-235B series, surpassing many closed models in structured reasoning use cases.
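Since completions from this thinking variant carry a closing </think> tag at the end of the reasoning phase, client code commonly splits the trace from the final answer; a minimal sketch:

```python
# Split a completion into (reasoning, answer) on the closing </think>
# tag; if the tag is absent, the whole text is treated as the answer.
def split_thinking(text: str) -> tuple[str, str]:
    marker = "</think>"
    head, sep, tail = text.partition(marker)
    if not sep:  # no marker found
        return "", text
    return head.strip(), tail.strip()

print(split_thinking("steps...</think>42"))  # ('steps...', '42')
```

str.partition splits on the first occurrence only, so any literal "</think>" the model quotes later in its answer is left intact.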

Input type
Output Type
Input$0.28/M tokens
Output$2.78/M tokens
Context256.00K
Max Output128.00K
Qwen: Qwen3-Coder

qwen/qwen3-coder
20.70M tokens

Qwen3-Coder-480B-A35B-Instruct is a Mixture-of-Experts (MoE) code generation model developed by the Qwen team. It is optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning over repositories. The model features 480 billion total parameters, with 35 billion active per forward pass (8 out of 160 experts).

Pricing for the Alibaba endpoints varies by context length: once a request exceeds 128k input tokens, the higher rate applies.
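The tiered pricing can be sketched as a small cost estimator. The higher-tier rates below are placeholders for illustration, not the live prices; only the base-tier rates come from this listing.

```python
# Illustrative tiered cost calculation: when input exceeds 128K tokens,
# a higher per-token rate applies to the whole request. The higher-tier
# rates here are hypothetical placeholders.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens > 128_000:
        in_rate, out_rate = 3.00, 12.00   # $/M tokens, hypothetical higher tier
    else:
        in_rate, out_rate = 1.25, 5.01    # $/M tokens, listed base tier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(round(request_cost(100_000, 10_000), 4))  # 0.1751
```

Because the tier is decided by input size alone, trimming a prompt just under the 128K boundary can change the rate applied to every token in the request.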

Input type
Output Type
Input$1.25/M tokens
Output$5.01/M tokens
Context256.00K
Max Output128.00K
MoonshotAI: Kimi K2 0711

moonshotai/kimi-k2-0711
10.98M tokens

Kimi K2 Instruct is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks. It supports long-context inference up to 128K tokens and is designed with a novel training stack that includes the MuonClip optimizer for stable large-scale MoE training.

Input type
Output Type
Input$0.56/M tokens
Output$2.23/M tokens
Context128.00K
Max Output32.00K
xAI: Grok 4

x-ai/grok-4
82.79M tokens

Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning traces are not exposed, reasoning cannot be disabled, and the reasoning effort cannot be specified. Pricing increases once the total tokens in a given request exceed 128k.

Input type
Output Type
Input$3.00 - 6.00/M tokens
Output$15.00 - 30.00/M tokens
Context256.00K
Max Output256.00K
Google: Gemini 2.5 Flash

google/gemini-2.5-flash
468.87M tokens

Gemini 2.5 Flash is Google's state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks. It includes built-in "thinking" capabilities, enabling it to provide responses with greater accuracy and nuanced context handling.

Additionally, Gemini 2.5 Flash is configurable through the "max tokens for reasoning" parameter, as described in the documentation.

Input type
Output Type
Input$0.30/M tokens
Output$2.50/M tokens
Context1.05M
Max Output65.53K
Google: Gemini 2.5 Pro

google/gemini-2.5-pro
487.09M tokens

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.

Input type
Output Type
Input$1.25 - 2.50/M tokens
Output$10.00 - 15.00/M tokens
Context1.05M
Max Output65.53K
DeepSeek: R1 0528

deepseek/deepseek-r1-0528
218.12M tokens

The May 28th update to the original DeepSeek R1. Performance is on par with OpenAI o1, but the model is fully open-source with open reasoning tokens. It has 671B parameters, with 37B active per inference pass.

Input type
Output Type
Input$0.56/M tokens
Output$2.23/M tokens
Context64.00K
Max Output64.00K
Anthropic: Claude Opus 4

anthropic/claude-opus-4
57.76M tokens

Claude Opus 4 is benchmarked as the world’s best coding model, at time of release, bringing sustained performance on complex, long-running tasks and agent workflows. It sets new benchmarks in software engineering, achieving leading results on SWE-bench (72.5%) and Terminal-bench (43.2%). Opus 4 supports extended, agentic workflows, handling thousands of task steps continuously for hours without degradation.

Input type
Output Type
Input$15.00/M tokens
Output$75.00/M tokens
Context200.00K
Max Output32.00K
Anthropic: Claude Sonnet 4

anthropic/claude-sonnet-4
4618.43M tokens

Claude Sonnet 4 significantly enhances the capabilities of its predecessor, Sonnet 3.7, excelling in both coding and reasoning tasks with improved precision and controllability. Achieving state-of-the-art performance on SWE-bench (72.7%), Sonnet 4 balances capability and computational efficiency, making it suitable for a broad range of applications from routine coding tasks to complex software development projects. Key enhancements include improved autonomous codebase navigation, reduced error rates in agent-driven workflows, and increased reliability in following intricate instructions. Sonnet 4 is optimized for practical everyday use, providing advanced reasoning capabilities while maintaining efficiency and responsiveness in diverse internal and external scenarios.

Input type
Output Type
Input$3.00 - 6.00/M tokens
Output$15.00 - 22.50/M tokens
Context1000.00K
Max Output64.00K
Google: Gemma 3 12B

google/gemma-3-12b-it
2.83M tokens

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is the second-largest model in the Gemma 3 family after Gemma 3 27B.

Input type
Output Type
Input$0.02/M tokens
Output$0.10/M tokens
Context128.00K
Max Output128.00K
Qwen: Qwen3 14B

qwen/qwen3-14b
14.36M tokens

Qwen3-14B is a dense 14.8B parameter causal language model from the Qwen3 series, designed for both complex reasoning and efficient dialogue. It supports seamless switching between a "thinking" mode for tasks like math, programming, and logical inference, and a "non-thinking" mode for general-purpose conversation. The model is fine-tuned for instruction-following, agent tool use, creative writing, and multilingual tasks across 100+ languages and dialects. It natively handles 32K token contexts and can extend to 131K tokens using YaRN-based scaling.

Input type
Output Type
Input$0.14/M tokens
Output$1.40/M tokens
Context32.00K
Max Output32.00K
OpenAI: o4 Mini

openai/o4-mini
110.78M tokens

OpenAI o4-mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities. It supports tool use and demonstrates competitive reasoning and coding performance across benchmarks like AIME (99.5% with Python) and SWE-bench, outperforming its predecessor o3-mini and even approaching o3 in some domains.

Despite its smaller size, o4-mini exhibits high accuracy in STEM tasks, visual problem solving (e.g., MathVista, MMMU), and code editing. It is especially well-suited for high-throughput scenarios where latency or cost is critical. Thanks to its efficient architecture and refined reinforcement learning training, o4-mini can chain tools, generate structured outputs, and solve multi-step tasks with minimal delay—often in under a minute.

Input type
Output Type
Input$1.10/M tokens
Output$4.40/M tokens
Context200.00K
Max Output100.00K
OpenAI: GPT-4.1

openai/gpt-4.1
896.98M tokens

GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding (54.6% SWE-bench Verified), instruction compliance (87.4% IFEval), and multimodal understanding benchmarks. It is tuned for precise code diffs, agent reliability, and high recall in large document contexts, making it ideal for agents, IDE tooling, and enterprise knowledge retrieval.

Input type
Output Type
Input$2.00/M tokens
Output$8.00/M tokens
Context1.05M
Max Output32.77K
OpenAI: GPT-4.1 Mini

openai/gpt-4.1-mini
1139.64M tokens

GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider’s polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.

Input type
Output Type
Input$0.40/M tokens
Output$1.60/M tokens
Context1.05M
Max Output32.77K
OpenAI: GPT-4.1 Nano

openai/gpt-4.1-nano
239.98M tokens

For tasks that demand low latency, GPT‑4.1 nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding – even higher than GPT‑4o mini. It’s ideal for tasks like classification or autocompletion.

Input type
Output Type
Input$0.10/M tokens
Output$0.40/M tokens
Context1.05M
Max Output32.77K
Meta: Llama 4 Scout Instruct

meta/llama-4-scout-17b-16e-instruct
99.11K tokens

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout uses 16 experts per forward pass and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.

Input type
Output Type
Input$0.08/M tokens
Output$0.40/M tokens
Context-
Max Output-
Google: Gemini 2.0 Flash Lite

google/gemini-2.0-flash-lite-001
87.11M tokens

Gemini 2.0 Flash Lite offers a significantly faster time to first token (TTFT) than Gemini 1.5 Flash, while maintaining quality on par with larger models like Gemini 1.5 Pro, all at extremely economical token prices.

Input type
Output Type
Input$0.07/M tokens
Output$0.30/M tokens
Context1.05M
Max Output8.19K
Anthropic: Claude 3.7 Sonnet

anthropic/claude-3.7-sonnet
2794.89M tokens

Claude 3.7 Sonnet is an advanced large language model with improved reasoning, coding, and problem-solving capabilities. It introduces a hybrid reasoning approach, allowing users to choose between rapid responses and extended, step-by-step processing for complex tasks. The model demonstrates notable improvements in coding, particularly in front-end development and full-stack updates, and excels in agentic workflows, where it can autonomously navigate multi-step processes.

Claude 3.7 Sonnet maintains performance parity with its predecessor in standard mode while offering an extended reasoning mode for enhanced accuracy in math, coding, and instruction-following tasks.

Input type
Output Type
Input$3.00/M tokens
Output$15.00/M tokens
Context200.00K
Max Output64.00K
Google: Gemini 2.0 Flash

google/gemini-2.0-flash
32.30M tokens

Gemini 2.0 Flash offers a significantly faster time to first token (TTFT) than Gemini 1.5 Flash, while maintaining quality on par with larger models like Gemini 1.5 Pro. It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements combine to deliver more seamless and robust agentic experiences.

Input type
Output Type
Input$0.15/M tokens
Output$0.60/M tokens
Context1.05M
Max Output8.19K
Meta: Llama 3.3 70B Instruct

meta/llama-3.3-70b-instruct
192.38M tokens

The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction-tuned generative model with 70B parameters (text in/text out). The instruction-tuned, text-only model is optimized for multilingual dialogue use cases and outperforms many available open-source and closed chat models on common industry benchmarks.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Input type
Output Type
Input$0.60/M tokens
Output$1.20/M tokens
Context-
Max Output-
Anthropic: Claude 3.5 Haiku

anthropic/claude-3.5-haiku
14.93M tokens

Claude 3.5 Haiku offers enhanced capabilities in speed, coding accuracy, and tool use. Engineered to excel in real-time applications, it delivers the quick response times essential for dynamic tasks such as chat interactions and immediate coding suggestions.

This makes it highly suitable for environments that demand both speed and precision, such as software development, customer service bots, and data management systems.

Input type
Output Type
Input$0.80/M tokens
Output$4.00/M tokens
Context200.00K
Max Output8.19K
Anthropic: Claude 3.5 Sonnet

18.89M tokens

Retiring soon; requests to this model now fall back to Claude 3.7 Sonnet.

New Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at:

  • Coding: scores ~49% on SWE-Bench Verified, higher than the previous best score, without any fancy prompt scaffolding
  • Data science: augments human data science expertise; navigates unstructured data while using multiple tools for insights
  • Visual processing: excels at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond the text alone
  • Agentic tasks: exceptional tool use, making it great at agentic tasks (i.e., complex, multi-step problem-solving tasks that require engaging with other systems)

Input type
Output Type
Input$3.00/M tokens
Output$15.00/M tokens
Context200.00K
Max Output8.19K
OpenAI: GPT-4o-mini

openai/gpt-4o-mini
801.07M tokens

GPT-4o mini is OpenAI's newest model after GPT-4 Omni, supporting both text and image inputs with text outputs.

As their most advanced small model, it is many multiples more affordable than other recent frontier models, and more than 60% cheaper than GPT-3.5 Turbo. It maintains SOTA intelligence, while being significantly more cost-effective.

GPT-4o mini achieves an 82% score on MMLU and currently ranks higher than GPT-4 on common chat-preference leaderboards.

Input type
Output Type
Input$0.15/M tokens
Output$0.60/M tokens
Context128.00K
Max Output16.38K
OpenAI: GPT-4o

openai/gpt-4o
164.84M tokens

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.

For benchmarking against other models, it was briefly called "im-also-a-good-gpt2-chatbot".

Input type
Output Type
Input$2.50/M tokens
Output$10.00/M tokens
Context128.00K
Max Output16.38K
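All prices in this catalog are quoted per million tokens, so a request's cost is simply (input tokens / 1,000,000) × input price plus (output tokens / 1,000,000) × output price. As a minimal sketch, using GPT-4o's listed rates ($2.50/M input, $10.00/M output) as an example; the function name and figures below are illustrative, not part of any provider SDK:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Estimate a single request's cost in USD from per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price \
         + (output_tokens / 1_000_000) * output_price

# GPT-4o listed rates: $2.50/M input, $10.00/M output
cost = request_cost(12_000, 800, 2.50, 10.00)
print(f"${cost:.4f}")  # 12k input + 800 output tokens -> $0.0380
```

The same arithmetic applies to every card above; only the two per-million rates change from model to model.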