<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by SiliconFlow on Medium]]></title>
        <description><![CDATA[Stories by SiliconFlow on Medium]]></description>
        <link>https://medium.com/@SiliconFlowAI?source=rss-91ba7914dfb6------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*yFRYHEEib9y9ZiQ3FGy3xA.png</url>
            <title>Stories by SiliconFlow on Medium</title>
            <link>https://medium.com/@SiliconFlowAI?source=rss-91ba7914dfb6------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 06 Apr 2026 20:02:21 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@SiliconFlowAI/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[GLM-4.7 Now on SiliconFlow: Advanced Coding, Reasoning & Tool Use Capabilities]]></title>
            <link>https://medium.com/@SiliconFlowAI/glm-4-7-now-on-siliconflow-advanced-coding-reasoning-tool-use-capabilities-40967c6c3e4d?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/40967c6c3e4d</guid>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[tool-use]]></category>
            <category><![CDATA[zhipu-ai]]></category>
            <category><![CDATA[reasoning]]></category>
            <category><![CDATA[siliconflow]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Thu, 25 Dec 2025 14:22:21 GMT</pubDate>
            <atom:updated>2025-12-25T14:22:21.335Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zzK4eUN6YkNs3Mi-Fm7cng.png" /></figure><p>We’re excited to announce that <a href="https://www.siliconflow.com/models/glm-4-7"><strong>GLM-4.7</strong></a>, Z.ai’s latest flagship model, is now available on SiliconFlow with Day 0 support. Compared with its predecessor <a href="https://www.siliconflow.com/models/glm-4-6">GLM-4.6</a>, this release brings significant advancements across coding, complex reasoning, and tool utilization — delivering performance that rivals or even outperforms industry leaders like <a href="https://www.anthropic.com/news/claude-sonnet-4-5">Claude Sonnet 4.5</a> and <a href="https://openai.com/index/gpt-5-1/">GPT-5.1</a>.</p><p>Currently, SiliconFlow supports the entire GLM model series, including <a href="https://www.siliconflow.com/models/glm-4-5">GLM-4.5</a>, <a href="https://www.siliconflow.com/models/glm-4-5-air">GLM-4.5-Air</a>, <a href="https://www.siliconflow.com/models/glm-4-5v">GLM-4.5V</a>, <a href="https://www.siliconflow.com/models/glm-4-6">GLM-4.6</a>, <a href="https://www.siliconflow.com/models/glm-4-6v">GLM-4.6V</a>, and now <a href="https://www.siliconflow.com/models/glm-4-7">GLM-4.7</a>.</p><h3>SiliconFlow Day 0 support with:</h3><ul><li><strong>Competitive Pricing</strong>: GLM-4.7 $0.6/M tokens (input) and $2.2/M tokens (output)</li><li><strong>205K Context Window</strong>: Tackle complex coding tasks, deep document analysis, and extended agentic workflows.</li><li><strong>Anthropic &amp; OpenAI-Compatible APIs:</strong> Deploy via SiliconFlow with seamless integration into <a href="https://claude.com/product/claude-code">Claude Code</a>, <a href="https://kilo.ai/">Kilo Code</a>, <a href="https://cline.bot/">Cline</a>, <a href="https://roocode.com/">Roo Code</a>, and other mainstream agent workflows with significant improvements on complex tasks.</li></ul><h3>What Makes GLM-4.7 
Special</h3><p><strong>GLM-4.7</strong>, your new coding partner, comes with the following features:</p><h3>Core Coding Excellence</h3><p>GLM-4.7 sets a new standard for multilingual agentic coding and terminal-based tasks. Compared to its predecessor, the improvements are substantial:</p><ul><li><strong>73.8% (+5.8%)</strong> on SWE-bench Verified</li><li><strong>66.7% (+12.9%)</strong> on SWE-bench Multilingual</li><li><strong>41% (+16.5%)</strong> on Terminal Bench 2.0</li></ul><p>The model now supports “thinking before acting,” enabling more reliable performance on complex tasks across mainstream agent frameworks, including Claude Code, Kilo Code, Cline, and Roo Code.</p><h3>Vibe Coding</h3><p>GLM-4.7 takes a major leap forward in UI quality. It produces cleaner, more modern webpages and generates better-looking slides with more accurate layout and sizing. Whether you’re prototyping interfaces or creating presentations, the visual output quality is noticeably enhanced.</p><h3>Advanced Tool Use</h3><p>Tool utilization has been significantly enhanced. On multi-step benchmarks like τ²-Bench and web-browsing tasks via BrowseComp, GLM-4.7 surpasses both Claude Sonnet 4.5 and GPT-5.1 High, demonstrating superior capability for complex, real-world workflows.</p><h3>Complex Reasoning Capabilities</h3><p>Mathematical and reasoning abilities see a substantial boost, with GLM-4.7 achieving <strong>42.8% (+12.4%)</strong> on the HLE (Humanity’s Last Exam) benchmark compared to GLM-4.6. 
Moreover, you can see significant improvements in many other areas, such as chat, creative writing, and role-play.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cbQMsm-9Od8ZcspU" /></figure><p>Whether it’s coding, creativity, or complex reasoning — get started now to see what GLM-4.7 brings to your workflow.</p><h3>Get Started Immediately</h3><ol><li><strong>Explore:</strong> Try <a href="https://www.siliconflow.com/models/glm-4-7">GLM-4.7</a> in the SiliconFlow playground.</li><li><strong>Integrate:</strong> Use our OpenAI/Anthropic-compatible API. Explore the full API specifications in the <a href="https://docs.siliconflow.com/en/api-reference/chat-completions/chat-completions">SiliconFlow API documentation</a>.</li></ol><pre>import requests<br>url = &quot;https://api.siliconflow.com/v1/chat/completions&quot;<br>payload = {<br>    &quot;model&quot;: &quot;zai-org/GLM-4.7&quot;,<br>    &quot;messages&quot;: [<br>        {<br>            &quot;role&quot;: &quot;system&quot;,<br>            &quot;content&quot;: &quot;You are an assistant&quot;<br>        },<br>        {<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: &quot;What&#39;s the weather like in America?&quot;<br>        }<br>    ],<br>    &quot;stream&quot;: True,<br>    &quot;max_tokens&quot;: 4096,<br>    &quot;enable_thinking&quot;: True,<br>    &quot;temperature&quot;: 1,<br>    &quot;top_p&quot;: 0.95<br>}<br>headers = {<br>    &quot;Authorization&quot;: &quot;Bearer &lt;token&gt;&quot;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;<br>}<br>response = requests.post(url, json=payload, headers=headers)<br>print(response.text)</pre><ul><li><a href="https://siliconflow.com/contact">Business or Sales Inquiries →</a></li><li><a href="https://discord.com/invite/siliconflow">Join our Discord community now →</a></li><li><a href="https://x.com/saborrolab">Follow us on X for the latest updates →</a></li><li><a 
href="https://cloud.siliconflow.com/models">Explore all available models on SiliconFlow →</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=40967c6c3e4d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GLM-4.6V Now on SiliconFlow: Native Multimodal Tool Use Meets SoTA Visual Intelligence]]></title>
            <link>https://medium.com/@SiliconFlowAI/glm-4-6v-now-on-siliconflow-native-multimodal-tool-use-meets-sota-visual-intelligence-b638150246fc?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/b638150246fc</guid>
            <category><![CDATA[zhipu-ai]]></category>
            <category><![CDATA[vlm]]></category>
            <category><![CDATA[siliconflow]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Thu, 18 Dec 2025 14:22:26 GMT</pubDate>
            <atom:updated>2025-12-18T14:22:26.086Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0HugPmqYXUsmp3ZwgwxLTA.png" /></figure><blockquote><em>TL;DR: <strong>GLM-4.6V</strong>, Z.ai’s latest multimodal large language model, is now <strong>available on SiliconFlow</strong>. Featuring a <strong>131K</strong> multimodal context window and native <strong>function calling</strong> integration, it delivers <strong>SoTA</strong> performance in <strong>visual understanding and reasoning</strong> — seamlessly bridging the gap between “<strong>visual perception</strong>” and “<strong>executable action</strong>”. The GLM-4.6V series provides a unified technical foundation for multimodal agents in real-world business scenarios. Try <strong>GLM-4.6V</strong> now and level up your <strong>multimodal agents</strong> with <strong>SiliconFlow APIs</strong>.</em></blockquote><p>We are thrilled to announce that <a href="https://www.siliconflow.com/models/glm-4-6v"><strong>GLM-4.6V</strong></a>, <a href="https://huggingface.co/zai-org/GLM-4.6V">Z.ai</a>’s latest multimodal foundation model designed for cloud and enterprise-grade scenarios, is now available on <a href="https://www.siliconflow.com/models"><strong>SiliconFlow</strong></a>. 
It integrates <strong>native multimodal function calling capability</strong> and excels in <strong>long-context visual reasoning</strong>, directly closing the loop from<strong> perception to understanding to execution.</strong></p><p>Now, through SiliconFlow’s <strong>GLM-4.6V</strong> API, you can expect:</p><ul><li><strong>Budget-friendly Pricing:</strong> GLM-4.6V $0.30/M tokens (input) and $0.90/M tokens (output)</li><li><strong>131K Context Window:</strong> Enables processing lengthy industry reports, extensive slide decks, or long-form video content</li><li><strong>Seamless Integration:</strong> Instantly deploy via SiliconFlow’s OpenAI-compatible API, or plug into your existing agentic frameworks, automation tools, or workflows.</li></ul><p>Whether you are building agents, workflows, or tools for:</p><ul><li><strong>Rich-Text Content Creation:</strong> Convert papers, reports, and slides into polished posts for social media and knowledge bases</li><li><strong>Design-to-Code Automation:</strong> Upload screenshots/designs for pixel-level HTML/CSS/JS code generation</li><li><strong>Business Document Processing: </strong>Process reports to extract metrics and synthesize comparative tables</li><li><strong>Video Content Operations: </strong>Summarize, tag, and extract insights at scale</li></ul><p>Through SiliconFlow’s production-ready API, you can leverage GLM-4.6V to power your multimodal agents in minutes — no cost concerns, no engineering overhead.</p><p>Let’s dive into the key capabilities with live demos from the SiliconFlow Platform.</p><h3>Key Features &amp; Benchmark Performance</h3><p>In most LLM pipelines, tool calling is still text-only: even for image or document tasks, everything must be converted into text first, then back again. This process potentially leads to information loss and increases system complexity. 
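To make that round-trip concrete, here is a minimal sketch of the text-only pattern just described, where an image must be flattened to text before the model ever sees it. The `ocr_to_text` helper is hypothetical, standing in for whatever OCR or captioning step such a pipeline would use:

```python
# Sketch of the text-only tool-calling pattern described above: every
# image is flattened to text before the model sees it, so layout and
# visual cues are lost. `ocr_to_text` is a hypothetical placeholder.
def ocr_to_text(image_bytes: bytes) -> str:
    # A real pipeline would run OCR or captioning here, discarding
    # layout, color, and other visual information in the process.
    return "extracted text from image"

def build_text_only_messages(image_bytes: bytes, question: str) -> list:
    extracted = ocr_to_text(image_bytes)
    # The model receives only the lossy text rendering of the image.
    return [{"role": "user", "content": f"{extracted}\n\n{question}"}]
```

Anything the conversion step drops is invisible to the model from that point on, which is exactly the information loss described above.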
GLM-4.6V changes this with <strong>native multimodal tool calling</strong> capability:</p><ul><li>Multimodal Input: Images, UI screenshots, and document pages can be passed directly as tool arguments, avoiding manual text conversion and preserving layout and visual cues.</li><li>Multimodal Output: The model can directly interpret tool results such as search pages, charts, rendered web screenshots, or product images, and feed them back into its reasoning and final response.</li></ul><p>By closing the loop from <strong>perception → understanding → execution</strong>, GLM-4.6V supports the following key features:</p><ul><li><strong>Rich-Text Content Understanding and Creation: </strong>Accurately understands complex text, charts, tables, and formulas, then autonomously invokes visual tools to crop key visuals during generation, and audits image quality to compose publication-ready content perfect for social media &amp; knowledge bases.</li><li><strong>Visual Web Search: </strong>Recognizes search intent and autonomously triggers appropriate search tools, then comprehends and aligns the mixed visual-textual results to identify relevant information, and finally performs reasoning to deliver structured, visually-rich answers.</li><li><strong>Frontend Replication &amp; Visual Interaction: </strong>Achieves <strong>pixel-level </strong>replication by identifying layouts, components, and color schemes from screenshots to generate high-fidelity <strong>HTML/CSS/JS code</strong>, then lets you refine it interactively — just circle an element and tell it what you want, like “make this button bigger and change it to green.”</li><li><strong>Long-Context Understanding: </strong>Processes ~150 pages of documents, 200 slides, or a one-hour video in a single pass with its 131K context window, enabling tasks like analyzing financial reports or summarizing an entire football match while pinpointing specific goal events and timestamps.</li></ul><p>GLM-4.6V has also been evaluated 
across <strong>20+</strong> mainstream multimodal benchmarks including <strong>MMBench</strong>, <strong>MathVista</strong>, and <strong>OCRBench</strong>, achieving SoTA performance among open-source models. It matches or outperforms comparable-scale models like<strong> Qwen3-VL-235B</strong>, <strong>Kimi-VL-A3B-Thinking-2506</strong>, and <strong>Step3–321B</strong> in key capabilities: multimodal understanding, multimodal agentic tasks, and long-context processing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bVIyVUbT0tRuEPlT" /></figure><h3>Techniques</h3><p><strong>GLM-4.6V sets the technical foundation for multimodal agents in real-world business scenarios.</strong> To achieve this performance, GLM-4.6V introduces a comprehensive suite of innovations:</p><ul><li><strong>Model architecture &amp; long-sequence modeling: </strong>GLM-4.6V is continually pre-trained on long-context image–text data, with visual–language compression alignment (inspired by Glyph) to better couple visual encoding with linguistic semantics.</li><li><strong>Multimodal world knowledge: </strong>A <strong>billion-scale multimodal perception and world-knowledge corpus was introduced</strong> to enhance both basic visual understanding and the accuracy and completeness of cross-modal QA.</li><li><strong>Agentic data &amp; MCP extensions: </strong>Through large-scale synthetic <strong>agentic training</strong>, GLM-4.6V extends <strong>Model Context Protocol (MCP)</strong> with URL-based multimodal handling and end-to-end <strong>interleaved text–image output</strong> using a “Draft → Image Selection → Final Polish” workflow.</li><li><strong>RL for multimodal agents: </strong>Tool-calling behaviors are integrated into a unified <strong>RL objective</strong>, and a <strong>visual feedback loop</strong> (building on UI2Code^N) lets the model use rendered results to self-correct its code and actions, pushing toward self-improving multimodal agents.</li></ul><h3>Get 
Started Immediately</h3><ol><li><strong>Explore:</strong> Try <a href="https://cloud.siliconflow.com/me/playground/chat/17885302910">GLM-4.6V</a> in the SiliconFlow playground.</li><li><strong>Integrate:</strong> Use our OpenAI-compatible API. Explore the full API specifications in the <a href="https://docs.siliconflow.com/en/api-reference/chat-completions/chat-completions">SiliconFlow API documentation</a>.</li></ol><pre>import requests<br>url = &quot;https://api.siliconflow.com/v1/chat/completions&quot;<br>payload = {<br>    &quot;model&quot;: &quot;zai-org/GLM-4.6V&quot;,<br>    &quot;messages&quot;: [<br>        {<br>            &quot;content&quot;: [<br>                {<br>                    &quot;type&quot;: &quot;image_url&quot;,<br>                    &quot;image_url&quot;: {<br>                        &quot;detail&quot;: &quot;auto&quot;,<br>                        &quot;url&quot;: &quot;https://tse4.mm.bing.net/th/id/OIP.mDDGH4uc_a7tmLFLJvKXrQHaEo?rs=1&amp;pid=ImgDetMain&amp;o=7&amp;rm=3&quot;<br>                    }<br>                },<br>                {<br>                    &quot;type&quot;: &quot;text&quot;,<br>                    &quot;text&quot;: &quot;What is in the picture?&quot;<br>                }<br>            ],<br>            &quot;role&quot;: &quot;user&quot;<br>        }<br>    ],<br>    &quot;stream&quot;: True,<br>    &quot;temperature&quot;: 1<br>}<br>headers = {<br>    &quot;Authorization&quot;: &quot;Bearer &lt;token&gt;&quot;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;<br>}<br>response = requests.post(url, json=payload, headers=headers)<br>print(response.text)</pre><ul><li><a href="https://siliconflow.com/contact">Business or Sales Inquiries →</a></li><li><a href="https://discord.com/invite/siliconflow">Join our Discord community now →</a></li><li><a href="https://x.com/saborrolab">Follow us on X for the latest updates →</a></li><li><a 
href="https://cloud.siliconflow.com/models">Explore all available models on SiliconFlow →</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b638150246fc" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Z-Image-Turbo Now on SiliconFlow: Photorealistic & Bilingual Text Rendering]]></title>
            <link>https://medium.com/@SiliconFlowAI/z-image-turbo-now-on-siliconflow-photorealistic-bilingual-text-rendering-d557128562f8?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/d557128562f8</guid>
            <category><![CDATA[z-image]]></category>
            <category><![CDATA[z-image-turbo]]></category>
            <category><![CDATA[siliconflow]]></category>
            <category><![CDATA[tongyi-qianwen]]></category>
            <category><![CDATA[t2i]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Wed, 17 Dec 2025 14:16:48 GMT</pubDate>
            <atom:updated>2025-12-17T14:16:48.691Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pX04O_uYaIOzXjKomrYNwg.png" /></figure><p>Today, <a href="https://www.siliconflow.com/models/z-image-turbo">Z-Image-Turbo</a> — <a href="https://github.com/Tongyi-MAI/Z-Image">Alibaba Tongyi</a>’s latest lightweight 6B-parameter text-to-image model — is now available on SiliconFlow. Through systematic optimization and a Single-Stream Diffusion Transformer architecture, it delivers photorealistic image generation and bilingual text rendering on par with leading commercial models, proving that top-tier performance doesn’t require massive model sizes.</p><p>Whether you’re building creative tools, marketing assets, or visual AI applications, Z-Image-Turbo delivers the speed and precision to bring your workflow to the next level.</p><p>With SiliconFlow’s Z-Image-Turbo API, you can expect:</p><ul><li>Budget-Friendly Pricing: Z-Image-Turbo at just $0.005/image.</li><li>Extreme Efficiency: As a distilled model, it delivers top-tier performance in only 8 steps, matching or exceeding leading competitors.</li><li>Photorealistic &amp; Bilingual: Excels in both photorealistic image generation and accurate English &amp; Chinese text rendering, with robust adherence to complex instructions.</li><li>SOTA Performance: Powered by a Single-Stream Diffusion Transformer architecture, it achieves state-of-the-art results among open-source models on the <a href="http://aiarena.alibaba-inc.com/">Alibaba AI Arena</a> (Elo-based evaluation).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0NuFeC71IfUaQxwl.png" /></figure><h3>Key Capabilities &amp; Real-world Performance</h3><p>Unlike traditional foundation models that rely on massive parameters for quality or struggle with specific cultural nuances, Z-Image redefines efficiency and is designed to support:</p><ul><li><strong>Efficient Photorealistic Quality</strong></li></ul><p>Z-Image-Turbo excels at 
producing images with photography-level realism, demonstrating fine control over details, lighting, and textures. It balances high fidelity with strong aesthetic quality in composition and overall mood.</p><p>As shown in the examples below, the model handles complex visual phenomena with remarkable accuracy — from the intricate light refraction inside ice cubes, to lifelike human features, to the subtle sheen and flowing folds of silk fabric.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yEKwbRYZyhhSNXfxqCoUkA.png" /><figcaption>All images were generated using Z-Image-Turbo on the SiliconFlow platform</figcaption></figure><ul><li><strong>Excellent Bilingual Text Rendering</strong></li></ul><p>It can also accurately render English and Chinese text while preserving facial realism and overall aesthetic composition, with results comparable to top-tier closed-source models. In poster design, it demonstrates strong compositional skills and a good sense of typography. It can render high-quality text even in challenging scenarios with small font sizes, delivering designs that are both textually precise and visually compelling.</p><p>As shown in the posters generated with Z-Image-Turbo on the SiliconFlow platform, the model renders text with impressive clarity and style, delivering layouts that combine accurate typography with strong artistic aesthetics across editorial, realistic, and cartoon-like designs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_REJupBundo8dQDUqGcFlg.png" /></figure><ul><li><strong>Rich World Knowledge and Cultural Understanding</strong></li></ul><p>Z-Image possesses a vast understanding of world knowledge and diverse cultural concepts. 
This allows it to accurately generate a wide array of subjects, including famous landmarks, well-known characters, and specific real-world objects.</p><p>As demonstrated in our examples, the model captures cultural elements such as the costumes and atmosphere of the Venice Carnival, iconic objects like the Venetian gondola, as well as world-famous landmarks like the Eiffel Tower — all with impressive accuracy and stylistic fidelity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_REJupBundo8dQDUqGcFlg.png" /></figure><h3>Get Started Immediately</h3><ul><li><strong>Explore:</strong> Try <a href="https://www.siliconflow.com/models/z-image-turbo">Z-Image</a> in the <a href="https://cloud.siliconflow.com/me/playground/image/17885302909">SiliconFlow playground</a>.</li><li><strong>Integrate:</strong> Use our OpenAI-compatible API. Explore the full API specifications in the <a href="https://docs.siliconflow.com/en/api-reference/images/images-generations">SiliconFlow API documentation</a>.</li></ul><pre>import requests<br>url = &quot;https://api.siliconflow.com/v1/images/generations&quot;<br>payload = {<br>    &quot;model&quot;: &quot;Tongyi-MAI/Z-Image-Turbo&quot;,<br>    &quot;prompt&quot;: &quot;A small, adorable green frog with big round eyes gently swims through the clear blue ocean water. Sunlight beams down from above, creating shimmering ripples on the frog’s skin and the sandy ocean floor. The frog paddles its tiny legs gracefully, leaving soft trails of bubbles behind. Colorful tropical fish and coral reefs surround it, adding a vibrant and lively atmosphere. 
The overall style is bright, whimsical, and cinematic, with smooth, fluid motion and a playful, heartwarming mood.&quot;,<br>    &quot;image_size&quot;: &quot;1024x1024&quot;,<br>    &quot;seed&quot;: 1,<br>}<br>headers = {<br>    &quot;Authorization&quot;: &quot;Bearer &lt;token&gt;&quot;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;<br>}<br>response = requests.post(url, json=payload, headers=headers)<br>print(response.text)</pre><ul><li><a href="https://www.siliconflow.com/contact">Business or Sales Inquiries →</a></li><li><a href="https://discord.com/invite/7Ey3dVNFpT">Join our Discord community now →</a></li><li><a href="https://x.com/SiliconFlowAI">Follow us on X for the latest updates →</a></li><li><a href="https://cloud.siliconflow.com/models">Explore all available models on SiliconFlow →</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d557128562f8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[DeepSeek-V3.2 Now on SiliconFlow: Reasoning-first model built for agents]]></title>
            <link>https://medium.com/@SiliconFlowAI/deepseek-v3-2-now-on-siliconflow-reasoning-first-model-built-for-agents-efbb333fec8a?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/efbb333fec8a</guid>
            <category><![CDATA[siliconflow]]></category>
            <category><![CDATA[deepseek]]></category>
            <category><![CDATA[deepseek-v3]]></category>
            <category><![CDATA[reasoning-model]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Tue, 16 Dec 2025 14:36:39 GMT</pubDate>
            <atom:updated>2025-12-16T14:36:39.704Z</atom:updated>
            <content:encoded><![CDATA[<p>TL;DR: DeepSeek-V3.2 (official version of V3.2-Exp) is now live on SiliconFlow. As a reasoning-first model built for agents, it combines high efficiency with GPT-5-level reasoning performance and a 164K context window. It also features tool-use capabilities in thinking mode, validated across 85K+ complex instructions and 1,800+ environments. Start building today with SiliconFlow’s API to supercharge your agentic workflows.</p><p>We are thrilled to unlock access to DeepSeek’s latest model on SiliconFlow, <a href="https://www.siliconflow.com/models/deepseek-v3-2">DeepSeek-V3.2</a>, a new series that harmonizes computational efficiency with superior reasoning and agentic performance. As the first DeepSeek model to integrate thinking directly into tool-use, DeepSeek-V3.2 delivers <a href="https://openai.com/gpt-5/">GPT-5 </a>level reasoning with significantly shorter outputs.</p><p>Meanwhile, DeepSeek-V3.2-Speciale pushes open-source boundaries of theorem proving and coding to rival <a href="https://deepmind.google/models/gemini/pro/">Gemini 3 Pro</a>. 
Together, they set a new benchmark for developers building next-generation AI agents.</p><p>Now, through SiliconFlow’s DeepSeek-V3.2 API, you can expect:</p><ul><li>Cost-effective Pricing: DeepSeek-V3.2 at $0.27/M tokens (input) and $0.42/M tokens (output); DeepSeek-V3.2-Speciale is coming soon, so stay tuned for first-hand updates.</li><li>164K Context Window: Perfect for long documents, complex multi-turn conversations, and extended agentic tasks.</li><li>Seamless Integration: Instantly deploy via SiliconFlow’s OpenAI-compatible API, or plug into your existing stack through Claude Code, Gen-CLI, and Cline.</li></ul><p>Whether you’re building agents, coding assistants, or complex reasoning pipelines, SiliconFlow’s DeepSeek-V3.2 API delivers the performance you need at a fraction of the expected cost and latency.</p><h3>Why it matters</h3><p>For developers building agents, multi-step reasoning pipelines, or any AI system that needs to think and act, the DeepSeek-V3.2 series finally delivers the combination the industry has been waiting for: frontier-grade reasoning, integrated tool-use during thinking, and real-world efficiency.</p><p><strong>World-Leading Reasoning Capabilities</strong></p><p><strong>DeepSeek-V3.2: The Efficient “Daily Driver” for Agents</strong></p><p>Engineered to strike the perfect balance between reasoning capabilities and output length, DeepSeek-V3.2 is your go-to choice for production workflows, such as advanced Q&amp;A and general agent tasks.</p><ul><li>Performance: Delivers reasoning capabilities on par with GPT-5.</li><li>Efficiency: Compared to <a href="https://www.siliconflow.com/models/kimi-k2-thinking"><em>Kimi-K2-Thinking</em></a>, V3.2 has significantly shorter output lengths, translating to lower computational overhead and reduced overall generation time.</li></ul><p><strong>DeepSeek-V3.2-Speciale: Maxed-Out Reasoning Capabilities (Research Preview)</strong></p><p>As the enhanced long-thinking variant of V3.2, V3.2-Speciale aims to 
push the boundaries of open-source reasoning capabilities, integrating the theorem-proving capabilities of DeepSeek-Math-V2.</p><ul><li>Gold-Medal Performance: V3.2-Speciale attains gold-level results in IMO, CMO, ICPC World Finals &amp; IOI 2025.</li><li>Benchmarks: It excels in complex instruction following, rigorous mathematical reasoning, and logical verification, effectively rivaling Gemini 3 Pro on mainstream reasoning leaderboards.</li></ul><p><strong>Thinking in Tool-Use</strong></p><p>DeepSeek-V3.2 breaks the barrier between “reasoning” and “acting.” Unlike previous versions, where tool usage was restricted during the thinking process, DeepSeek-V3.2 is the first to <strong>seamlessly integrate thinking directly into tool-use</strong>, supporting tool invocation in both Thinking and Non-Thinking modes.</p><p>To deliver this level of agentic reliability, DeepSeek introduces a massive-scale training synthesis method:</p><ul><li><strong>Robust Generalization:</strong> The model was forged through <strong>“hard-to-solve, easy-to-verify”</strong> reinforcement learning tasks.</li><li><strong>Extensive Coverage:</strong> Training spanned <strong>1,800+ distinct environments</strong> and over <strong>85,000 complex instructions</strong>, significantly enhancing the model’s generalization and instruction-following capability in the agent context.</li></ul><h3>What makes it powerful</h3><p>The DeepSeek-V3.2 series’ performance is enabled by three core technical breakthroughs:</p><ul><li><strong>DeepSeek Sparse Attention (DSA):</strong></li></ul><p>To tackle the challenge of long-context processing, the model introduces <strong>DeepSeek Sparse Attention (DSA)</strong>. 
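The general idea behind this family of mechanisms can be sketched in a few lines: score every key cheaply, keep only the top-k positions, and run softmax attention over that subset instead of the full sequence. The toy single-query sketch below illustrates generic top-k sparse attention only; it is not DeepSeek's actual learned indexer:

```python
import math

def topk_sparse_attention(q, keys, values, k=4):
    """Toy top-k sparse attention for one query vector.

    Scores all keys with a plain dot product, keeps the k highest-scoring
    positions, and attends over only that subset. Illustrative sketch of
    the sparse-attention idea, not DeepSeek's DSA implementation.
    """
    # Cheap relevance score for every key position.
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    # Indices of the k highest-scoring positions.
    keep = sorted(range(len(keys)), key=lambda i: scores[i])[-k:]
    # Scaled softmax over the selected subset only.
    scaled = [scores[i] / math.sqrt(len(q)) for i in keep]
    m = max(scaled)
    w = [math.exp(s - m) for s in scaled]
    z = sum(w)
    w = [x / z for x in w]
    # Weighted sum of the selected value vectors.
    return [sum(wi * values[i][d] for wi, i in zip(w, keep))
            for d in range(len(values[0]))]
```

Because the softmax runs over k positions rather than the whole sequence, the per-query cost no longer grows with context length, which is the property that makes long-context inference cheaper.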
This efficient attention mechanism substantially reduces computational complexity without compromising performance, specifically optimized for long-context scenarios.</p><ul><li><strong>Scalable Reinforcement Learning:</strong></li></ul><p>DeepSeek-V3.2 leverages a robust Reinforcement Learning (RL) protocol combined with scaled post-training compute. This advanced training framework is the key driver behind the model’s exceptional reasoning capabilities.</p><ul><li><strong>Large-Scale Agentic Task Synthesis Pipeline:</strong></li></ul><p>DeepSeek has revolutionized agent capability through a novel <strong>Large-Scale Agentic Task Synthesis Pipeline</strong>. By systematically generating training data at scale, the model integrates reasoning directly into tool-use scenarios. This results in superior compliance and generalization, ensuring that your agents can reliably navigate <strong>complex, multi-step interactive environments</strong> with precision.</p><h3>Developer-Ready Integration</h3><p>Beyond DeepSeek-V3.2’s industry-leading agentic performance, SiliconFlow delivers instant compatibility with your existing development ecosystem:</p><ul><li><strong>OpenAI-Compatible Tools</strong>: Seamless integration with <a href="https://marketplace.visualstudio.com/items?itemName=saoudrizwan.claude-dev">Cline</a>, <a href="https://github.com/QwenLM/Qwen-Agent">Qwen Code</a>, <a href="https://github.com/generative-ai-cli/gen-cli">Gen-CLI</a>, and other standard development environments — just plug in your SiliconFlow API key.</li><li><strong>Anthropic-Compatible API</strong>: Works with <a href="https://claude.com/code">Claude Code</a> and any Anthropic-compatible tools for code reviews, debugging, and architectural refactoring.</li><li><strong>Platform Integrations</strong>: Ready-to-use in <a href="https://docs.siliconflow.com/docs/dify">Dify</a>, <a href="https://chathub.gg">ChatHub</a>, <a href="https://chatboxai.app">Chatbox</a>, <a 
href="https://sider.ai">Sider</a>, <a href="https://github.com/InternLM/MindSearch">MindSearch</a>, <a href="https://github.com/eosphoros-ai/DB-GPT">DB-GPT</a>, and also available through <a href="https://openrouter.ai">OpenRouter</a>.</li></ul><p>With powerful models, seamless integrations, and competitive pricing, SiliconFlow transforms how you build — letting you ship faster and scale smarter.</p><h3>Get Started Immediately</h3><ol><li><strong>Explore:</strong> Try <a href="https://www.siliconflow.com/models/deepseek-v3-2">DeepSeek-V3.2</a> in the SiliconFlow Playground.</li><li><strong>Integrate:</strong> Use our OpenAI-compatible API. Explore the full API specifications in the <a href="https://docs.siliconflow.com">SiliconFlow API documentation</a>.</li></ol><pre>import requests<br>url = &quot;https://api.siliconflow.com/v1/chat/completions&quot;<br>payload = {<br>    &quot;model&quot;: &quot;deepseek-ai/DeepSeek-V3.2&quot;,<br>    &quot;messages&quot;: [<br>        {<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: &quot;An island near the sea, with seagulls, the moon shining over the sea, a lighthouse, boats in the background, fish flying over the sea&quot;<br>        }<br>    ],<br>    &quot;stream&quot;: True,<br>    &quot;max_tokens&quot;: 4096,<br>    &quot;enable_thinking&quot;: False,<br>    &quot;thinking_budget&quot;: 4096,<br>    &quot;min_p&quot;: 0,<br>    &quot;stop&quot;: &quot;1&quot;,<br>    &quot;temperature&quot;: 0.7,<br>    &quot;top_p&quot;: 0.7,<br>    &quot;top_k&quot;: 50,<br>    &quot;frequency_penalty&quot;: 0.5,<br>    &quot;n&quot;: 1,<br>    &quot;response_format&quot;: { &quot;type&quot;: &quot;json_object&quot; },<br>    &quot;tools&quot;: [<br>        {<br>            &quot;type&quot;: &quot;function&quot;,<br>            &quot;function&quot;: {<br>                &quot;name&quot;: &quot;&lt;string&gt;&quot;,<br>                &quot;description&quot;: &quot;&lt;string&gt;&quot;,<br>                &quot;parameters&quot;: {},<br>                &quot;strict&quot;: False<br>            }<br>        }<br>    ]<br>}<br>headers = {<br>    &quot;Authorization&quot;: &quot;Bearer &lt;token&gt;&quot;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;<br>}<br>response = requests.post(url, json=payload, headers=headers)<br>print(response.text)</pre><ul><li><a href="https://siliconflow.com/contact">Business or Sales Inquiries →</a></li><li><a href="https://discord.com/invite/siliconflow">Join our Discord community now →</a></li><li><a href="https://x.com/SiliconFlowAI">Follow us on X for the latest updates →</a></li><li><a href="https://cloud.siliconflow.com/models">Explore all available models on SiliconFlow →</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=efbb333fec8a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kimi K2 Thinking Now on SiliconFlow: Thinking Agent That Reasons and Acts]]></title>
            <link>https://medium.com/@SiliconFlowAI/kimi-k2-thinking-now-on-siliconflow-thinking-agent-that-reasons-and-acts-f70fd43da53a?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/f70fd43da53a</guid>
            <category><![CDATA[kimi-k2]]></category>
            <category><![CDATA[moonshot-thinking]]></category>
            <category><![CDATA[kimi-k2-thinking]]></category>
            <category><![CDATA[kimi]]></category>
            <category><![CDATA[siliconflow]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Mon, 24 Nov 2025 14:03:00 GMT</pubDate>
            <atom:updated>2025-11-24T14:03:51.394Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hCA4gN1gfB24D6gb2LgL1Q.png" /></figure><blockquote><strong><em>TL;DR:</em></strong><em> </em><strong><em>Kimi K2 Thinking</em></strong><em>, Moonshot AI’s latest and most advanced open-source thinking model, is now available on SiliconFlow. Designed as a reasoning agent, it thinks step by step and can execute up to</em><strong><em> 200–300</em></strong><em> sequential tool calls without human intervention, reasoning coherently across hundreds of steps to solve complex problems. It excels in </em><strong><em>reasoning</em></strong><em>, </em><strong><em>agentic search</em></strong><em>, </em><strong><em>coding</em></strong><em>, </em><strong><em>writing</em></strong><em>, and </em><strong><em>general capabilities</em></strong><em>. Get started with Kimi K2 Thinking on SiliconFlow with OpenAI/Anthropic-compatible APIs for seamless integration into your agents and workflows.</em></blockquote><p>We’re excited to welcome <a href="https://www.siliconflow.com/models/kimi-k2-thinking">Kimi K2 Thinking</a>, <a href="https://www.moonshot.ai/">Moonshot AI</a>’s most advanced open-source thinking model, now available on SiliconFlow. Unlike traditional reasoning models that only think, it reasons <strong>and acts</strong>, autonomously chaining up to 300 tool calls — search, code, data tools — to solve complex problems end-to-end. 
This marks Moonshot’s breakthrough in test-time scaling: simultaneously extending both reasoning depth and agentic capabilities to unlock new levels of problem-solving power.</p><p>With SiliconFlow’s Kimi K2 Thinking API, you can expect:</p><ul><li><strong>Budget-friendly Pricing</strong>: Kimi K2 Thinking at $1.1/M tokens (input) and $4.5/M tokens (output).</li><li><strong>262K Context Window:</strong> Perfect for long documents, complex reasoning, and extended agentic tasks.</li><li><strong>Outperforms GPT-5 &amp; Claude Sonnet 4.5</strong> across key reasoning, coding, and agent benchmarks.</li></ul><p>Whether you’re building reasoning agents, coding copilots, or research assistants, Kimi K2 Thinking is now accessible through SiliconFlow’s OpenAI/Anthropic-compatible API — ready to plug into your existing workflows.</p><h3>Key Features</h3><p>Kimi K2 Thinking, now available on SiliconFlow, features the following key capabilities:</p><ul><li><strong>Deep Thinking &amp; Tool Orchestration</strong>: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift. For example, when building interactive visual simulations, it coordinates reasoning with tool calls to convert high-level instructions into runnable code — greatly improving automation and reliability in complex development tasks.</li><li><strong>Production-Ready Speed</strong>: Native INT4 quantization achieves 2x inference speed with no quality loss — important when you’re running tasks that involve hundreds of operations.</li><li><strong>Reliable Over Long Sessions</strong>: Handles <strong>200–300</strong> consecutive actions through adaptive reasoning cycles: <strong><em>Plan → Reason → Execute → Adapt → Refine</em></strong>. 
Unlike typical models that lose focus after 30–50 steps, it decomposes complex problems into clear subtasks and completes end-to-end workflows.</li><li><strong>Strong General Writing:</strong> Handles creative, analytical, and personalized writing with coherent logic, vivid detail, and empathetic tone — adapting smoothly across styles without losing quality.</li></ul><h3>Benchmark Performance</h3><p>Kimi K2 Thinking sets new records across benchmarks assessing reasoning, coding, and agent capabilities, outperforming leading models like <a href="https://openai.com/gpt-5/">GPT-5</a> and <a href="https://www.anthropic.com/news/claude-sonnet-4-5">Claude Sonnet 4.5</a>:</p><ul><li><strong>Agentic Reasoning:</strong> Achieves <strong>44.9% on HLE</strong>, a rigorous benchmark of thousands of expert-level questions across 100+ subjects.</li><li><strong>Agentic Coding:</strong> Scores <strong>71.3% on SWE-Bench Verified</strong> and <strong>61.1% on SWE-Multilingual</strong>, showcasing strong generalization across programming languages and agent scaffolds. 
Also delivers notable improvements on HTML, React, and component-intensive front-end tasks.</li><li><strong>Agentic Search and Browsing:</strong> Reaches <strong>60.2% on BrowseComp, </strong>double the human baseline of 29.2%.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NnnySuexaLlowNnhEOFNPQ.png" /></figure><h3>Developer-Ready Integration</h3><p>Beyond Kimi K2 Thinking’s industry-leading performance, SiliconFlow delivers instant compatibility with your existing development ecosystem:</p><ul><li><strong>OpenAI-Compatible Tools</strong>: Seamless integration with <a href="https://marketplace.visualstudio.com/items?itemName=saoudrizwan.claude-dev">Cline</a>, <a href="https://github.com/topics/qwen-code">Qwen Code</a>, <a href="https://github.com/gen-cli/gen-cli/">Gen-CLI</a>, and other standard development environments — just plug in your SiliconFlow API key.</li><li><strong>Anthropic-Compatible API</strong>: Works with <a href="https://www.claude.com/product/claude-code">Claude Code</a> and any Anthropic-compatible tools for code reviews, debugging, and architectural refactoring.</li><li><strong>Platform Integrations</strong>: Ready-to-use in <a href="https://docs.siliconflow.com/en/usercases/use-siliconcloud-in-dify">Dify</a>, <a href="https://chathub.gg/">ChatHub</a>, <a href="https://chatboxai.app/">Chatbox</a>, <a href="https://sider.ai/en/">Sider</a>, <a href="https://github.com/InternLM/MindSearch/blob/main/README.md">MindSearch</a>, <a href="https://github.com/eosphoros-ai/DB-GPT">DB-GPT</a>, and also available through <a href="https://openrouter.ai/provider/siliconflow">OpenRouter</a>.</li></ul><p>With powerful models, seamless integrations, and cost-effective pricing, SiliconFlow transforms how you build — letting you ship faster and scale smarter.</p><h3>Get Started Immediately</h3><ol><li><strong>Explore:</strong> Try <a href="https://www.siliconflow.com/models/kimi-k2-thinking">Kimi K2 Thinking</a> in the <a 
href="https://cloud.siliconflow.com/me/playground/chat/17885302907">SiliconFlow Playground</a>.</li><li><strong>Integrate:</strong> Use our OpenAI-compatible API. Explore the full API specifications in the <a href="https://docs.siliconflow.com/en/api-reference/chat-completions/chat-completions">SiliconFlow API documentation</a>.</li></ol><pre>import requests<br><br>url = &quot;https://api.siliconflow.com/v1/chat/completions&quot;<br><br>payload = {<br>    &quot;model&quot;: &quot;moonshotai/Kimi-K2-Thinking&quot;,<br>    &quot;messages&quot;: [<br>        {<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: &quot;Please provide information about a person in the following JSON format: {   \&quot;name\&quot;: \&quot;string\&quot;,   \&quot;age\&quot;: \&quot;number\&quot;,   \&quot;occupation\&quot;: \&quot;string\&quot;,   \&quot;hobbies\&quot;: [\&quot;string\&quot;] }  Generate a realistic example.&quot;<br>        }<br>    ],<br>    &quot;max_tokens&quot;: 4096,<br>    &quot;temperature&quot;: 0.7,<br>    &quot;response_format&quot;: {&quot;type&quot;: &quot;json_object&quot;}<br>}<br>headers = {<br>    &quot;Authorization&quot;: &quot;Bearer &lt;token&gt;&quot;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;<br>}<br><br>response = requests.post(url, json=payload, headers=headers)<br><br>print(response.text)</pre><ul><li><a href="https://www.siliconflow.com/contact">Business or Sales Inquiries →</a></li><li><a href="https://discord.com/invite/7Ey3dVNFpT">Join our Discord community now →</a></li><li><a href="https://x.com/SiliconFlowAI">Follow us on X for the latest updates →</a></li><li><a href="https://cloud.siliconflow.com/models">Explore all available models on SiliconFlow →</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f70fd43da53a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[8 Key Insights on AI Infra from the co-founder of SiliconFlow]]></title>
            <link>https://medium.com/@SiliconFlowAI/8-key-insights-on-ai-infra-from-the-co-founder-of-siliconflow-366fbd8964b1?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/366fbd8964b1</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[ama]]></category>
            <category><![CDATA[ai-infrastructure]]></category>
            <category><![CDATA[siliconflow]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Wed, 12 Nov 2025 14:11:02 GMT</pubDate>
            <atom:updated>2025-11-12T14:11:02.079Z</atom:updated>
            <content:encoded><![CDATA[<p>Pan Yang, co-founder of SiliconFlow, delivered a speech entitled “AI Infra: For Whom and Why?” at the “Real-Time AI Infra Session” of Convo AI &amp; RTE 2025. His talk offered 8 core insights into the field of AI Infra.</p><h3>TL;DR</h3><p>8 key insights from Pan Yang’s speech on AI Infrastructure:</p><ol><li><strong>Inference First</strong> — The shift toward inference computing is driven by exponential growth in AI customers and computation needs.</li><li><strong>Open-Source Opportunities</strong> — Open-source models are catching up with a 3–5 month gap, with breakthrough potential in multimodal areas.</li><li><strong>The Calling for MaaS</strong> — One-stop platforms providing single-API access to multiple models.</li><li><strong>Three Major MaaS Challenges</strong> — Availability issues, performance variations, and the cost-reduction illusion.</li><li><strong>Do the Difficult but Right Thing</strong> — SiliconFlow’s commitment to delivering faster, better, and more cost-effective AI Infra.</li><li><strong>Four AI Scenarios for 2025</strong> — Content generation, Agentic AI (Year of the Agent), Coding, and Multimodal applications.</li><li><strong>AI is Work, Not Tool</strong> — Jensen Huang’s paradigm shift emphasizing building for Agents rather than humans.</li><li><strong>AI Infra — No Bubble</strong> — Massive unfulfilled demand shows there is no bubble, only a supply shortage.</li></ol><h3>Inference first</h3><p>In 2023, SiliconFlow predicted that “in the future, the vast majority of computing power will be used for inference, rather than training.” 
This trend is becoming a reality in 2025, mainly driven by two factors: the exponential growth in the number and usage of AI customers, and the exponential growth in the amount of computation required to complete a single task.</p><h3>The opportunities of open-source models</h3><p>Open-source models are rapidly catching up with closed-source models at a dynamic gap of 3–5 months. Currently, the open-source ecosystem for LLMs is close to state-of-the-art (SOTA), while for multimodal models such as image, audio, and video, there are still significant opportunities for breakthroughs.</p><h3>The calling for Model as a Service (MaaS)</h3><p>This year has brought frequent model updates, diverse specifications, varied architectures, and multiple modalities; no single company can independently deploy and maintain all models. Therefore, a one-stop MaaS platform capable of integrating various models has become an indispensable entry point for developers. This is precisely the direction that SiliconFlow continues to focus on, allowing users to quickly experience various models with just one API.</p><h3>MaaS platforms currently face three major challenges</h3><ul><li><strong>Availability and reliability challenges:</strong> Providers have experienced issues such as insufficient resources and 429/503 errors.</li><li><strong>Performance and quality vary significantly</strong>: The same open-source model provided by different service providers exhibits significant differences in actual performance, reflecting varying levels of model quantization and optimization, which directly affect the model’s final capabilities.</li><li><strong>The illusion of decreasing costs</strong>: Although the cost of a single model may decrease tenfold annually, users always seek the latest and most powerful state-of-the-art (SOTA) models, while the invocation prices of these top-tier models remain relatively stable. 
Meanwhile, the number of tokens consumed to complete a task increases exponentially, resulting in no significant decrease in actual application costs.</li></ul><h3>Do the difficult but right thing</h3><p>SiliconFlow has always been deeply rooted in the AI Infra field, understands the challenges involved, and remains committed to delivering faster, better-performing, and lower-cost AI Infra services.</p><h3>Four high-consensus AI scenarios for 2025</h3><ul><li><strong>Content generation</strong>: Whether generating an article, providing customer service via chatbot, or building a knowledge base, everything revolves around language.</li><li><strong>Agentic AI</strong>: This year has been called the Year of the Agent. Understandings of the Agent concept vary and continue to evolve; Manus, for example, has worked hard to shape how Agents are defined.</li><li><strong>Coding</strong>: The mainstream models released this year aligned first with Agent and Coding capabilities. The industry generally agrees that Agent and Coding are the areas that consume the most tokens.</li><li><strong>Multimodal</strong>: Especially in the Chinese Internet environment, consumption of multimodal models far exceeds that of other forms.</li></ul><h3>“AI is Work, Not Tool”</h3><p>Jensen Huang proposed that “AI is Work, Not Tool”. AI will proactively operate tools to complete tasks, rather than passively responding to instructions. This triggers a paradigm shift: building for agents, rather than for humans. Humans will increasingly delegate tasks to agents, operating less directly on software interfaces.</p><h3>AI Infra — No Bubble</h3><p>The AI infrastructure industry is not in a bubble; on the contrary, supply is far from sufficient. 
The world’s top technology companies have committed to purchasing hundreds of billions of dollars’ worth of infrastructure that has not yet been delivered. The current bottlenecks are chip production capacity and energy supply. Demand far exceeds supply capacity, demonstrating the market’s authenticity and enormous potential.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ms4PjANr96FexNLN" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=366fbd8964b1" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[MiniMax-M2 Now on SiliconFlow: Frontier-Style Coding and Agentic Intelligence]]></title>
            <link>https://medium.com/@SiliconFlowAI/minimax-m2-now-on-siliconflow-frontier-style-coding-and-agentic-intelligence-3d83f63b41b2?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/3d83f63b41b2</guid>
            <category><![CDATA[minimax-m2]]></category>
            <category><![CDATA[minimax]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[siliconflow]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Tue, 11 Nov 2025 09:14:07 GMT</pubDate>
            <atom:updated>2025-11-11T09:14:07.526Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><strong><em>TL;DR: MiniMax-M2</em></strong><em>, the latest open-source MoE model from MiniMax AI, is now available on </em><strong><em>SiliconFlow</em></strong><em>. With </em><strong><em>230B total parameters and 10B active</em></strong><em>, it delivers </em><strong><em>frontier-level reasoning, coding, and agentic performance</em></strong><em> in a compact, efficient form. M2 strikes the perfect balance between </em><strong><em>intelligence, speed, and cost</em></strong><em>, achieving top benchmark results while offering </em><strong><em>fast inference and affordable pricing</em></strong><em> through SiliconFlow’s API. </em><strong><em>Try MiniMax-M2 on SiliconFlow</em></strong><em> — explore frontier-grade intelligence at a fraction of the cost.</em></blockquote><p>SiliconFlow is excited to introduce <strong>MiniMax-M2</strong>, a compact yet powerful model designed for advanced coding and agentic workflows, now available on our platform. This efficient MoE model (230B total parameters with 10B active) delivers strong performance in coding and agentic tasks while maintaining robust general intelligence. 
With 10B active parameters, it delivers advanced reasoning and tool-use capabilities comparable to larger models.</p><p>Through SiliconFlow’s MiniMax-M2 API, you can expect:</p><ul><li><strong>Budget-friendly Pricing</strong>: MiniMax-M2 $0.3/M tokens (input) and $1.2/M tokens (output).</li><li><strong>192K Context Window: </strong>Perfect for long documents, complex reasoning, and extended agentic tasks.</li><li><strong>Proven Real-World Performance:</strong> Ranked <strong>#1 among open-source models</strong> on <a href="https://artificialanalysis.ai/"><em>Artificial Analysis</em></a> benchmarks, excelling in math, science, instruction following, coding, and agentic tasks.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cV6tjK316yMSRf19Gkxk0Q.jpeg" /></figure><h3>Key Features &amp; Benchmark Performance</h3><p>In today’s fast-moving era of intelligent Agents, most teams still face a familiar dilemma: no single model truly balances <strong>performance, cost, and speed</strong>. Top-tier models deliver frontier-level results but are <strong>expensive and slow</strong>, while lighter alternatives are <strong>affordable yet limited</strong> in reasoning depth and responsiveness.</p><p>MiniMax M2 is positioned to break this trade-off, delivering frontier-level coding, tool-use, and reasoning with fast inference and exceptional cost efficiency. Based on SiliconFlow’s API pricing, running M2 costs around <strong>92% less than </strong><a href="https://www.anthropic.com/news/claude-sonnet-4-5"><strong>Claude Sonnet 4.5</strong></a> — while delivering <strong>comparable coding and reasoning capabilities</strong>.</p><ul><li><strong>Superior Intelligence: </strong>According to benchmarks from <a href="https://artificialanalysis.ai/">Artificial Analysis</a>, MiniMax-M2 demonstrates highly competitive general intelligence across mathematics, science, instruction following, coding, and agentic tool use. 
Its composite score <strong>ranks #1 among open-source models globally</strong>.</li><li><strong>Advanced Coding: </strong>Engineered for end-to-end developer workflows, MiniMax-M2 excels at multi-file edits, code-run-fix loops, and test-validated repairs. Strong performance on Terminal-Bench and (Multi-)SWE-Bench–style tasks demonstrates practical effectiveness in terminals, IDEs, and CI across languages.</li><li><strong>Agent Performance: </strong>Plans and executes long-horizon toolchains across shell, browser, retrieval, and code runners. In BrowseComp-style evaluations, it consistently locates hard-to-surface sources, keeps evidence traceable, and gracefully recovers from flaky steps.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8N57mav4iU7ns_PS" /></figure><h3>Use SiliconFlow’s MiniMax-M2 API</h3><p>Let’s take a look at <strong>MiniMax-M2 in action</strong> — running on <strong>SiliconFlow’s API</strong> via <strong>Claude Code</strong>, tackling a real-world coding task:</p><p><em>“Create a space-themed brick breaker game in React using HTML5 Canvas. Spaceship paddle moves with arrow keys, glowing ball bounces to destroy alien bricks. Include dark starry background, 5 rows of colorful bricks, score display, and 3 lives. Game over when lives run out, win when all bricks cleared. 
Add simple particle effects when bricks break and use Vite for setup.”</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*g4DaL9KqqgsCMEeR" /></figure><h3>Claude Code</h3><p>Now, you can easily integrate <strong>SiliconFlow’s MiniMax-M2 API into </strong><a href="https://www.anthropic.com/claude-code"><strong>Claude Code</strong></a>.</p><h4>Step 1: Get Your SiliconFlow API Key</h4><ol><li>Log in to your SiliconFlow dashboard.</li><li>Navigate to the API Keys section.</li><li>Generate a new API key for MiniMax-M2 access.</li><li>Copy and secure your API key.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bFHKdtA8IZvyEb3J" /></figure><h4>Step 2: Configure Environment Variables</h4><p>Open your terminal and set the following environment variables:</p><pre>export ANTHROPIC_BASE_URL=&quot;https://api.siliconflow.com/&quot;  # API root; Claude Code appends the Anthropic-compatible paths itself<br>export ANTHROPIC_MODEL=&quot;MiniMaxAI/MiniMax-M2&quot;  # You can modify this to use other models as needed<br>export ANTHROPIC_API_KEY=&quot;YOUR_SILICONFLOW_API_KEY&quot; # Please replace with your actual API Key</pre><h4>Step 3: Start Using Claude Code with MiniMax-M2</h4><p>Navigate to your project directory and launch Claude Code:</p><pre>cd your-project-directory<br>claude</pre><p>Claude Code will now use MiniMax-M2 via SiliconFlow’s API service for all your coding assistance needs!</p><p>What’s more, you can also access SiliconFlow’s MiniMax-M2 model through Gen-CLI and Cline.</p><h3>Gen-CLI</h3><p><a href="https://github.com/gen-cli/gen-cli/">Gen-CLI</a> is based on the open-source Gemini-CLI and is now available on GitHub. 
Install using the following steps:</p><ol><li>Ensure your system has Node.js 18+ installed.</li><li>Set the API key environment variable:</li></ol><pre>export SILICONFLOW_API_KEY=&quot;YOUR_API_KEY&quot;</pre><p>Run Gen-CLI:</p><p>Via npx:</p><pre>npx https://github.com/gen-cli/gen-cli</pre><p>Or install via npm:</p><pre>npm install -g @gen-cli/gen-cli<br>gen</pre><h3>Get Started Immediately</h3><ol><li><strong>Explore:</strong> Try <a href="https://cloud.siliconflow.com/models?target=MiniMaxAI/MiniMax-M2">MiniMax-M2</a> in the <a href="https://cloud.siliconflow.com/me/models">SiliconFlow Playground</a>.</li><li><strong>Integrate:</strong> Use our OpenAI-compatible API. Explore the full API specifications in the <a href="https://docs.siliconflow.com/en/userguide/capabilities/vision">SiliconFlow API documentation</a>.</li></ol><pre>import requests<br><br>url = &quot;https://api.siliconflow.com/v1/chat/completions&quot;<br><br>payload = {<br>    &quot;model&quot;: &quot;MiniMaxAI/MiniMax-M2&quot;,<br>    &quot;messages&quot;: [<br>        {<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: &quot;Please provide information about a person in the following JSON format:&quot;<br>        }<br>    ],<br>    &quot;max_tokens&quot;: 4096,<br>    &quot;stream&quot;: True,<br>    &quot;enable_thinking&quot;: False,<br>    &quot;temperature&quot;: 0.1,<br>    &quot;response_format&quot;: {&quot;type&quot;: &quot;json_object&quot;}<br>}<br>headers = {<br>    &quot;Authorization&quot;: &quot;Bearer &lt;token&gt;&quot;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;<br>}<br><br>response = requests.request(&quot;POST&quot;, url, json=payload, headers=headers)<br><br>print(response.text)</pre><p><a href="https://www.siliconflow.com/contact">Business or Sales Inquiries →</a></p><p><a href="https://discord.com/invite/7Ey3dVNFpT">Join our Discord community now →</a></p><p><a href="https://x.com/SiliconFlowAI">Follow us on X for the 
latest updates →</a></p><p><a href="https://cloud.siliconflow.com/models">Explore all available models on SiliconFlow →</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3d83f63b41b2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OneDiff 1.0 is out!]]></title>
            <link>https://medium.com/@SiliconFlowAI/onediff-1-0-is-out-470d8e46fffe?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/470d8e46fffe</guid>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[stable-video-diffusion]]></category>
            <category><![CDATA[onediff]]></category>
            <category><![CDATA[inference-engine]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Fri, 19 Apr 2024 09:00:31 GMT</pubDate>
            <atom:updated>2024-04-19T09:00:31.166Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y2b-5qCE17X-4BVm" /><figcaption>(With OneDiff, an RTX 3090 can even surpass the performance of an A100 GPU, helping save on A100 costs.)</figcaption></figure><p><a href="https://github.com/siliconflow/onediff/releases/tag/1.0.0"><strong>OneDiff 1.0</strong></a> accelerates Stable Diffusion and Stable Video Diffusion models (UNet/VAE/CLIP based). We have received a lot of support and feedback from the community (<a href="https://github.com/siliconflow/onediff/wiki">https://github.com/siliconflow/onediff/wiki</a>); big thanks!</p><p>The upcoming version 2.0 will focus on DiT/Sora-like models.</p><p>OneDiff 1.0 mainly covers the issues in milestone <a href="https://github.com/siliconflow/onediff/milestone/2?closed=1">0.13</a>, which include the following new features and several bug fixes:</p><ul><li><a href="https://github.com/siliconflow/OneDiffGenMetrics">OneDiff Quality Evaluation</a></li><li>Reuse compiled graph</li><li><a href="https://github.com/siliconflow/onediff/issues/703">Refine support for Playground v2.5</a></li><li>Support ComfyUI-AnimateDiff-Evolved</li><li><a href="https://github.com/siliconflow/onediff/tree/main/onediff_comfy_nodes/modules/oneflow/hijack_ipadapter_plus">Support ComfyUI_IPAdapter_plus</a></li><li><a href="https://github.com/siliconflow/onediff/pull/659">Support Stable Cascade</a></li><li>Improvements: <a href="https://github.com/siliconflow/onediff/issues/667">improved VAE performance</a></li><li>Quantize tools for the enterprise edition: <a href="https://github.com/siliconflow/onediff/tree/main/src/onediff/quantization">quantization tools</a> and the <a href="https://github.com/siliconflow/onediff/blob/main/README_ENTERPRISE.md#onediff-enterprise">OneDiff Enterprise README</a></li><li>SD-WebUI support for offline quantized models</li></ul><h3>State-of-the-art performance</h3><h4>SDXL E2E time</h4><ul><li>Model stabilityai/stable-diffusion-xl-base-1.0</li><li>Image size 1024*1024, batch size 1, steps 30</li><li>NVIDIA A100 80G SXM4</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RevV75Oims84gGnI" /></figure><h4>SVD E2E time</h4><ul><li>Model stabilityai/stable-video-diffusion-img2vid-xt</li><li>Image size 576*1024, batch size 1, steps 25, decoder chunk size 5</li><li>NVIDIA A100 80G SXM4</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KO0eG_IRFvs1mBV2" /></figure><p><strong>More about OneDiff</strong>: <a href="https://github.com/siliconflow/onediff?tab=readme-ov-file#about-onediff">https://github.com/siliconflow/onediff?tab=readme-ov-file#about-onediff</a></p><p>We look forward to feedback from the SD community! Join the <a href="https://discord.com/invite/RKJTjZMcPQ"><strong><em>OneDiff Discord group</em></strong></a> to discuss related questions.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=470d8e46fffe" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OneDiff v0.12.1 is released (Stable acceleration of SD and SVD for production environments)]]></title>
            <link>https://medium.com/@SiliconFlowAI/onediff-v0-12-1-is-released-stable-acceleration-of-sd-and-svd-for-production-environment-c33268f8783c?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/c33268f8783c</guid>
            <category><![CDATA[onediff]]></category>
            <category><![CDATA[stable-video-diffusion]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[ai-art]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Fri, 08 Mar 2024 07:09:22 GMT</pubDate>
            <atom:updated>2024-03-08T07:09:22.751Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://github.com/siliconflow/onediff"><strong>OneDiff v0.12.1</strong></a> is now released! This update includes the following highlights; we encourage you to install the new version for a better experience:</p><ul><li><a href="https://github.com/siliconflow/onediff/tree/main?tab=readme-ov-file#state-of-the-art-performance">SOTA performance updates for SDXL and SVD</a></li><li><a href="https://github.com/siliconflow/onediff/blob/55627d50157d4a0c4b484ba76b088c90f39179ff/onediff_diffusers_extensions/examples/text_to_image_sdxl.py#L96">Full support for dynamic-resolution runs of SD and SVD</a></li><li><a href="https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions#compile-save-and-load-pipeline">Compile/Save/Load pipeline for HF diffusers</a></li><li><a href="https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions#fast-lora-loading-and-switching">Fast LoRA loading and switching for HF diffusers</a></li><li><a href="https://www.reddit.com/r/StableDiffusion/comments/1al19ek/instantid_can_run_18x_faster_with_onediff/">Accelerated InstantID (1.8x faster)</a></li><li><a href="https://github.com/siliconflow/onediff/blob/main/onediff_diffusers_extensions/examples/text_to_image_sdxl_light.py">Accelerated SDXL Lightning</a></li></ul><p>Here are the new performance numbers:</p><p><strong>SDXL E2E time</strong></p><ul><li>Model: stabilityai/stable-diffusion-xl-base-1.0</li><li>Image size 1024*1024, batch size 1, steps 30</li><li>NVIDIA A100 80G SXM4</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QJVejYWEC_jh1Xdf" /></figure><p><strong>SVD E2E time</strong></p><ul><li>Model: stabilityai/stable-video-diffusion-img2vid-xt</li><li>Image size 576*1024, batch size 1, steps 25, decoder chunk size 5</li><li>NVIDIA A100 80G SXM4</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ay-BpsRFtMTK9Ees" /></figure><p>Furthermore, we would like to quote an excellent guide written by <a href="https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl">Felix Sanz</a> for those interested in optimizing SDXL:</p><blockquote><strong><em>&#8220;It (OneDiff) improves the visual quality of the result, almost halves inference time and the only penalty is a small wait at compilation time. What a great job!&#8221;</em></strong></blockquote><p>The guide provides a very comprehensive and clear analysis of SDXL inference engines, with OneDiff among those surveyed. Enjoy: <a href="https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl"><strong>https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl</strong></a></p><p>Check below for more details:</p><ul><li>OneDiff 0.12.1 release log: <a href="https://github.com/siliconflow/onediff/releases/tag/0.12.1">https://github.com/siliconflow/onediff/releases/tag/0.12.1</a></li><li>OneDiff roadmap and feedback: <a href="https://github.com/siliconflow/onediff/wiki">https://github.com/siliconflow/onediff/wiki</a></li></ul><p>Follow us on <a href="https://twitter.com/SiliconFlowAI"><strong>X</strong></a>, or join the <a href="https://discord.com/invite/RKJTjZMcPQ"><strong><em>OneDiff Discord group</em></strong></a> to discuss related questions.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlock the Potential of AI and Accelerate into the Future with Us!]]></title>
            <link>https://medium.com/@SiliconFlowAI/unlock-the-potential-of-ai-and-accelerate-into-the-future-with-us-a2868705dc7e?source=rss-91ba7914dfb6------2</link>
            <guid isPermaLink="false">https://medium.com/p/a2868705dc7e</guid>
            <category><![CDATA[text-to-video]]></category>
            <category><![CDATA[onediff]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[text-to-image-generation]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <dc:creator><![CDATA[SiliconFlow]]></dc:creator>
            <pubDate>Thu, 29 Feb 2024 10:30:33 GMT</pubDate>
            <atom:updated>2024-02-29T10:30:33.096Z</atom:updated>
            <content:encoded><![CDATA[<p>As we step into the vibrant year of 2024, we cordially invite you to participate in a groundbreaking AI Inference Acceleration Challenge. Whether you are a seasoned AI professional or just starting your journey, we are eager to hear about your innovative applications and breakthroughs powered by the <a href="https://github.com/siliconflow/onediff"><strong>OneDiff</strong></a> image/video generation inference engine.</p><p><strong>GitHub: </strong><a href="https://github.com/siliconflow/onediff"><strong>https://github.com/siliconflow/onediff</strong></a></p><p>This challenge offers you not only the chance to showcase your technical prowess but also the opportunity to win substantial prizes, <strong>including OneDiff Enterprise Edition licenses and exquisite gifts</strong>. Let us work together to advance AI technology and witness how AI shapes our future.</p><p>Get ready to take action, prepare your case, and join this AI extravaganza! For more details about the event, please refer to the poster below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*nsi_d3EQtWRBzN2B" /></figure><p>Join the <a href="https://discord.com/invite/RKJTjZMcPQ"><strong><em>OneDiff Discord group</em></strong></a> to discuss related questions.</p>]]></content:encoded>
        </item>
    </channel>
</rss>