Support MThreads (MUSA) GPU #1162
Conversation
Signed-off-by: Xiaodong Ye <[email protected]>
Summary of Changes

Hello @yeahdongcn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates support for Moore Threads (MUSA) GPUs into LightLLM, broadening the framework's hardware ecosystem. The changes involve platform-specific device detection, adjustments to a core Triton kernel for MUSA compatibility, and the addition of a utility to identify MUSA's inter-GPU communication technology, setting the stage for future multi-GPU capabilities.
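For orientation, here is a minimal sketch of what platform-specific MUSA device detection can look like, assuming the torch_musa extension registers a `torch.musa` namespace with `is_available()`; it is an illustration, not the exact code in this PR:

```python
# Minimal sketch of MUSA platform detection (illustrative, not the PR's code).
# Assumes torch_musa, once imported, registers a "musa" device and exposes
# torch.musa.is_available().
import importlib.util


def is_musa_platform() -> bool:
    # Cheap check first: bail out if the torch_musa extension is not installed.
    if importlib.util.find_spec("torch_musa") is None:
        return False
    import torch
    import torch_musa  # noqa: F401  # side effect: registers the MUSA backend

    return hasattr(torch, "musa") and torch.musa.is_available()
```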
@helloyongyang Please take a look when you are available. Thanks.
Code Review
This pull request adds support for Moore Threads (MUSA) GPUs, which is a great step towards broader hardware compatibility. The changes include platform detection logic, a new utility to check for MTLink, and a kernel modification for Triton compatibility. My review focuses on improving the robustness of the new device detection logic to prevent potential runtime errors and ensure it works correctly across different system configurations.
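The kind of defensive detection the review is pointing at could look like the sketch below, where any failure degrades to the default platform instead of raising at import time; the module, function, and return values are illustrative assumptions rather than LightLLM's actual API:

```python
# Illustrative sketch only: make device detection robust so that machines
# without the MUSA driver or extension fall back cleanly instead of crashing.
import functools
import logging

logger = logging.getLogger(__name__)


@functools.lru_cache(maxsize=1)
def detect_device_platform() -> str:
    """Return "musa" when a MUSA device is usable, otherwise fall back to "cuda"."""
    try:
        import torch
        import torch_musa  # noqa: F401  # registers torch.musa if present

        if torch.musa.is_available() and torch.musa.device_count() > 0:
            return "musa"
    except (ImportError, AttributeError, RuntimeError) as exc:
        # A missing extension, a partially initialized backend, or a driver
        # error should not take the whole server down at import time.
        logger.debug("MUSA detection failed, falling back to CUDA: %s", exc)
    return "cuda"
```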
Thanks for your contribution! I will review it as soon as I can.
With:
root@worker3218:/ws# python -m lightllm.server.api_server --model_dir /home/dist/Qwen3-0.6B/ --disable_cudagraph --host 0.0.0.0 --mode triton_gqa_attention
WARNING 01-05 15:14:17 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it.
WARNING 01-05 15:14:17 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-05 15:14:18 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-05 15:14:18 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-05 15:14:18 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-05 15:14:18 [__init__.py:232] Platform plugin musa is activated
WARNING 01-05 15:14:18 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`.
INFO 01-05 15:14:18 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-05 15:14:18 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-05 15:14:18 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
WARNING 01-05 15:14:18 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
INFO 01-05 15:14:19 [shm_size_check.py:21] SHM check: Available=500.00 GB,Recommended=2.32 GB.Sufficient: True
INFO 01-05 15:14:19 [api_start.py:94] zmq mode head: ipc:///tmp/_28765_0_
INFO 01-05 15:14:19 [api_start.py:96] use tgi api: False
INFO 01-05 15:14:19 [api_start.py:233] alloced ports: [10174, 10010, 10228, 10304, 10034, 10148, 10298, 10063, 10113, 10232]
INFO 01-05 15:14:19 [api_start.py:284] all start args:Namespace(run_mode='normal', host='0.0.0.0', port=8000, httpserver_workers=1, zmq_mode='ipc:///tmp/_28765_0_', pd_master_ip='0.0.0.0', pd_master_port=1212, pd_decode_rpyc_port=42000, select_p_d_node_strategy='round_robin', config_server_host=None, config_server_port=None, nixl_pd_kv_page_num=16, nixl_pd_kv_page_size=1024, model_name='default_model_name', model_dir='/home/dist/Qwen3-0.6B/', tokenizer_mode='fast', load_way='HF', max_total_token_num=None, mem_fraction=0.9, batch_max_tokens=8448, eos_id=[151645], tool_call_parser=None, reasoning_parser=None, chat_template=None, running_max_req_size=1000, nnodes=1, node_rank=0, multinode_httpmanager_port=12345, multinode_router_gloo_port=20001, tp=1, dp=1, dp_balancer='bs_balancer', max_req_total_len=16384, nccl_host='127.0.0.1', nccl_port=28765, use_config_server_to_init_nccl=False, mode=['triton_gqa_attention'], trust_remote_code=False, disable_log_stats=False, log_stats_interval=10, disable_shm_warning=False, router_token_ratio=0.0, router_max_new_token_len=1024, router_max_wait_tokens=1, disable_aggressive_schedule=False, use_dynamic_prompt_cache=False, disable_dynamic_prompt_cache=False, chunked_prefill_size=4096, disable_chunked_prefill=False, diverse_mode=False, token_healing_mode=False, output_constraint_mode='none', first_token_constraint_mode=False, enable_multimodal=False, enable_multimodal_audio=False, enable_mps=False, disable_custom_allreduce=False, enable_custom_allgather=False, enable_tpsp_mix_mode=False, enable_dp_prefill_balance=False, enable_prefill_microbatch_overlap=False, enable_decode_microbatch_overlap=False, enable_flashinfer_prefill=False, enable_flashinfer_decode=False, enable_fa3=False, cache_capacity=200, embed_cache_storage_size=4, data_type='bfloat16', return_all_prompt_logprobs=False, use_reward_model=False, long_truncation_mode=None, use_tgi_api=False, health_monitor=False, metric_gateway=None, job_name='lightllm', grouping_key=[], push_interval=10, visual_infer_batch_size=1, visual_send_batch_size=1, visual_gpu_ids=[0], visual_tp=1, visual_dp=1, visual_nccl_ports=[29500], enable_monitor_auth=False, disable_cudagraph=True, enable_prefill_cudagraph=False, prefll_cudagraph_max_handle_token=512, graph_max_batch_size=256, graph_split_batch_size=32, graph_grow_step_size=16, graph_max_len_in_batch=16384, quant_type='none', quant_cfg=None, vit_quant_type='none', vit_quant_cfg=None, sampling_backend='triton', penalty_counter_mode='gpu_counter', ep_redundancy_expert_config_path=None, auto_update_redundancy_expert=False, enable_fused_shared_experts=False, mtp_mode=None, mtp_draft_model_dir=None, mtp_step=0, kv_quant_calibration_config_path=None, schedule_time_interval=0.03, enable_cpu_cache=False, cpu_cache_storage_size=2, cpu_cache_token_page_size=256, enable_disk_cache=False, disk_cache_storage_size=10, disk_cache_dir=None, enable_dp_prompt_cache_fetch=False, router_port=10174, detokenization_port=10010, http_server_port=10228, visual_port=10304, audio_port=10034, cache_port=10148, metric_port=10298, multi_level_kv_cache_port=10063, pd_node_infer_rpyc_ports=[10232], pd_node_id=191377501306505751402092569801999789346, pd_p_allowed_port_min=20000, pd_p_allowed_port_max=30000)
WARNING 01-05 15:14:25 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it.
WARNING 01-05 15:14:25 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-05 15:14:26 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-05 15:14:26 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-05 15:14:26 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-05 15:14:26 [__init__.py:232] Platform plugin musa is activated
WARNING 01-05 15:14:26 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`.
INFO 01-05 15:14:26 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
2026-01-05 15:14:26 | server | 140121091523712 | INFO : server started on [0.0.0.0]:10298
INFO 01-05 15:14:26 [start_utils.py:37] init func start_metric_manager : init ok
WARNING 01-05 15:14:32 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it.
WARNING 01-05 15:14:32 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-05 15:14:32 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it.
WARNING 01-05 15:14:32 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-05 15:14:33 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-05 15:14:33 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-05 15:14:33 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-05 15:14:33 [__init__.py:232] Platform plugin musa is activated
WARNING 01-05 15:14:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`.
INFO 01-05 15:14:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-05 15:14:33 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-05 15:14:33 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-05 15:14:33 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-05 15:14:33 [__init__.py:232] Platform plugin musa is activated
WARNING 01-05 15:14:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`.
INFO 01-05 15:14:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-05 15:14:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
INFO 01-05 15:14:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-05 15:14:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
WARNING 01-05 15:14:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
INFO 01-05 15:14:33 [manager.py:36] pub_to_httpserver sendhwm 1000
WARNING 01-05 15:14:33 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
2026-01-05 15:14:33 | server | 140121091523712 | INFO : accepted ('127.0.0.1', 59630) with fd 25
2026-01-05 15:14:33 | server | 140089932039744 | INFO : welcome ('127.0.0.1', 59630)
INFO 01-05 15:14:38 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-05 15:14:39 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it.
INFO 01-05 15:14:40 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-05 15:14:40 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-05 15:14:40 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-05 15:14:40 [__init__.py:232] Platform plugin musa is activated
WARNING 01-05 15:14:40 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`.
WARNING 01-05 15:14:40 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-05 15:14:40 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
INFO 01-05 15:14:40 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
WARNING 01-05 15:14:40 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
INFO 01-05 15:14:40 [model_rpc.py:67] Initialized RPC server for rank 0.
INFO 01-05 15:14:40 [model_rpc.py:168] use ChunkedPrefillBackend
INFO 01-05 15:14:41 [basemodel.py:157] Initial quantization. The default quantization method is none
pid 61063 Loading model weights with 1 workers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.04it/s]
INFO 01-05 15:14:42 [mem_utils.py:37] mode setting params: ['triton_gqa_attention']
INFO 01-05 15:14:42 [mem_utils.py:57] Model kv cache using mode normal
INFO 01-05 15:14:42 [mem_manager.py:84] 69.54442710876465 GB space is available after load the model weight
INFO 01-05 15:14:42 [mem_manager.py:84] 0.109375 MB is the size of one token kv cache
INFO 01-05 15:14:42 [mem_manager.py:84] 651094 is the profiled max_total_token_num with the mem_fraction 0.9
INFO 01-05 15:14:42 [mem_manager.py:84]
warming up: 0%| | 0/12 [00:00<?, ?it/s]WARNING 01-05 15:14:53 [autotuner.py:169] No kernel config for silu_and_mul_fwd:v1 in {N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json,the performance may be suboptimal!You can use LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 to enable autotune.
WARNING 01-05 15:14:53 [kernel_config.py:40] can not find config_path /ws/lightllm/common/all_kernel_configs/moe_silu_and_mul_kernel/{N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json kernel name moe_silu_and_mul_kernel use default kernel setting
warming up: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:15<00:00, 1.31s/it]
INFO 01-05 15:15:00 [basemodel.py:812] begin check max_len infer
INFO 01-05 15:15:00 [basemodel.py:849] check max_len 8448 infer ok
INFO 01-05 15:15:15 [base_backend.py:185] loaded model class <class 'lightllm.models.qwen3.model.Qwen3TpPartModel'>
INFO 01-05 15:15:15 [manager.py:196] use req queue ChunkedPrefillQueue
INFO 01-05 15:15:15 [start_utils.py:37] init func start_router_process : init ok
INFO 01-05 15:15:15 [start_utils.py:37] init func start_detokenization_process : init ok
INFO 01-05 15:15:15 [api_start.py:58] start process pid 30185
INFO 01-05 15:15:15 [api_start.py:59] http server pid 56619
[2026-01-05 15:15:15 +0800] [56619] [INFO] Starting gunicorn 23.0.0
[2026-01-05 15:15:15 +0800] [56619] [INFO] Listening at: http://0.0.0.0:8000 (56619)
[2026-01-05 15:15:15 +0800] [56619] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2026-01-05 15:15:15 +0800] [57040] [INFO] Booting worker with pid: 57040
WARNING 01-05 15:15:22 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it.
WARNING 01-05 15:15:22 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-05 15:15:22 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-05 15:15:22 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-05 15:15:22 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-05 15:15:22 [__init__.py:232] Platform plugin musa is activated
WARNING 01-05 15:15:22 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`.
INFO 01-05 15:15:22 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-05 15:15:22 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-05 15:15:22 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
[2026-01-05 15:15:23 +0800] [57040] [INFO] Started server process [57040]
[2026-01-05 15:15:23 +0800] [57040] [INFO] Waiting for application startup.
INFO 01-05 15:15:23 [api_http.py:359] server start up
2026-01-05 15:15:23 | server | 140121091523712 | INFO : accepted ('127.0.0.1', 39874) with fd 26
2026-01-05 15:15:23 | server | 140089923647040 | INFO : welcome ('127.0.0.1', 39874)
2026-01-05 15:15:23 | server | 140121091523712 | INFO : accepted ('127.0.0.1', 39880) with fd 27
2026-01-05 15:15:23 | server | 140089915254336 | INFO : welcome ('127.0.0.1', 39880)
INFO 01-05 15:15:24 [req_id_generator.py:34] ReqIDGenerator init finished
INFO 01-05 15:15:24 [api_http.py:363] server start up ok, loop use is <uvloop.Loop running=True closed=False debug=False>
[2026-01-05 15:15:24 +0800] [57040] [INFO] Application startup complete.
INFO 01-05 15:15:24 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-05 15:15:24 lightllm_req_id:8
INFO 01-05 15:15:24 [manager.py:424] router recive req id 8 cost time 0.04026627540588379 s
DEBUG 01-05 15:15:24 [manager.py:322] Prefill Batch: batch_id=-1, time:1767597324.751087s req_ids:[8]
DEBUG 01-05 15:15:24 [manager.py:322]
INFO 01-05 15:15:24 [manager.py:55] detokenization recv req id 8 cost time 0.06332850456237793 s
DEBUG 01-05 15:15:24 [manager.py:253] dp_i 0 current batch size: 1
DEBUG 01-05 15:15:24 [manager.py:253] dp_i 0 paused req num: 0
DEBUG 01-05 15:15:24 [manager.py:253] dp_i 0 frozen token num: 0
DEBUG 01-05 15:15:24 [manager.py:253] dp_i 0 estimated_peak_token_count: 39
DEBUG 01-05 15:15:24 [manager.py:253] dp_i 0 token used ratio: 6.1435061604008025e-06 not contain prompt cache tree unrefed token
DEBUG 01-05 15:15:24 [manager.py:253] dp_i 0 token used ratio: 6.1435061604008025e-06 contain prompt cache tree unrefed token
DEBUG 01-05 15:15:27 [req_manager.py:78] freed all request size 1008
INFO 01-05 15:15:27 [manager.py:163] detoken release req id 8
DEBUG 01-05 15:15:27 [infer_batch.py:172] free a batch state:
DEBUG 01-05 15:15:27 [infer_batch.py:172] radix refed token num 0
DEBUG 01-05 15:15:27 [infer_batch.py:172] radix hold token num 21
DEBUG 01-05 15:15:27 [infer_batch.py:172] mem manager can alloc token num 651073
DEBUG 01-05 15:15:27 [infer_batch.py:172] mem manager total size 651094
INFO 01-05 15:15:27 [manager.py:611] X-Request-Id: X-Session-Id: start_time:2026-01-05 15:15:24 lightllm_req_id:8 first_token_cost:2299.657106399536ms total_cost_time:2559.7667694091797ms,out_token_counter:17 mean_per_token_cost_time: 15.300568412331973ms prompt_token_num:4 gpu cache hit: False gpu_prompt_cache_len:0 gpu_prompt_cache_ratio:0.0 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0
127.0.0.1:46578 - "POST /generate HTTP/1.1" 200
INFO 01-05 15:15:27 [batch.py:56] router release req id 8
INFO 01-05 15:15:27 [shm_req_manager.py:111] all shm req has been release ok
- Restore MUSA support in lightllm/__init__.py (from PR #1162)
- Restore mtp_avg_token_per_step metrics (from PR #1169)
- Restore diverse_stage2 kernel and configs (from PR #1174)
- Restore test files for stage1/stage2 tuning
- Fix linting issues in restored test file

Co-Authored-By: Claude Opus 4.5 <[email protected]>
This PR adds support for the Moore Threads (MUSA) GPU platform, expanding LightLLM's hardware compatibility.
NOTE:
- `_fwd_kernel_token_att1` has been slightly updated to ensure compatibility with the Triton version.
- `has_mtlink` will be used in upcoming enhancements to enable multi-GPU support (a rough sketch of a possible consumer follows below).
- torch/torch_musa need to be upgraded to the latest versions.

Testing Done: see the server startup and generate-request log earlier in this thread.
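As a rough, hypothetical illustration of the `has_mtlink` item in the list above (not code from this PR; the import path and backend names are assumptions), such a utility could gate future multi-GPU communication choices like this:

```python
# Hypothetical consumer of a has_mtlink()-style utility; the import path and
# backend names below are assumptions for illustration only.
from lightllm.utils.device_utils import has_mtlink  # assumed location


def pick_collective_backend(world_size: int) -> str:
    if world_size <= 1:
        return "none"  # single GPU: no collective communication needed
    if has_mtlink():
        # MTLink-connected GPUs could take a faster peer-to-peer all-reduce path.
        return "custom_allreduce"
    # Otherwise fall back to the generic collective library.
    return "mccl"
```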