Simple & Scalable Pretraining for Neural Architecture Research
ArchScale is a comprehensive toolkit for training and evaluating neural language models with a focus on architecture and scaling laws. It provides implementations of various state-of-the-art architectures, scaling techniques, training optimizations and evaluation tools in a unified codebase.
- [Mar. 30] Released the code for MoE training with HyperP (Hypersphere Parameterization) scaling, SqrtGate, and MuonH optimizer!
- [Sept. 25] Phi-4-mini-flash has been accepted by NeurIPS 2025!
- [July 25] Released the code for large-scale pre-training of Phi-4-mini-flash!
- [July 25] Released the code for training Decoder-Hybrid-Decoder Architectures (poster) with μP++, and the model checkpoint for Phi-4-mini-flash-reasoning
- Architectures: Transformers (MHA/GQA), various SSM/attention modules, Gated Memory Unit, YOCO, Differential Attention and flexible hybrid stacks (SambaY, Phi-4-mini-Flash, etc.).
- Mixture-of-Experts: Fine-grained token-choice routing with SonicMoE acceleration, shared experts, SqrtGate and global auxiliary load-balancing loss.
- Scaling Laws: HyperP, μP++, μP, Chinchilla FLOPs scaling, data scaling, and scaling laws for batch size, weight decay, and MoE granularity.
- Optimizers: MuonH, Muon, AdamW, and hybrid optimizer support.
- Research-Friendly: Easily add or modify architectures, scaling laws, optimizers, schedules, and initializations, with a WYSIWYG philosophy for experiment logging.
- Performance: 🚀 Flash-Attention 4 + SonicMoE + FSDP2 distributed training, mixed precision, FP8/MXFP8 via TransformerEngine, activation checkpointing, and CPU offloading.
- Training: Data mixture support (JSON config), packed dataset with pre-tokenization, variable-length training for long-context, fused cross-entropy for large vocabulary stability, batch size ramp-up scheduling, and orthogonal/diagonal weight initialization.
- Evaluation: lm-eval integration for standard NLP benchmarks, long-context evaluation on RULER and Phonebook, Proof-Pile perplexity evaluation, LightEval-based reasoning evaluation (AIME, MATH-500, GPQA), and scaling curve fitting via plotting scripts.
We provide install.sh for bootstrapping the training environment. The script auto-detects GPU type (GB200/B100 vs H100/H200) and installs the appropriate PyTorch wheel along with all dependencies:
bash install.sh
source .venv/bin/activate
This installs PyTorch, Lightning, Flash Attention (v2/v3/v4), TransformerEngine, Mamba, causal-conv1d, flash-linear-attention, SonicMoE, and other required packages.
One can refer to the Samba codebase for SlimPajama data tokenization. We also provide the pre-tokenized SlimPajama data here.
Train Mixture-of-Experts models with SonicMoE acceleration:
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py \
--train_data_dir path/to/slim_pajama/data --val_data_dir path/to/slim_pajama/data \
--train_model transformer_gqa4_h2_moe --depth 8 \
--sparsity 8 --top_k 8 --share_expert true --global_aux true --sqrt_gate true \
--train_name v2scale_mup_muonh_ga_qknorm_sgate_shexp --fsdp2 true
Key MoE options include --sparsity (number of experts = sparsity * top_k), --top_k (experts per token), --share_expert (one shared dense expert), --global_aux (global load-balancing loss), and --sqrt_gate (SqrtGate for stable MoE granularity scaling under hypersphere optimization). MoE FLOPs scaling is supported with the same μP++ and HyperP transfer:
for depth in 8 12 16 20 24; do
act_ckpt=false
if [[ ${depth} -ge 20 ]]; then
act_ckpt=true
fi
torchrun --nnodes=8 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py \
--train_data_dir path/to/slim_pajama/data --val_data_dir path/to/slim_pajama/data \
--train_model transformer_gqa4_h2_moe --depth ${depth} --base_hps.warmup_tokens=0 \
--sparsity 8 --top_k 8 --share_expert true --global_aux true --sqrt_gate true \
--base_hps.t0=10.4e9 --base_hps.min_lr_mult=0.1 --resume="auto" --micro_bsz=16 --act_ckpt=${act_ckpt} \
--train_name v2scale_mup_muonh_ga_qknorm_sgate_shexp --fsdp2 true
done
This trains MoE models of up to 22.9B total parameters on 32xGB200.
See launch_scripts/ for more templates on MoE/dense model training and hyperparameter tuning with various scaling and ablation configurations explored in the paper.
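As a quick sanity check on the MoE flags described above, the following sketch works through the stated arithmetic (number of experts = sparsity * top_k) for the values used in the commands. The names `MoELayout` and `moe_layout` are hypothetical and not part of ArchScale, and counting the shared expert as one extra always-active expert is an assumption based on the flag description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayout:
    num_experts: int      # routed experts per MoE layer
    active_experts: int   # experts evaluated per token (routed + shared)


def moe_layout(sparsity: int, top_k: int, share_expert: bool) -> MoELayout:
    """Derive expert counts from the CLI-style MoE options (illustrative only)."""
    # Stated in the README: number of experts = sparsity * top_k.
    num_experts = sparsity * top_k
    # Assumption: the shared dense expert is always active on top of the top_k routed ones.
    active_experts = top_k + (1 if share_expert else 0)
    return MoELayout(num_experts=num_experts, active_experts=active_experts)


layout = moe_layout(sparsity=8, top_k=8, share_expert=True)
print(layout.num_experts)     # 64
print(layout.active_experts)  # 9
```

With --sparsity 8 and --top_k 8, each token is routed to 8 of 64 experts, so only one eighth of the routed experts is evaluated per token.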
Training SambaY models across scales from 110M to 3.3B parameters with μP++ and Chinchilla token scaling on 8 GPUs is as simple as:
for depth in 8 12 16 20 24; do
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py \
--train_data_dir path/to/slim_pajama/data --val_data_dir path/to/slim_pajama/data \
--train_model sambay --depth ${depth} \
--train_name scaling_mup
done
In the backend, the BaseHyperparameters dataclass defines the optimization-related hyperparameters (HPs) for a d8 (depth=8) model, and the scaling laws defined in the setup function transfer these HPs to the actual HPs used at the target depth, such as d12, d16, or d24. After training finishes, we can use the plotting scripts to fit the scaling curves and compare the fitted scaling parameters between different architectures.
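To illustrate what fitting a scaling curve involves, here is a minimal sketch that fits a pure power law, loss = a * compute^(-b), by least squares in log-log space. The data points are synthetic and the functional form is an assumption chosen for simplicity; the plotting scripts in the repository may use a different form (for example, one with an irreducible-loss term). The function name `fit_power_law` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**(-b) with ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic, noise-free points generated from a known power law (not real results).
compute = [1e18, 3e18, 1e19, 3e19, 1e20]
loss = [50.0 * c ** (-0.07) for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.4f}")
```

Comparing the fitted exponent b between two architectures trained on the same compute grid gives a rough indication of which one improves faster with scale.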
To study the data scaling law, we can scale from 100B to 600B tokens for a 1B-parameter Transformer++ model with μP++ and tied embeddings on 64 GPUs using the following script:
for tok in 1e11 2e11 3e11 4e11 5e11 6e11; do
torchrun --nnodes=8 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py \
--train_data_dir path/to/slim_pajama/data --val_data_dir path/to/slim_pajama/data \
--train_model transformer --depth 16 --max_tokens ${tok} \
--train_name scaling_mup_tie
done
After shuffling and pre-tokenizing the ProLong-64K data (Pre-tokenized data is here!), we can train a d16 model with 32K sequence length and 40B tokens on 8 GPUs using the following script:
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} pretrain.py \
--train_data_dir path/to/prolong/data --val_data_dir path/to/prolong/data \
--train_model transformer --depth 16 --ctx_len 32768 --max_tokens 4e10 \
--train_name scaling_mup_rbase_varlen
where the rbase token in the train_name makes the model use a larger RoPE base for long-context training, and varlen applies variable-length training that separates documents based on the EOS tokens. Our codebase currently supports training with a maximum sequence length of 128K for a d20 model with --fsdp_save_mem=true.
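To make the varlen behaviour concrete, here is a minimal sketch of how a packed token sequence could be split at EOS tokens into cumulative sequence lengths (`cu_seqlens`), the boundary format commonly consumed by variable-length attention kernels. The function name, the EOS id, and the convention that an EOS token closes its own document are assumptions for illustration, not ArchScale's implementation.

```python
from typing import List


def cu_seqlens_from_eos(tokens: List[int], eos_id: int) -> List[int]:
    """Return cumulative document boundaries for a packed token sequence."""
    boundaries = [0]
    for i, tok in enumerate(tokens):
        if tok == eos_id:
            # Assumption: the EOS token belongs to the document it terminates.
            boundaries.append(i + 1)
    # A trailing document without EOS still needs a closing boundary.
    if boundaries[-1] != len(tokens):
        boundaries.append(len(tokens))
    return boundaries


# Hypothetical packed sequence with eos_id=2: three documents of length 3, 4, and 1.
packed = [5, 6, 2, 7, 8, 9, 2, 4]
print(cu_seqlens_from_eos(packed, eos_id=2))  # [0, 3, 7, 8]
```

Document j then covers positions cu_seqlens[j] through cu_seqlens[j+1] - 1, so attention can be restricted to tokens within the same document.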
For variable length training on Mamba-1 based models, extra dependencies need to be installed:
git clone https://github.com/zigzagcai/varlen_mamba.git --branch feat/add-cu_seqlens
cd varlen_mamba
pip install --no-build-isolation -e .
ArchScale provides comprehensive evaluation support for trained models across multiple domains:
Evaluate trained models on common language understanding tasks for SambaY architecture with multiple GPUs:
accelerate launch eval.py --model ArchScale \
--model_args pretrained=path/to/checkpoint.pth,config="sambay_d16" \
--tasks wikitext,lambada_openai,arc_easy,arc_challenge,winogrande,hellaswag,piqa,social_iqa \
--device cuda --batch_size 16 --trust_remote_code
The script infers the μP++ and architecture modifications from the name of the checkpoint path.
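The name-based configuration used by the training and evaluation scripts can be pictured with a short sketch. It assumes that switches are underscore-separated tokens in the name; the tokens `mup`, `rbase`, and `varlen` come from the examples in this README, while the function name and the returned keys are hypothetical, and the actual parsing in the repository may differ.

```python
from typing import Dict


def parse_train_name(train_name: str) -> Dict[str, bool]:
    """Map underscore-separated tokens in a run name to feature switches (illustrative)."""
    tokens = set(train_name.split("_"))
    return {
        "mup": "mup" in tokens,        # assumed to enable the muP-style HP transfer
        "rbase": "rbase" in tokens,    # larger RoPE base for long-context training
        "varlen": "varlen" in tokens,  # variable-length training split at EOS tokens
    }


flags = parse_train_name("scaling_mup_rbase_varlen")
print(flags)  # {'mup': True, 'rbase': True, 'varlen': True}
```

Under this scheme, a run named scaling_mup would enable only the `mup` switch, which is why keeping the name in the checkpoint path lets an evaluation script rebuild the same model configuration.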
Evaluate long-context perplexity using the Proof-Pile dataset with sliding window inference (following LongLoRA):
python eval_proofpile.py \
--checkpoint_path path/to/checkpoint.pth \
--config "sambay_d16" \
--seq_length 32768 \
--sliding_window 256 \
--batch_size 1
Evaluate long-context capabilities using the RULER benchmark with multiple GPUs:
accelerate launch eval.py --model ArchScale \
--model_args pretrained=path/to/checkpoint.pth,config="sambay_d16",max_length=32768,tokenizer=Orkhan/llama-2-7b-absa \
--metadata='{"max_seq_lengths":[32768]}' \
--tasks niah_single_1 --device cuda --batch_size 8 --trust_remote_code
This runs a simple needle-in-a-haystack task at 32K context length.
Test long-context retrieval using the Phonebook benchmark with 32K context length:
python eval_phonebook.py \
--checkpoint_path path/to/checkpoint.pth \
--config "model_config" \
--min_eval_len 1850 \
--max_eval_len 1850 \
--output_dir results_dir \
--eval_batch_size 4
Evaluate reasoning capabilities on mathematical and scientific tasks (AIME, MATH-500, GPQA) using LightEval with vLLM backend:
./eval_reason/eval_reason.sh 42 microsoft/Phi-4-mini-flash-reasoning aime24 output_dir
The reasoning evaluation supports multi-GPU evaluation with configurable generation parameters (temperature, top-p, max tokens). The script requires lighteval and math-verify dependencies. We currently provide the vLLM inference support in this PR.
The plots/ directory provides scripts for fitting and visualizing scaling curves:
python plots/plot_moe_scaling_comparison.py # MoE FLOPs scaling curves
python plots/plot_scaling_comparison.py # Dense model scaling comparison
python plots/plot_muonh_comparison.py # MuonH vs baseline comparison
python plots/plot_bsz_scaling.py # Batch size scaling analysis
python plots/plot_stability.py # Training stability analysis
If you find our work useful, please consider citing:
@article{ren2026rethinking,
title = {Rethinking Language Model Scaling under Transferable Hypersphere Optimization},
author = {Liliang Ren and Yang Liu and Yelong Shen and Weizhu Chen},
year = {2026},
journal = {arXiv preprint arXiv:2603.28743}
}
@software{archscale2025,
title={ArchScale: Simple and Scalable Pretraining for Neural Architecture Research},
author={Liliang Ren and Zichong Li and Yelong Shen},
year={2025},
url={https://github.com/microsoft/ArchScale}
}
@article{ren2025decoder,
title={Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation},
author={Liliang Ren and Congcong Chen and Haoran Xu and Young Jin Kim and Adam Atkinson and Zheng Zhan and Jiankai Sun and Baolin Peng and Liyuan Liu and Shuohang Wang and Hao Cheng and Jianfeng Gao and Weizhu Chen and Yelong Shen},
journal={arXiv preprint arXiv:2507.06607},
year={2025}
}
This project is licensed under the MIT License - see the LICENSE file for details.
Happy scaling! 🚀



