2025-10-30 🌟 We are pleased to announce the release of the first version of the XiYan-SQL training framework, XiYan-SQLTraining. We welcome everyone to use it, and we will continue to enhance the framework in future releases.
XiYan-SQLTraining, developed by the XiYan team, is a post-training framework designed specifically for the Text-to-SQL task. It currently supports the following capabilities:
- Conversion of raw data to training data
- Training data augmentation
- Fine-tuning basic models for Text2SQL tasks
- Training the XiYanSQL MOE multi-dialect model
- Model inference/evaluation
- Continued GRPO training for Text2SQL
- Integration of different types of SQL models
- ...

The framework is continuously being improved, and we welcome contributions from users!
- Create a Conda environment. Use the following commands to create and activate a new environment for training:

```bash
conda create -n xiyansql python=3.10
conda activate xiyansql
```

- Install dependencies. After activating the environment, run the following command to install the required dependencies:

```bash
pip install -r requirements.txt
```

NVIDIA driver CUDA versions 11.8-12.4 have been tested and are compatible, but the versions of the required dependencies can be upgraded as needed.
Please prepare the data as a JSON list, where each entry follows this structure:

```json
[
    {
        "id": 0,
        "conversations": [
            {
                "role": "user",
                "content": "You are an SQLite expert, xxx..."
            },
            {
                "role": "assistant",
                "content": "SELECT xxx..."
            }
        ],
        "sql_type": "sqlite"
    },
    {
        "id": 1,
        "conversations": [
            {
                "role": "user",
                "content": "You are an SQLite expert, xxx..."
            },
            {
                "role": "assistant",
                "content": "SELECT xxx..."
            }
        ],
        "sql_type": "sqlite"
    }
]
```

An example training data file can be found at `train/datasets/train_examples.json`.
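Before launching a run, it can be useful to sanity-check a training file against this format. The helper below is a minimal sketch (hypothetical, not part of the repo); the key names match the example above:

```python
# Hypothetical helper: validate that a training file matches the
# JSON-list format shown above (id / conversations / sql_type keys,
# with a user turn followed by an assistant turn).
import json

REQUIRED_KEYS = {"id", "conversations", "sql_type"}

def validate_training_file(path: str) -> int:
    """Return the number of entries; raise AssertionError on a malformed one."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    assert isinstance(data, list), "top level must be a JSON list"
    for entry in data:
        missing = REQUIRED_KEYS - entry.keys()
        assert not missing, f"entry {entry.get('id')} is missing keys {missing}"
        roles = [turn["role"] for turn in entry["conversations"]]
        assert roles == ["user", "assistant"], f"unexpected role sequence: {roles}"
    return len(data)
```

Running it on `train/datasets/train_examples.json` should return the number of training examples without raising.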
You can also construct training data from raw data. The processing scripts are located in the `data/` folder:
- First, process the raw data. It is advisable to create a separate folder under `data_warehouse` for each data chunk, e.g., `data_warehouse/bird_train`. You can then generate a processed, integratable dataset using the following command:

```bash
bash data_processing.sh
```

The input parameters are `raw_data_path` (path to the raw data), `db_conn_config` (database configuration), `processed_data_dir` (path to save the processed data), `save_mschema_dir` (whether to save the M-Schema file), and `save_to_configs` (whether to register the processed data in the data configuration file).
This step mainly involves reading the database to generate the M-Schema representation of the database schema, and writing the processed data into a complete configuration-file warehouse for easy selection in subsequent use.
A usage example is provided in `data_processing.sh`.
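The M-Schema representation itself is produced by the script above; as a rough illustration of its first step, reading table structure out of a database, here is a hedged sketch using Python's built-in `sqlite3` module. The function name is hypothetical, and the actual M-Schema output of `data_processing.sh` is considerably richer (it includes example values, keys, and descriptions):

```python
# Illustrative only: list tables and their (column, type) pairs from a
# SQLite database -- the kind of information the processing step reads
# before rendering it into the M-Schema format.
import sqlite3

def dump_schema(db_path: str) -> dict:
    """Map each table name to its list of (column_name, column_type) pairs."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        return {
            t: [(col[1], col[2])
                for col in conn.execute(f"PRAGMA table_info('{t}')")]
            for t in tables
        }
    finally:
        conn.close()
```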
- Data assembly packages one or more processed datasets into the final training data:

```bash
bash data_assembler.sh
```

The input parameter `dataset_config_path` is the data configuration file, which can contain multiple dataset blocks, and `save_path` is the final output path for the training data.
This step assembles and processes the data and formats the training examples according to the prompts.
A usage example is provided in `data_assembler.sh`.
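To make the assembly idea concrete, here is a hedged sketch of merging several processed dataset files into one training file. The config keys (`datasets`, `path`, `sample_ratio`) are hypothetical and stand in for the repo's actual configuration schema:

```python
# Sketch of data assembly: sample from each configured dataset block,
# shuffle, reassign contiguous ids, and write one training file.
# The configuration schema here is assumed, not the repo's actual one.
import json
import random

def assemble(dataset_config_path: str, save_path: str, seed: int = 42) -> int:
    with open(dataset_config_path, encoding="utf-8") as f:
        config = json.load(f)
    rng = random.Random(seed)
    merged = []
    for block in config["datasets"]:
        with open(block["path"], encoding="utf-8") as f:
            entries = json.load(f)
        # Optionally down-sample a dataset via its sample_ratio.
        k = int(len(entries) * block.get("sample_ratio", 1.0))
        merged.extend(rng.sample(entries, k))
    rng.shuffle(merged)
    for i, entry in enumerate(merged):  # reassign contiguous ids
        entry["id"] = i
    with open(save_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)
```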
The overall process is located in the train/ folder:
- Prepare the model. A model download script is provided in `train/utils`; choose a download source based on your network conditions:

```bash
python model_download.py
```

- The SFT training script is `xiyan_sft.sh`:

```bash
bash xiyan_sft.sh
```

You need to prepare the training data, model, and training hyperparameters as described above. For larger models, consider enabling LoRA (we recommend starting with a Qwen2.5-series model).
- If training with LoRA, you need to merge the saved adapter back into the base model; the script for this can be found in `utils/adapter_merge.py`.
The overall process is in the `evaluation/` folder; it is recommended to keep the data for each evaluation in a separate folder, such as `evaluation/bird_evaluation`.
- Model inference:

```bash
bash sql_infer.sh
```

The input parameters are `model_name_or_path` (model path), `expr_version` (version number), `test_set_path` (test set path), and `batch_size` (concurrent processing size).
- Evaluation of inference results:

```bash
bash sql_eval.sh
```

The input parameters are `pred_sql_path` (path to the predicted SQL), `test_sql_path` (test set path containing the ground-truth SQL), `db_conn_config` (database configuration), and `save_eval_path` (path to save the evaluation results).
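Text-to-SQL evaluations of this kind typically score execution accuracy: a prediction is correct if it executes and returns the same result set as the ground-truth SQL on the target database. A minimal sketch of that comparison, assuming SQLite and ignoring row order (the repo's `sql_eval.sh` may differ in details such as column-order and duplicate handling):

```python
# Sketch of execution-accuracy checking: run predicted and gold SQL
# against the same database and compare the returned rows as multisets.
import sqlite3

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """True if both queries execute and return the same multiset of rows."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # a non-executable prediction counts as wrong
        gold_rows = conn.execute(gold_sql).fetchall()
        # Compare order-insensitively; repr() keeps mixed types sortable.
        return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
    finally:
        conn.close()
```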
If you're interested in our research or products, please feel free to contact us.
Yifu Liu, zhencang.lyf@alibaba-inc.com
We also welcome you to try XiYan GBI, the intelligent query solution built on XiYanSQL: log into Alibaba Cloud Bailian, open the Application Square, and select XiYan GBI. Feedback on the product experience and suggestions for improving its results are welcome.
For product introduction, please visit: https://help.aliyun.com/zh/model-studio/user-guide/brief-introduction-of-gbi-products
To experience the product, please visit: https://bailian.console.aliyun.com/xiyan
Product DingTalk group: 94725009401
If you find our work helpful, please cite us:

```bibtex
@article{XiYanSQL,
      title={XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL},
      author={Yifu Liu and Yin Zhu and Yingqi Gao and Zhiling Luo and Xiaoxia Li and Xiaorong Shi and Yuntao Hong and Jinyang Gao and Yu Li and Bolin Ding and Jingren Zhou},
      year={2025},
      eprint={2507.04701},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04701},
}
```