Structural Reasoning Improves Molecular Understanding of LLM
Summary: The MSR framework explicitly incorporates key structural features into LLM reasoning. It supports both known target molecules (with explicit structural information) and unknown target molecules (through structural feature guidance).
git clone https://github.com/yunhuijang/MSR.git
cd MSR
conda env create -f environment.yaml
conda activate msr
We provide the following fine-tuned checkpoints:
| Model | Description | Path |
| --- | --- | --- |
| ChemT5-small-analytic | Molecule-to-text model | output/chemt5-small-analytic/ |
| ChemT5-small-t2m-reason | Text-to-molecule reasoning model | output/chemt5-small-t2m-reason/ |
| ChemT5-small-t2m-answer | Text-to-molecule answer model | output/chemt5-small-t2m-answer/ |
python model/one_stage_generator_mol2text.py \
--architecture multitask-text-and-chemistry-t5-small-standard \
--cot_mode multiset_formula-chain-aromatic-con_ring_name-func_simple-chiral-weight-name \
--wandb_mode online \
--train_batch_size 8 \
--eval_batch_size 8 \
--epochs 250 \
--model_id GT4SD \
--weight_decay 0 \
--learning_rate 6e-4 \
--warmup_ratio 0.1 \
--check_val_every_n_epoch 20 \
--lr_scheduler_type linear \
--max_length 820 \
--generation_mode \
--max_new_tokens 512
python model/reasoning_generator.py \
--architecture multitask-text-and-chemistry-t5-small-standard \
--cot_mode multiset_formula-chain-aromatic-con_ring_name-func_simple-chiral-weight-name \
--wandb_mode online \
--train_batch_size 8 \
--eval_batch_size 8 \
--epochs 250 \
--model_id GT4SD \
--max_length 820 \
--generation_mode \
--max_new_tokens 256 \
--check_val_every_n_epoch 20 \
--weight_decay 0 \
--learning_rate 6e-4 \
--warmup_ratio 0 \
--lr_scheduler_type linear
python model/answer_generator.py \
--architecture multitask-text-and-chemistry-t5-small-standard \
--cot_mode multiset_formula-chain-aromatic-con_ring_name-func_simple-chiral-weight-name \
--select_cot_mode chain-aromatic-con_ring_name-func_simple-chiral \
--wandb_mode online \
--train_batch_size 8 \
--eval_batch_size 8 \
--epochs 250 \
--model_id GT4SD \
--max_length 820 \
--generation_mode \
--max_new_tokens 512 \
--check_val_every_n_epoch 20 \
--weight_decay 0 \
--learning_rate 6e-4 \
--warmup_ratio 0 \
--lr_scheduler_type linear \
--is_iterative
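The three commands above share most of their options, so it can be convenient to assemble them programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the option values are copied from the answer-generator command above. It assumes that `--generation_mode` and `--is_iterative` are value-less flags, as they appear in the commands.

```python
import shlex
import sys


def build_command(script, options):
    """Turn a dict of options into an argv list for one training script.

    Hypothetical helper: True means "emit the bare flag",
    False/None means "omit it", anything else becomes "--name value".
    """
    argv = [sys.executable, script]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


answer_cmd = build_command(
    "model/answer_generator.py",
    {
        "architecture": "multitask-text-and-chemistry-t5-small-standard",
        "cot_mode": "multiset_formula-chain-aromatic-con_ring_name-func_simple-chiral-weight-name",
        "select_cot_mode": "chain-aromatic-con_ring_name-func_simple-chiral",
        "wandb_mode": "online",
        "train_batch_size": 8,
        "eval_batch_size": 8,
        "epochs": 250,
        "model_id": "GT4SD",
        "max_length": 820,
        "generation_mode": True,
        "max_new_tokens": 512,
        "check_val_every_n_epoch": 20,
        "weight_decay": 0,
        "learning_rate": "6e-4",  # kept as a string to preserve the notation
        "warmup_ratio": 0,
        "lr_scheduler_type": "linear",
        "is_iterative": True,
    },
)

# Print the command for inspection; it is not executed here.
print(shlex.join(answer_cmd))
```

To actually launch the run from the repository root, the list can be passed to `subprocess.run(answer_cmd, check=True)`.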
Common arguments (all scripts)
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --architecture | str | multitask-text-and-chemistry-t5-small-standard | Model architecture (see choices below) |
| --cot_mode | str | varies | Structural features for chain-of-thought reasoning. Combine multiple features with - (see choices below) |
| --wandb_mode | str | disabled | WandB logging mode (e.g., online, disabled) |
| --learning_rate | float | varies | Learning rate |
| --train_batch_size | int | varies | Training batch size per device |
| --eval_batch_size | int | varies | Evaluation batch size per device |
| --weight_decay | float | varies | Weight decay for optimizer |
| --epochs | int | 100 | Number of training epochs |
| --check_val_every_n_epoch | int | varies | Run validation every N epochs |
| --max_length | int | 512 | Maximum input sequence length |
| --max_new_tokens | int | 512 | Maximum number of tokens to generate |
| --model_id | str | GT4SD | HuggingFace model ID. Choices: laituan245, QizhiPei, GT4SD |
| --dataset_name | str | lm | Dataset to use. Choices: molt5 (ChEBI-20), lm (LPM-24) |
| --warmup_ratio | float | 0 | Warmup ratio for learning rate scheduler |
| --lr_scheduler_type | str | linear | Learning rate scheduler type |
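To make the interaction between --learning_rate, --warmup_ratio, and --lr_scheduler_type linear concrete, here is a minimal sketch of a linear schedule with warmup. It assumes the common convention (linear ramp from 0 to the base rate over the warmup steps, then linear decay to 0); the scripts may compute the warmup step count or the decay differently, and `linear_lr` is an illustrative name, not a function from the repository.

```python
def linear_lr(step, total_steps, base_lr=6e-4, warmup_ratio=0.1):
    """Learning rate at `step` for a linear warmup + linear decay schedule.

    Assumed convention, not taken from the repository code.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 50, 100, 550, 1000):
        print(s, linear_lr(s, total_steps=1000))
```

With `warmup_ratio=0`, as in the text-to-molecule commands, the schedule under this assumption starts at the full base rate and decays from the first step.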
Answer module arguments (answer_generator.py only)
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --select_cot_mode | str | aromatic-con_ring_name-func_simple | Subset of CoT features to use from the reasoning output |
| --is_iterative | flag | False | Enable matching ratio-based rejection sampling with beam search |
| --num_iter | int | 5 | Number of beam search iterations (used with --is_iterative) |
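Because --select_cot_mode is described as a subset of the features in --cot_mode, a quick check before launching a long run can catch typos. The sketch below is an illustration only: `check_select_cot_mode` is a hypothetical helper, and whether answer_generator.py performs a similar validation itself is not stated here.

```python
def check_select_cot_mode(cot_mode, select_cot_mode):
    """Return the selected features, or raise if one is absent from cot_mode.

    Both arguments are '-'-separated feature strings, as passed on the CLI.
    """
    available = cot_mode.split("-")
    selected = select_cot_mode.split("-")
    missing = [f for f in selected if f not in available]
    if missing:
        raise ValueError(f"features not present in --cot_mode: {missing}")
    return selected


# Values copied from the answer-generator command in this README.
selected = check_select_cot_mode(
    "multiset_formula-chain-aromatic-con_ring_name-func_simple-chiral-weight-name",
    "chain-aromatic-con_ring_name-func_simple-chiral",
)
print(selected)
```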
| --architecture value | Description |
| --- | --- |
| multitask-text-and-chemistry-t5-small-standard | ChemT5-small (default) |
| multitask-text-and-chemistry-t5-base-standard | ChemT5-base |
| molt5-small | MolT5-small |
| molt5-base | MolT5-base |
| molt5-large | MolT5-large |
Combine features with - separator (e.g., chain-aromatic-func_simple).
| Feature | Description |
| --- | --- |
| multiset_formula | Molecular formula as multiset |
| chain | Carbon chain structure |
| aromatic | Aromatic ring information |
| con_ring_name | Ring names |
| func_simple | Functional groups |
| chiral | Chirality information |
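Since feature names contain only underscores, a --cot_mode string can be built and split on the hyphen without ambiguity. The following sketch shows one way to do that; `build_cot_mode` and `parse_cot_mode` are hypothetical helpers, and the feature list combines the table above with `weight` and `name`, which appear in the example commands but are not described in the table.

```python
# Feature names from the table above, plus "weight" and "name",
# which are used in the example training commands.
KNOWN_FEATURES = (
    "multiset_formula",
    "chain",
    "aromatic",
    "con_ring_name",
    "func_simple",
    "chiral",
    "weight",
    "name",
)


def build_cot_mode(features):
    """Join feature names with '-' after checking each one is recognized."""
    unknown = [f for f in features if f not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"unknown CoT features: {unknown}")
    return "-".join(features)


def parse_cot_mode(cot_mode):
    """Split a --cot_mode string back into its feature names."""
    return cot_mode.split("-")


print(build_cot_mode(["chain", "aromatic", "func_simple"]))
print(parse_cot_mode("multiset_formula-chain-aromatic-con_ring_name"))
```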
@inproceedings{jang-etal-2025-structural,
    title = "Structural Reasoning Improves Molecular Understanding of {LLM}",
    author = "Jang, Yunhui and
      Kim, Jaehyung and
      Ahn, Sungsoo",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1023/",
    doi = "10.18653/v1/2025.acl-long.1023",
    pages = "21016--21036",
    ISBN = "979-8-89176-251-0"
}