Abstract
Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could, in principle, discover superior strategies through exploration. We present the first comprehensive comparison of IL and RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% mAP for action triplet recognition and 33.6% mAP for next-frame prediction, with planning performance degrading smoothly to 29.2% mAP at a 10-second horizon. We evaluated three RL variants: world-model-based RL, direct video RL, and inverse-RL enhancement. Surprisingly, all three underperformed DARIL: world-model RL dropped to 3.1% mAP at 10 s, while direct video RL reached only 15.9%. Our analysis shows that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that deviate from the training demonstrations. This result challenges common assumptions about RL superiority in sequential decision making and offers practical insights for surgical AI development.
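The headline numbers above are mean average precision (mAP) over multi-label triplet predictions. As a minimal sketch of how such a metric is computed (the function names, shapes, and pure-Python implementation here are illustrative assumptions, not the paper's evaluation code):

```python
def average_precision(scores, labels):
    """AP for one triplet class: `scores` are per-frame confidences,
    `labels` are binary ground truth for that class."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precision_sum += hits / rank  # precision at each true positive
    return precision_sum / hits if hits else 0.0

def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes; rows are frames, columns are triplet classes.
    Real benchmarks often skip classes with no positives; this sketch
    scores them as 0 for simplicity."""
    n_classes = len(score_matrix[0])
    aps = [
        average_precision([row[c] for row in score_matrix],
                          [row[c] for row in label_matrix])
        for c in range(n_classes)
    ]
    return sum(aps) / len(aps)
```

A planning evaluation would apply this frame-wise at each prediction horizon (1 s, 10 s, ...), which is how a smooth degradation curve like 33.6% down to 29.2% mAP can be reported.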
Copyright information
© 2026 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Boels, M., Robertshaw, H., Booth, T.C., Dasgupta, P., Granados, A., Ourselin, S. (2026). DARIL: When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning. In: Dou, Q., Ban, Y., Jin, Y., Bano, S., Unberath, M. (eds) Collaborative Intelligence and Autonomy in Image-Guided Surgery. COLAS 2025. Lecture Notes in Computer Science, vol 16298. Springer, Cham. https://doi.org/10.1007/978-3-032-09784-2_18
Print ISBN: 978-3-032-09783-5
Online ISBN: 978-3-032-09784-2