Abstract
Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could, in principle, discover superior strategies through exploration. We present the first comprehensive comparison of IL and RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% mAP for action triplet recognition and 33.6% mAP for next-frame prediction, with planning performance degrading smoothly to 29.2% mAP at a 10-second horizon. We evaluated three RL variants: world-model-based RL, direct video RL, and inverse-RL enhancement. Surprisingly, all three underperformed DARIL: world-model RL dropped to 3.1% mAP at 10 s, while direct video RL reached only 15.9%. Our analysis shows that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that deviate from the training demonstrations. This result challenges common assumptions about RL superiority in sequential decision making and offers practical insights for surgical AI development.
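The headline numbers above are mean average precision (mAP) over multi-label triplet predictions. As a minimal sketch of how such a metric is computed (the function names, shapes, and pure-Python implementation here are illustrative assumptions, not the paper's evaluation code):

```python
def average_precision(scores, labels):
    """AP for one triplet class: `scores` are per-frame confidences,
    `labels` are binary ground truth for that class."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precision_sum += hits / rank  # precision at each true positive
    return precision_sum / hits if hits else 0.0

def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes; rows are frames, columns are triplet classes.
    Real benchmarks often skip classes with no positives; this sketch
    scores them as 0 for simplicity."""
    n_classes = len(score_matrix[0])
    aps = [
        average_precision([row[c] for row in score_matrix],
                          [row[c] for row in label_matrix])
        for c in range(n_classes)
    ]
    return sum(aps) / len(aps)
```

A planning evaluation would apply this frame-wise at each prediction horizon (1 s, 10 s, ...), which is how a smooth degradation curve like 33.6% down to 29.2% mAP can be reported.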
Copyright information
© 2026 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Boels, M., Robertshaw, H., Booth, T.C., Dasgupta, P., Granados, A., Ourselin, S. (2026). DARIL: When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning. In: Dou, Q., Ban, Y., Jin, Y., Bano, S., Unberath, M. (eds) Collaborative Intelligence and Autonomy in Image-Guided Surgery. COLAS 2025. Lecture Notes in Computer Science, vol 16298. Springer, Cham. https://doi.org/10.1007/978-3-032-09784-2_18
Print ISBN: 978-3-032-09783-5
Online ISBN: 978-3-032-09784-2