In-person
24-26 March, 2026

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is displayed in Central European Time (CET) (UTC +1). The schedule is subject to change and session seating is available on a first-come, first-served basis.

Thursday March 26, 2026 14:30 - 15:00 CET
Achieving scalable and fault-tolerant distributed AI model training that runs efficiently across multiple nodes remains a key challenge for platform administrators and ML engineers. This problem is further exacerbated by interactive GPU workloads, such as Jupyter notebooks, that generate intermittent computations followed by idle periods while users refine their code and explore the results.

This talk will present how transparent GPU checkpointing can be integrated with Kubernetes to improve both cost efficiency and cluster utilization for distributed AI workloads. By automatically capturing and restoring the state of training jobs, this approach enables seamless recovery from preemptions or failures. This session will also explore how checkpoint policies integrate with the Kueue, JobSet, and TrainJob APIs for Kubernetes-native, infrastructure-level checkpointing of GPU workloads, so that users can run reliable, cost-effective AI model training on preemptible spot instances.
Speakers
Andrey Velichkevich

AI Engineer, Apple
Andrey Velichkevich is a Senior Software Engineer at Apple and a key contributor to the Kubeflow open-source project. He is a member of the Kubeflow Steering Committee and a co-chair of the Kubeflow AutoML and Training WG. Additionally, Andrey is an active member of the CNCF WG AI. He...
Viktória Spišaková

PhD student & IT architect jr, Masaryk University
PhD student at FI MUNI and IT architect jr. at ICS MUNI, with several years of experience providing Kubernetes at e-INFRA CZ.
Radostin Stoyanov

PhD Student, University of Oxford
Radostin Stoyanov is a PhD student in the Scientific Computing research group at the University of Oxford. His research focuses on improving the resilience and performance of HPC and cloud computing systems.
Elicium 2
  AI + ML