Franz Louis Cesista

https://leloykun.github.io/
Recent content on Franz Louis Cesista
Thu, 11 Jun 2026 00:00:00 +0000

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
https://leloykun.github.io/papers/lora-muon/
Thu, 11 Jun 2026 00:00:00 +0000
Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-2 proxy recovers the dense best tested learning rate, and a rank-32 LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update.
LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.

Lean4-TileLang Tensor Program Superoptimizer [WIP]
https://leloykun.github.io/ponder/lean4-tilelang/
Mon, 11 May 2026 00:00:00 +0000
Deriving Flash Attention 2, FlashNorm, and other 'flash' kernels automatically with Lean4, lowered to TileLang.

Frequency Domain Muon for Convolutional Neural Networks: Simplified
https://leloykun.github.io/ponder/freqmuon/
Fri, 06 Mar 2026 00:00:00 +0000
The 'correct' way to apply Muon to CNNs

LUCID-MoE: Mixture of Experts with Preconditioned Routing
https://leloykun.github.io/ponder/lucid-moe/
Tue, 17 Feb 2026 00:00:00 +0000
Sharper Mixture-of-Experts routing with LUCID preconditioning.

Error-Compensating Optimizers: ECO-AdamW, ECO-Muon, and Beyond
https://leloykun.github.io/ponder/eco/
Tue, 10 Feb 2026 00:00:00 +0000
Quantized training without full-precision master weights, extended to handle weight decay and matrix LMOs.

Bidirectional-PRISM: Kronecker-Factored Optimization via Anisotropic Spectral Shaping
https://leloykun.github.io/ponder/shampoo-prism/
Wed, 04 Feb 2026 00:00:00 +0000
A novel optimizer that combines Shampoo-style preconditioning with PRISM's anisotropic spectral shaping to adaptively suppress noisy gradient directions while maximally descending under the spectral norm trust-region constraint.

Steepest Descent on Affine-Conic Representable Manifolds with Boundary via Dual Ascent
https://leloykun.github.io/ponder/steepest-descent-affine-conic/
Fri, 09 Jan 2026 00:00:00 +0000
Novel optimizers for maximally descending on the loss landscape while satisfying strict weight constraints.

Steepest Descent on the Birkhoff Polytope Equipped with the Spectral Norm
https://leloykun.github.io/ponder/steepest-descent-doubly-stochastic/
Sun, 04 Jan 2026 00:00:00 +0000
We derive an optimizer that performs steepest descent on the Birkhoff polytope equipped with the spectral norm via dual ascent. We show that it yields larger effective weight updates than naive LMO-based optimizers.

Convergence Bounds for Steepest Descent Under Arbitrary Norms
https://leloykun.github.io/ponder/steepest-descent-convergence/
Thu, 11 Dec 2025 00:00:00 +0000
First-order optimization under arbitrary norms with Nesterov momentum (and decoupled weight decay) yields a universal convergence bound. Our results generalize to norms not induced by inner products and also consider batch size.

Critical Batch Size for Steepest Descent Under Arbitrary Norms
https://leloykun.github.io/ponder/steepest-descent-crit-bz/
Sat, 22 Nov 2025 00:00:00 +0000
First-order optimization under arbitrary norms with Nesterov momentum (and decoupled weight decay) yields universal critical batch size scaling laws. Under an additional local-LMO assumption, the same analysis also heuristically supports square-root learning-rate scaling with batch size.

Rethinking Maximal Update Parametrization: Steepest Descent on Finsler-Structured (Matrix) Geometries via Dual Ascent
https://leloykun.github.io/ponder/steepest-descent-finsler-dual-ascent/
Wed, 29 Oct 2025 00:00:00 +0000
To guarantee fast and robust model training, we can recast the optimization problem as steepest descent on Finsler-structured geometries.
Here we show how to compute the optimal updates via dual ascent.

Rethinking Maximal Update Parametrization: Steepest Descent on the Spectral Ball
https://leloykun.github.io/ponder/rethinking-mup-spectral-ball/
Wed, 15 Oct 2025 00:00:00 +0000
Novel optimizers for maximally updating both the weights and activations of neural networks while keeping weight norms under control. To get there, we needed to invent an efficient, GPU/TPU-friendly method for eigenvalue clipping and solve the Steepest Descent problem on the Positive Semidefinite Cone, Convex Spectrahedron, and finally on the Spectral Ball.

Steepest Descent on Finsler-Structured (Matrix) Manifolds
https://leloykun.github.io/ponder/steepest-descent-finsler/
Wed, 20 Aug 2025 00:00:00 +0000
Fast and robust model training.

Heuristic Solutions for Steepest Descent on the Stiefel Manifold
https://leloykun.github.io/ponder/steepest-descent-stiefel/
Fri, 18 Jul 2025 00:00:00 +0000
What would Muon look like if we constrained the weights to be semi-orthogonal?

Training Transformers with Enforced Lipschitz Bounds
https://leloykun.github.io/papers/lipschitz-transformers/
Thu, 17 Jul 2025 00:00:00 +0000
Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization.
To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods -- weight decay and spectral normalization -- allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches 60% validation accuracy. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.

Sensitivity and Sharpness of n-Simplicial Attention
https://leloykun.github.io/ponder/lipschitz-n-simplical-transformer/
Sun, 06 Jul 2025 00:00:00 +0000
Towards a maximal update parameterization of n-simplicial attention

Adam with Aggressive Gradient Clipping ≈ Smoothed SignSGD/NormSGD
https://leloykun.github.io/ponder/adam-aggressive-clipping/
Thu, 03 Jul 2025 00:00:00 +0000
Why does Adam with aggressive gradient value/norm clipping have sparse updates and do well with higher learning rates?
Here we show that it asymptotically matches Smoothed SignSGD in the value-clipping limit and tracks a rescaled Smoothed NormSGD direction in the norm-clipping limit.

Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration
https://leloykun.github.io/ponder/spectral-clipping/
Mon, 23 Jun 2025 00:00:00 +0000
A small step towards hardware-architecture-optimizer codesign in deep learning.

Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds
https://leloykun.github.io/ponder/steepest-descent-non-riemannian/
Thu, 03 Apr 2025 00:00:00 +0000
Muon from first principles, what makes it different from other optimizers, and why it works so well.

Napkin Math on Non-Euclidean Trust Region Optimization
https://leloykun.github.io/ponder/napkin-math-trust-region-opt/
Mon, 24 Mar 2025 00:00:00 +0000
A possible reason why Muon converges faster & does better at higher learning rates than Adam.

Block Matrix Formulation of Linear Attention Mechanisms
https://leloykun.github.io/ponder/blockmat-linear-attn/
Sun, 16 Mar 2025 00:00:00 +0000
The block matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.

Steepest Descent Under Schatten-p Norms
https://leloykun.github.io/ponder/steepest-descent-schatten-p/
Thu, 27 Feb 2025 00:00:00 +0000
Why Muon still works despite not perfectly semi-orthogonalizing the gradients.

Squeezing 1-2% Efficiency Gains Out of Muon by Optimizing the Newton-Schulz Coefficients
https://leloykun.github.io/ponder/muon-opt-coeffs/
Fri, 21 Feb 2025 00:00:00 +0000
Simply switching to Muon can already get
you 2x efficiency gains. But you can squeeze out an extra 1-2% by optimizing the Newton-Schulz coefficients.

CASPR Without Accumulation is Muon
https://leloykun.github.io/ponder/caspr-wo-accum-is-muon/
Thu, 13 Feb 2025 00:00:00 +0000
The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.

GRPO's Main Flaw
https://leloykun.github.io/ponder/grpo-flaw/
Tue, 11 Feb 2025 00:00:00 +0000
GRPO may not be the best choice for training reasoning models. Here's why.

(Linear) Attention as Test-Time Regression
https://leloykun.github.io/ponder/test-time-regression/
Mon, 27 Jan 2025 00:00:00 +0000
A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.

Deep Learning Optimizers as Steepest Descent in Normed Spaces
https://leloykun.github.io/ponder/steepest-descent-opt/
Sun, 20 Oct 2024 00:00:00 +0000
Instead of asking, 'Which optimizer should I use?' ask, 'In which space do my features live?'

Multimodal Structured Generation
https://leloykun.github.io/personal-projects/mmsg/
Sun, 14 Jul 2024 00:00:00 +0000
Generate interleaved text and image content in a structured format you can directly pass to downstream APIs.

Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
https://leloykun.github.io/papers/mmsg/
Mon, 17 Jun 2024 00:00:00 +0000
Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited.
They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks, and that simple engineering can beat expensive & complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use.

Flash Attention Minimal
https://leloykun.github.io/personal-projects/flash-attention-minimal/
Tue, 16 Apr 2024 00:00:00 +0000
A minimal implementation of Flash Attention 1 & 2 in just ~350 lines of CUDA code.

Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use
https://leloykun.github.io/papers/rasg/
Mon, 15 Apr 2024 00:00:00 +0000
Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems.
We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state-of-the-art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks. The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs + RASG is oftentimes superior given real-world applications and constraints of BDIE.

ChatGPT May Have Developed Seasonal Depression
https://leloykun.github.io/ponder/chatgpt-seasonal-depression/
Sat, 16 Dec 2023 00:00:00 +0000
Could ChatGPT's shorter responses be an indication of something more bizarre going on?

The Human Mind May Be Universal
https://leloykun.github.io/ponder/human-mind-universality/
Sun, 10 Dec 2023 00:00:00 +0000
Years of experience in building artificial minds led me to believe that these AIs may end up seeming more 'human' than we currently imagine them to be.

Llama.cpp
https://leloykun.github.io/personal-projects/llama.cpp/
Tue, 25 Jul 2023 00:00:00 +0000
A C++ implementation of Meta's Llama2 generative large-language model.
I also optimized the original C implementation by Karpathy by adding parallelization on the multi-head attention layer.

Expedock Assistant: ChatGPT Applied to Logistics Data
https://leloykun.github.io/personal-projects/expedock-assistant/
Tue, 31 Jan 2023 00:00:00 +0000
Expedock Assistant is a chatbot that allows you to ask questions about your shipments and get answers in real time. It’s like having a personal assistant that knows everything about your business, shipments and industry.

Expedock AutoML
https://leloykun.github.io/personal-projects/expedock-automl/
Mon, 25 Jul 2022 00:00:00 +0000
Expedock's AutoML Library -- fit a model, run batch inference, and get explanations in one line of code each.

Vaccine Search as a Computational Problem
https://leloykun.github.io/ponder/vaccine-search-as-comp-prob/
Sat, 06 Feb 2021 00:00:00 +0000
A thought dump on mRNA vaccines and the future of computational biology

Booking Demand Prediction for Grab SEA
https://leloykun.github.io/personal-projects/grab-booking-demand-prediction/
Sun, 16 Jun 2019 00:00:00 +0000
Booking demand prediction for Grab's Southeast Asia operations. The project involves spatio-temporal forecasting, anomaly detection, and econometric modeling.

Codeball 2018
https://leloykun.github.io/personal-projects/codeball/
Thu, 24 Jan 2019 00:00:00 +0000
My entry for the World Finals of the Russian AI Cup 2018 - Codeball.
A 3D physics-aware orchestrator of a pair of bots in a Rocket League-esque soccer game.

Codewars 2017
https://leloykun.github.io/personal-projects/codewars/
Mon, 12 Feb 2018 00:00:00 +0000
My entry for the World Finals of the Russian AI Cup 2017 - Codewars. A particle swarm-based AI that uses potential flows and fluid mechanics to direct units in a Command-and-Conquer-esque game.

Ateneo's Competitive Programming Varsity's Code Library
https://leloykun.github.io/personal-projects/progvar-library/
Mon, 01 Jan 0001 00:00:00 +0000
A collection of algorithms, data structures and other useful information for competitive programming. Used and maintained by members of the Ateneo de Manila University Programming Varsity.