INSTRUCTOR-LED COURSE

Deploying Small Language Models (LFWS307)

Prepare for high-impact roles in MLOps and AI infrastructure by mastering real-world small language model deployment. Deploy SLMs across laptop, server, edge, and browser environments using Hugging Face, llamafile, and PAIML.

Who Is It For

For MLOps engineers, backend engineers, platform engineers, and developers shipping AI in production who need a portable, production-ready approach to running small language models across laptop, server, edge, and browser targets.
What You’ll Learn

Learn how to deploy small language models end to end—from sourcing and packaging models to serving, scaling, and monitoring production workloads—using Hugging Face, llamafile, and the PAIML Rust stack, including RAG pipelines, streaming APIs, browser-based WASM deployment, and observability.
What It Prepares You For

Position yourself for emerging AI career opportunities by mastering end-to-end SLM deployment across server, edge, and browser environments and building scalable, cost-efficient AI with Phi, Gemma, Llama, Qwen, and Mistral.
Course Outline
Course Introduction
Hugging Face Model Ecosystem
Lab 2.1. Download Phi-3-mini and Qwen2.5-1.5B. Compare model cards, licenses, and file sizes. Convert safetensors to GGUF.
Llamafile: Zero-Dependency Deployment
Lab 3.1. Create llamafile from Phi-3-mini GGUF. Test CLI completion and HTTP API. Benchmark tokens/sec on CPU vs GPU.
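The throughput figure Lab 3.1 asks for reduces to simple arithmetic once you have a token count and a wall-clock timing. A minimal sketch (the generated-token count and elapsed time would come from the llamafile CLI or its HTTP API; the numbers below are illustrative only):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s

# Example: 256 tokens generated in 8 seconds of wall-clock time.
print(tokens_per_second(256, 8.0))  # 32.0 tok/s
```

Run the same prompt with and without GPU offload and compare the two figures.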
Quantization with llama.cpp
Lab 4.1. Quantize Qwen2.5-1.5B to Q4/Q5/Q8. Benchmark size, speed, and perplexity. Select optimal quantization for 8GB RAM target.
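Before quantizing in Lab 4.1, it helps to estimate what each quantization level costs in disk and memory. A rough sketch, assuming approximate bits-per-weight figures for common llama.cpp quant types (real GGUF files carry per-block scales and metadata, so actual sizes run somewhat higher):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters x bits/weight, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate bits/weight (assumption; check llama-quantize output for
# the exact figures your build reports).
QUANTS = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

for name, bpw in QUANTS.items():
    print(f"{name}: ~{gguf_size_gb(1.5e9, bpw):.2f} GB")  # 1.5B-param model
```

For an 8GB RAM target you also need headroom for the KV cache and the OS, which is why Q4/Q5 variants are the usual choice at this model size.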
Llamafile HTTP Serving
Lab 5.1. Deploy llamafile server. Build Python/curl client. Test streaming completions. Load test with 10 concurrent users.
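The streaming completions in Lab 5.1 arrive as server-sent events in the OpenAI-style `data: {...}` format that llamafile's server mirrors. A minimal sketch of the client-side parsing step (in a real client these lines would come from iterating over the HTTP response body):

```python
import json

def extract_stream_text(sse_lines):
    """Reassemble the incremental text from OpenAI-style SSE chunks."""
    out = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content", ""))
    return "".join(out)

chunks = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(extract_stream_text(chunks))  # Hello
```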
Production Serving with Batuta
Lab 6.1. Build Batuta serving pipeline. Compare latency vs llamafile. Achieve <100ms p99 with continuous batching.
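The `<100ms p99` target in Lab 6.1 means 99% of requests complete in under 100ms. A minimal nearest-rank percentile sketch for checking that from load-test timings (the sample latencies are illustrative only):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest sample >= p% of all samples."""
    ranked = sorted(latencies_ms)
    k = math.ceil(p / 100 * len(ranked))
    return ranked[max(k - 1, 0)]

samples = [42, 55, 61, 48, 97, 50, 52, 45, 58, 103]
print(percentile(samples, 99))  # 103 -- worst-case tail, not the median
print(percentile(samples, 50))  # 52
```

Tail percentiles, not averages, are what continuous batching is meant to improve: batching amortizes prefill cost so slow outliers shrink.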
RAG with Patcha + Hugging Face Embeddings
Lab 7.1. Index 1000 docs using all-MiniLM-L6-v2 embeddings. Build RAG pipeline with Phi-3. Compare RAG vs pure generation accuracy.
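The retrieval half of Lab 7.1 boils down to cosine similarity between a query embedding and the indexed document embeddings. A toy sketch with 2-dimensional vectors (the lab's all-MiniLM-L6-v2 embeddings are 384-dimensional, and Patcha would handle the indexing; this only illustrates the ranking step):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], docs, k=2))  # [0, 1]
```

The retrieved chunks are then prepended to the prompt sent to Phi-3, which is what the lab compares against pure generation.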
Edge Deployment
Lab 8.1. Deploy Q4 quantized model to ARM device (or emulator). Achieve interactive inference with 4GB RAM constraint.
Browser Deployment with Presentar
Lab 9.1. Deploy Phi-3 Q4 to browser via Presentar. Achieve <500ms first-token latency. Build chat interface with streaming.
Monitoring with Entrenar
Kubernetes Deployment
Capstone: Multi-Target Deployment
Course Summary

Prerequisites
Knowledge/Skills Prerequisites:

Learners should have Linux command line proficiency, a basic understanding of large language models (including prompts, tokens, and inference), and familiarity with HTTP/REST API concepts. Recommended but not required: basic Rust knowledge (helpful for customizing the PAIML stack) and Docker fundamentals (useful for understanding container-based alternatives).

Lab Environment Prerequisites:

  • Linux/macOS/WSL2
  • 16GB RAM, 50GB disk
  • Optional: NVIDIA GPU 8GB+ VRAM