Deploy LLM Models on OpenShift

Jan 17, 2026


Operators make life easier, but they are not always an option. In this post, I’ll walk through a practical way to deploy large language models on OpenShift without relying on the OpenShift AI or NVIDIA operators. The approach uses llama.cpp as a lightweight runtime engine and runs a quantized GGUF model, granite-4.0-h-350m-Q4_K_M.gguf, to enable efficient inference with minimal dependencies.

This is not intended to replace production-grade solutions. Platforms like OpenShift AI, which leverage KServe in the background and provide a rich set of optimized runtimes for different AI frameworks, are better suited for enterprise-scale deployments. Instead, this approach focuses on development, experimentation, and environments where only native Kubernetes and OpenShift resources are available or desired.

Why This Model?

The model choice in this setup is driven primarily by infrastructure constraints and deployment simplicity, rather than raw model size or accuracy. The selected model — granite-4.0-h-350m-Q4_K_M.gguf — is a quantized GGUF model, making it well-suited for environments that rely solely on CPU-based inference.

Thanks to its compatibility with llama.cpp and the GGML backend, the model can run efficiently without GPU acceleration while maintaining reasonable performance and memory usage. Quantization significantly reduces the model footprint, allowing it to operate within the limits of a modest OpenShift cluster that does not provide GPU-enabled nodes.
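To make the footprint reduction concrete, here is a back-of-envelope estimate. The bit widths are assumptions on my part, not figures from this setup: Q4_K_M averages roughly 4.5 bits per weight, versus 16 bits for an unquantized FP16 model.

```python
# Rough estimate of weight-storage size for a 350M-parameter model.
# Assumed bit widths: ~4.5 bits/weight for Q4_K_M, 16 bits/weight for FP16.
PARAMS = 350_000_000

def footprint_mb(bits_per_weight: float, params: int = PARAMS) -> float:
    """Approximate weight storage in mebibytes (MiB)."""
    return params * bits_per_weight / 8 / 1024**2

print(f"FP16:   {footprint_mb(16):.0f} MiB")   # roughly 668 MiB
print(f"Q4_K_M: {footprint_mb(4.5):.0f} MiB")  # roughly 188 MiB
```

Even as a rough sketch, this shows why the quantized file fits comfortably in a pod with a few GiB of memory, leaving room for the KV cache and runtime overhead.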

This makes the model a practical choice for development, testing, and lightweight inference workloads, especially in clusters where adding GPUs or specialized operators is not feasible.

Deployment Procedure

The deployment is intentionally kept simple, relying only on native OpenShift and Kubernetes resources. The process is divided into three main steps:

  1. Build the model-serving container image
    We start by creating a container image that serves the LLM using llama.cpp. A multi-stage Docker build is used to compile the runtime and package the quantized model efficiently, resulting in a minimal and portable image. This approach helps reduce image size, improve build reproducibility, and avoid unnecessary runtime dependencies. The details of this build process are covered in the next section.
  2. Deploy the model on OpenShift
    Once the image is available, we create a standard Kubernetes Deployment that runs the model server. The deployment defines resource requirements, container arguments, and basic configuration needed to start the llama.cpp inference service. No custom resources or operators are involved — only a familiar Kubernetes deployment.
  3. Expose and validate the model endpoint
    Finally, the deployment is exposed using a Kubernetes Service and ingress (and optionally an OpenShift Route) to make the model API accessible. With the endpoint in place, we can send test requests to verify that the model is running correctly and responding as expected.

By following these steps, you’ll have a fully functional LLM inference endpoint running on OpenShift, suitable for development, testing, or environments where operator-based solutions are not available.

Build the Model-Serving Container Image

To keep the final container image secure, lightweight, and free of unnecessary build-time dependencies, we use a multi-stage Docker build. This approach allows us to separate model acquisition from model serving, ensuring that only what is strictly required at runtime is included in the final image.

Stage 1: The Builder

The first stage is responsible solely for downloading and validating the model artifacts.

  • Base Image: registry.access.redhat.com/ubi9/python-311
  • Purpose: Securely fetch the quantized model weights.
  • Process:
    1. Install the huggingface-hub Python package.
    2. Execute a small helper script (download_model.py) to download the required .gguf model file.
    3. Verify the integrity of the downloaded file to ensure it has not been corrupted or tampered with.

This stage includes Python tooling and dependencies, but none of them are carried forward into the final image.
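The helper script shown later handles the download itself; the integrity check in step 3 could be implemented as a streamed SHA-256 comparison. A minimal sketch follows; the expected checksum is a placeholder you would replace with the hash published alongside the model file:

```python
import hashlib

# Placeholder: substitute the checksum published for the .gguf file.
EXPECTED_SHA256 = "0" * 64

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so large .gguf files never load into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, expected: str = EXPECTED_SHA256) -> None:
    """Raise if the downloaded file does not match the expected checksum."""
    actual = sha256_of(path)
    if actual != expected:
        raise RuntimeError(f"Checksum mismatch for {path}: {actual}")
```

Failing the build on a mismatch means a corrupted or tampered artifact never makes it into the final image.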

Stage 2: The Runtime (Final Image)

The second stage produces the final runnable image that will be deployed on OpenShift.

  • Base Image: ghcr.io/ggml-org/llama.cpp:server
  • Purpose: Run the LLM inference engine using llama.cpp.
  • Process:
    1. Copy only the required .gguf model file from the builder stage.
    2. Discard all build-time dependencies, including Python packages, pip caches, and intermediate files.
    3. Set the container entry point to start the llama-server process.

By isolating the model download in the builder stage and keeping the runtime image minimal, this design improves security, reduces image size, and aligns well with OpenShift best practices for containerized workloads.

FROM registry.access.redhat.com/ubi9/python-311:latest AS builder

USER root

# Install the huggingface library
RUN pip install huggingface-hub

WORKDIR /build

# Copy the python script
COPY download_model.py .

# Run the download (resulting file will be in /build/granite-4.0-h-350m-Q4_K_M.gguf)
RUN python download_model.py

FROM ghcr.io/ggml-org/llama.cpp:server

# Copy ONLY the model file from the builder stage
COPY --from=builder /build/granite-4.0-h-350m-Q4_K_M.gguf /model.gguf

# Expose the API port
EXPOSE 8080

# Command to start the server with the copied model
# --host 0.0.0.0: Listen on all interfaces
# --port 8080: OpenAI compatible port
CMD ["-m", "/model.gguf", "--host", "0.0.0.0", "--port", "8080"]

The download_model.py script is responsible for securely and efficiently acquiring the specific model file required:

from huggingface_hub import hf_hub_download

# Define the model repo and specific file
repo_id = "ibm-granite/granite-4.0-h-350m-GGUF"
filename = "granite-4.0-h-350m-Q4_K_M.gguf"

print(f"Downloading {filename} from {repo_id}...")

# Download the file to the local cache
model_path = hf_hub_download(
    repo_id=repo_id,
    filename=filename,
    local_dir=".",  # Download to current directory
    local_dir_use_symlinks=False,  # Ensure it's a real file, not a symlink
)

We use the official Hugging Face Python library. This is more robust than using curl or wget because it handles authentication (if needed), redirects, and verification automatically.

podman build -t granite-api .

Deploy the Model on OpenShift

With the model-serving image built, the next step is to deploy it using a standard Kubernetes Deployment. This keeps the setup simple and avoids any dependency on custom resources or operators.

Deployment Configuration

A. The Deployment

  • Replicas: 1
    The deployment starts with a single replica, which is sufficient for development and testing. The workload can be scaled horizontally on demand; however, scaling LLM inference requires load-balancing changes (for example, session affinity) so that requests from the same session keep hitting the same replica and can reuse its KV cache.
  • Resources:
    Resource requests and limits are defined to ensure predictable scheduling and prevent resource starvation:
  • Requests:
    CPU: 2, Memory: 4Gi
  • Limits:
    CPU: 4, Memory: 8Gi
  • This configuration provides enough headroom for CPU-based inference while keeping the workload bounded within the cluster’s available resources.
  • Health Checks:
    Probes are configured to improve reliability and ensure the service behaves correctly within OpenShift:
  • Readiness Probe:
    Ensures the model is fully loaded and ready to serve requests before the pod is added to the service endpoints.
  • Liveness Probe:
    Monitors the API responsiveness and triggers a container restart if the inference server becomes unresponsive.

This deployment model aligns well with CPU-only OpenShift clusters and provides a solid baseline for running lightweight LLM inference workloads.
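The manifest below uses simple TCP probes. As an alternative (my assumption, not part of the original setup), recent llama.cpp server builds expose a /health HTTP endpoint, so an httpGet probe can check the API layer itself rather than just the open socket:

```yaml
# Hedged alternative: httpGet probes against llama.cpp's /health endpoint.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
```

This matters during model loading: the TCP port can accept connections before the model is fully loaded, whereas /health only reports ready once the server can actually serve inference.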

apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite-deployment
  labels:
    app: granite-api
spec:
  replicas: 1 # Number of pods to run
  selector:
    matchLabels:
      app: granite-api
  template:
    metadata:
      labels:
        app: granite-api
    spec:
      containers:
        - name: granite-container
          image: granite-api:latest # Change this to match your image name
          imagePullPolicy: IfNotPresent
          args:
            - "-m"
            - "/model.gguf"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8080"
            - "--ctx-size"
            - "8192"
          ports:
            - containerPort: 8080
              name: http
          # Resources: adjusted for the 350M model with 8192 context
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          # Health checks
          readinessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20

oc apply -f granite-deploy.yaml

Expose the Model Endpoint

Once the deployment is running, the model API needs to be exposed so it can be accessed and tested. This is done in two layers: an internal Service and an external Route.

B. The Service

The Service provides a stable internal endpoint for the model server and abstracts away the dynamic nature of pod IP addresses.

  • Type: ClusterIP
  • Function:
    Acts as a consistent internal access point for the deployment, allowing other pods or OpenShift components to communicate with the model API without needing to know individual pod IPs.

C. The Route

The Route exposes the Service externally through the OpenShift router.

  • Function:
    Makes the model API accessible via a public URL.
  • Security:
    Uses Edge TLS termination, ensuring traffic is encrypted between the client and the OpenShift router while keeping the backend service configuration simple.

apiVersion: v1
kind: Service
metadata:
  name: granite-service
spec:
  selector:
    app: granite-api
  ports:
    - name: http
      protocol: TCP
      port: 80         # The port other pods use to talk to this service
      targetPort: 8080 # The actual port on the container
  type: ClusterIP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: granite-route
spec:
  to:
    kind: Service
    name: granite-service
    weight: 100
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

oc apply -f granite-expose.yaml

Verification Test

With the Service and Route in place, the final step is to verify that the model is reachable and responding correctly. The llama.cpp server exposes an OpenAI-compatible API, which allows us to test inference using a simple HTTP request.

Replace <YOUR-ROUTE-URL> with the hostname generated by the OpenShift Route, then run the following command:

curl https://<YOUR-ROUTE-URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-350m",
        "messages": [{ "role": "user", "content": "How many pyramids are in Egypt?" }]
      }'

If the deployment is successful, the API will return a JSON response containing the model’s generated answer. This confirms that the model has been loaded correctly, the inference server is running, and the endpoint is accessible through the OpenShift Route.
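The same endpoint can also be exercised programmatically. Below is a minimal client sketch using only the Python standard library; the route hostname in the example is hypothetical, and the payload mirrors the curl request above:

```python
import json
import urllib.request

def ask(base_url: str, prompt: str, model: str = "granite-350m") -> str:
    """Send one chat-completion request to the llama.cpp server
    and return the model's reply text."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]

# Example (hypothetical route hostname):
# print(ask("https://granite-route-myproject.apps.example.com",
#           "How many pyramids are in Egypt?"))
```

Because the API is OpenAI-compatible, any OpenAI-style client should work the same way; the helper above just avoids an extra dependency.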

Conclusion

This article demonstrated a lightweight and practical approach to deploying a large language model on OpenShift without relying on specialized operators or GPU acceleration. By using a quantized GGUF model with llama.cpp, a multi-stage container build, and standard Kubernetes resources, it’s possible to run LLM inference efficiently even in constrained, CPU-only environments.

While this setup is well suited for development, experimentation, and operator-restricted clusters, it’s not a replacement for production-grade platforms like OpenShift AI and KServe, which provide advanced model management, scaling, and observability capabilities. Instead, it offers a flexible alternative when simplicity, portability, or minimal dependencies are the primary goals.

Ultimately, this approach shows that running LLMs on OpenShift doesn’t always require complex tooling — with the right model and runtime choices, core Kubernetes primitives can be enough to get started.

Interesting Links

How to Use llama.cpp to Run LLaMA Models Locally | Codecademy
GitHub — ggml-org/llama.cpp: LLM inference in C/C++
vLLM or llama.cpp: Choosing the right LLM inference engine for your use case | Red Hat Developer

Written by Ahmed Draz

Systems Engineer who works a lot with containers.
