Get started with vLLM TPU
Google Cloud TPUs (Tensor Processing Units) are purpose-built accelerators for machine learning workloads. vLLM supports TPU v6e and v5e. For architecture details, supported topologies, and more, see TPU System Architecture and the version-specific pages for v5e and v6e.
Requirements
- Google Cloud TPU VM: Access to a TPU VM. For setup instructions, see the Cloud TPU Setup guide; a quick way to verify access is sketched after this list.
- TPU versions: v6e, v5e
- Python: 3.11 or newer (3.12 used in examples).
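If you already have a TPU VM provisioned, a quick sanity check is to list it and open an SSH session with the gcloud CLI. This is only an illustrative sketch: the VM name and zone below are placeholders, so substitute the values you used when creating the TPU.

```bash
# List TPU VMs in the zone to confirm the VM exists and is in the READY state.
# "us-east5-b" is a placeholder zone; use the zone your TPU was created in.
gcloud compute tpus tpu-vm list --zone=us-east5-b

# Open an SSH session on the TPU VM ("my-tpu-vm" is a placeholder name).
gcloud compute tpus tpu-vm ssh my-tpu-vm --zone=us-east5-b
```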
Installation
For detailed steps on installing vllm-tpu with pip or running it as a Docker image, please see the Installation Guide.
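As a rough sketch of the pip route, you can install the package into a fresh virtual environment on the TPU VM. The exact package requirements and any additional index URLs are covered in the Installation Guide, so treat this as a minimal example rather than the authoritative procedure.

```bash
# Create and activate an isolated environment (Python 3.12, matching the examples here).
python3.12 -m venv ~/vllm-env
source ~/vllm-env/bin/activate

# Install the TPU build of vLLM; see the Installation Guide for the full, up-to-date steps.
pip install vllm-tpu
```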
Run the vLLM Server
After installing vllm-tpu, you can start the OpenAI-compatible API server.
- Log in to Hugging Face: You'll need a Hugging Face token to download models.
```bash
# Replace YOUR_TOKEN with your Hugging Face access token.
export TOKEN=YOUR_TOKEN
git config --global credential.helper store
huggingface-cli login --token $TOKEN
```
- Launch the Server: The following command starts the server with the Llama-3.1-8B model.
```bash
vllm serve "meta-llama/Llama-3.1-8B" \
    --download_dir /tmp \
    --disable-log-requests \
    --tensor_parallel_size=1 \
    --max-model-len=2048
```
- Send a Request:
Once the server is running, you can send it a request using curl:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "Hello, my name is",
        "max_tokens": 20,
        "temperature": 0.7
    }'
```
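If the request fails, or you simply want to confirm the deployment first, you can list the models the server has registered. You can also ask for a streamed response so tokens are printed as they are generated. Both calls below assume the server from the previous step is still running on localhost:8000.

```bash
# Check that the server is up and that the model is registered.
curl http://localhost:8000/v1/models

# The same completion request as above, but streamed: the server returns tokens
# incrementally as server-sent events instead of a single JSON response.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "Hello, my name is",
        "max_tokens": 20,
        "temperature": 0.7,
        "stream": true
    }'
```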
Next steps
Check out complete, end-to-end example recipes in the tpu-recipes repository.