Running a Large GGUF Model with llama.cpp on Two GPUs

Running a 27B model locally sounds like something that should require a datacenter GPU. In practice, with llama.cpp, GGUF quantization, and two consumer NVIDIA cards, it can be very usable.

This is a practical setup based on a real run of a Qwen3.6 27B GGUF model split across two GPUs. The exact model is not the important part: the useful idea is how to make a large GGUF model fit, how to split it between cards, and which flags matter most.

The hardware idea

The machine used two NVIDIA GPUs with different VRAM sizes:

one 16 GB GPU
one 12 GB GPU
around 28 GB total VRAM available

The model was a 27B GGUF quantized model. With KV cache quantization enabled, it was possible to run a very large context and still keep the server stable.

In my test, generation speed was around 17 tokens/second, which is very usable for local coding, writing, and agent workflows.

Why llama.cpp?

llama.cpp is great for this kind of setup because it can:

run GGUF models directly
offload layers to GPU
split work across multiple GPUs
expose an OpenAI-compatible HTTP server with llama-server
use quantized KV cache to reduce memory pressure

For mixed consumer GPUs, the multi-GPU support is especially useful. You do not need identical cards.

Clean example script

Below is a cleaned version of the server script. Replace the paths with your own build and model locations.

#!/usr/bin/env bash
set -euo pipefail

LLAMA_SERVER="/path/to/llama-server"
MODEL="/path/to/model.gguf"
MMPROJ="/path/to/mmproj.gguf" # optional, only needed for vision/multimodal models

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES="1,0"

exec "$LLAMA_SERVER" \
  -m "$MODEL" \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 210000 \
  --n-gpu-layers 100 \
  --mmproj "$MMPROJ" \
  --split-mode layer \
  --tensor-split 17,10 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --cache-ram 2048 \
  --threads 6 \
  --batch-size 128 \
  --parallel 2 \
  --predict 4096 \
  --reasoning off \
  --temp 0.7 \
  --top-p 0.80 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0

If your model is text-only, remove the MMPROJ variable and the --mmproj line.

The important flags

`CUDA_VISIBLE_DEVICES="1,0"`

This controls which GPUs llama.cpp sees and in what order. The order matters because the tensor split values map to this visible order.

If your bigger GPU is device 1, putting it first lets you give it more of the model.

`--split-mode layer`

This tells llama.cpp to split layers across GPUs. For a large model on uneven cards, this is often the simplest working mode.

`--tensor-split 17,10`

This is the VRAM weighting between the visible GPUs. It does not need to match exact GB perfectly, but it should roughly reflect the available memory.

For example:

CUDA_VISIBLE_DEVICES="1,0"
--tensor-split 17,10

means the first visible GPU gets the larger share.

If you get out-of-memory errors, reduce the share on the GPU that is filling up.

`--cache-type-k q4_0` and `--cache-type-v q4_0`

These are key for large context windows. The KV cache grows with context size, so quantizing it can save a lot of memory.

Without KV cache quantization, a huge context can fail even when the model weights fit.

`--ctx-size 210000`

This is a very large context. You do not need to start this high.

For a safer first test, try:

--ctx-size 32768

Then increase it while watching VRAM usage.

`--parallel 2`

This allows more than one parallel sequence/request. It is useful for server usage, but it also increases memory use. If you are debugging OOM problems, set it to 1 first.

Start smaller, then increase

A good first boot command is more conservative:

exec "$LLAMA_SERVER" \
  -m "$MODEL" \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 100 \
  --split-mode layer \
  --tensor-split 17,10 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --threads 6 \
  --batch-size 128 \
  --parallel 1 \
  --predict 4096

After it works, increase --ctx-size, then --parallel, and only then tune sampling.

Monitoring

Keep another terminal open with:

watch -n 1 nvidia-smi

Check:

VRAM usage on each GPU
GPU utilization while generating
temperature
whether one card is much closer to OOM than the other

If the server crashes, useful checks are:

dmesg -T | tail -n 100
free -h
nvidia-smi

What worked well

The combination that made the biggest difference was:

GGUF quantized model
full GPU layer offload
layer split across two GPUs
weighted tensor split for uneven VRAM
q4 KV cache
conservative batch and parallel settings

This is not the only possible configuration, but it is a solid starting point if you want to run a model that is slightly too large for one card.

Final notes

Multi-GPU local inference is not only for huge servers. With llama.cpp, two normal GPUs can become a surprisingly capable local AI box.

The main trick is to stop thinking only about model size and start thinking about total memory pressure: model weights, KV cache, context size, batch size, and parallel requests all compete for VRAM.

Tune those together, and large local models become much more practical.