Running a 27B model locally sounds like something that should require a datacenter GPU. In practice, with llama.cpp, GGUF quantization, and two consumer NVIDIA cards, it can be very usable.
This is a practical setup based on a real run of a Qwen3.6 27B GGUF model split across two GPUs. The exact model is not the important part: the useful idea is how to make a large GGUF model fit, how to split it between cards, and which flags matter most.
The hardware idea
The machine used two NVIDIA GPUs with different VRAM sizes:
- one 16 GB GPU
- one 12 GB GPU
- around 28 GB total VRAM available
The model was a 27B GGUF quantized model. With KV cache quantization enabled, it was possible to run a very large context and still keep the server stable.
In my test, generation speed was around 17 tokens/second, which is very usable for local coding, writing, and agent workflows.
Why llama.cpp?
llama.cpp is great for this kind of setup because it can:
- run GGUF models directly
- offload layers to GPU
- split work across multiple GPUs
- expose an OpenAI-compatible HTTP server with
llama-server - use quantized KV cache to reduce memory pressure
For mixed consumer GPUs, the multi-GPU support is especially useful. You do not need identical cards.
Clean example script
Below is a cleaned version of the server script. Replace the paths with your own build and model locations.
#!/usr/bin/env bash
set -euo pipefail
LLAMA_SERVER="/path/to/llama-server"
MODEL="/path/to/model.gguf"
MMPROJ="/path/to/mmproj.gguf" # optional, only needed for vision/multimodal models
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES="1,0"
exec "$LLAMA_SERVER" \
-m "$MODEL" \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 210000 \
--n-gpu-layers 100 \
--mmproj "$MMPROJ" \
--split-mode layer \
--tensor-split 17,10 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--cache-ram 2048 \
--threads 6 \
--batch-size 128 \
--parallel 2 \
--predict 4096 \
--reasoning off \
--temp 0.7 \
--top-p 0.80 \
--top-k 20 \
--presence-penalty 1.5 \
--repeat-penalty 1.0
If your model is text-only, remove the MMPROJ variable and the --mmproj line.
The important flags
CUDA_VISIBLE_DEVICES="1,0"
This controls which GPUs llama.cpp sees and in what order. The order matters because the tensor split values map to this visible order.
If your bigger GPU is device 1, putting it first lets you give it more of the model.
--split-mode layer
This tells llama.cpp to split layers across GPUs. For a large model on uneven cards, this is often the simplest working mode.
--tensor-split 17,10
This is the VRAM weighting between the visible GPUs. It does not need to match exact GB perfectly, but it should roughly reflect the available memory.
For example:
CUDA_VISIBLE_DEVICES="1,0"
--tensor-split 17,10
means the first visible GPU gets the larger share.
If you get out-of-memory errors, reduce the share on the GPU that is filling up.
--cache-type-k q4_0 and --cache-type-v q4_0
These are key for large context windows. The KV cache grows with context size, so quantizing it can save a lot of memory.
Without KV cache quantization, a huge context can fail even when the model weights fit.
--ctx-size 210000
This is a very large context. You do not need to start this high.
For a safer first test, try:
--ctx-size 32768
Then increase it while watching VRAM usage.
--parallel 2
This allows more than one parallel sequence/request. It is useful for server usage, but it also increases memory use. If you are debugging OOM problems, set it to 1 first.
Start smaller, then increase
A good first boot command is more conservative:
exec "$LLAMA_SERVER" \
-m "$MODEL" \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 32768 \
--n-gpu-layers 100 \
--split-mode layer \
--tensor-split 17,10 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--threads 6 \
--batch-size 128 \
--parallel 1 \
--predict 4096
After it works, increase --ctx-size, then --parallel, and only then tune sampling.
Monitoring
Keep another terminal open with:
watch -n 1 nvidia-smi
Check:
- VRAM usage on each GPU
- GPU utilization while generating
- temperature
- whether one card is much closer to OOM than the other
If the server crashes, useful checks are:
dmesg -T | tail -n 100
free -h
nvidia-smi
What worked well
The combination that made the biggest difference was:
- GGUF quantized model
- full GPU layer offload
- layer split across two GPUs
- weighted tensor split for uneven VRAM
- q4 KV cache
- conservative batch and parallel settings
This is not the only possible configuration, but it is a solid starting point if you want to run a model that is slightly too large for one card.
Final notes
Multi-GPU local inference is not only for huge servers. With llama.cpp, two normal GPUs can become a surprisingly capable local AI box.
The main trick is to stop thinking only about model size and start thinking about total memory pressure: model weights, KV cache, context size, batch size, and parallel requests all compete for VRAM.
Tune those together, and large local models become much more practical.