Benchmarking llama.cpp on a 3090 Ti

Specs

$ lscpu | grep -E 'Model name|Socket|Core|Thread|CPU\(s\)|MHz|Cache|NUMA'
CPU(s):                                  10
On-line CPU(s) list:                     0-9
Model name:                              Intel(R) Core(TM) Ultra 5 225F
Thread(s) per core:                      1
Core(s) per socket:                      10
Socket(s):                               1
CPU(s) scaling MHz:                      25%
CPU max MHz:                             5000.0000
CPU min MHz:                             800.0000
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-9

$ nvidia-smi
Sun Mar 29 14:11:28 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:01:00.0  On |                  Off |
|  0%   48C    P8             27W /  480W |      17MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Build Steps

cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_OPENSSL=ON
cmake --build build --config Release -j8

Llama-2-7B

Link: TheBloke/Llama-2-7B-GGUF:Q4_K_M

This model is included so the results can be compared against crowdsourced benchmarks.

$ llama-bench -hf TheBloke/Llama-2-7B-GGUF:Q4_K_M -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_K_M.gguf
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  0 |           pp512 |      5587.67 ± 94.72 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  0 |           tg128 |        167.19 ± 0.16 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  1 |           pp512 |      6037.83 ± 55.82 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  1 |           tg128 |        174.48 ± 0.26 |

build: 2405d59cb (8577)
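To read the flash-attention effect off the table more easily, the relative change between `fa=0` and `fa=1` can be computed from the rows directly. The following is a minimal sketch, not part of llama.cpp: `fa_speedup` is a hypothetical helper name, and it assumes the column order of the markdown table printed above (fa in the sixth column, test in the seventh, t/s in the eighth).

```shell
#!/usr/bin/env bash
# Sketch: per-test speedup of fa=1 over fa=0 from llama-bench markdown rows.
# fa_speedup is a hypothetical helper; it assumes the column layout shown above.
set -euo pipefail

fa_speedup() {
  # Rows start with '|', so after splitting on '|':
  #   $7 = fa flag, $8 = test name, $9 = "mean ± stddev"
  awk -F'|' '
    /pp[0-9]+|tg[0-9]+/ {
      fa = $7 + 0                      # 0 or 1
      name = $8; gsub(/ /, "", name)   # e.g. "pp512"
      tps[name, fa] = $9 + 0           # numeric prefix = mean t/s
      if (!(name in seen)) { seen[name] = 1; order[++n] = name }
    }
    END {
      for (i = 1; i <= n; i++) {
        t = order[i]
        printf "%s: %+.1f%%\n", t, (tps[t, 1] / tps[t, 0] - 1) * 100
      }
    }'
}

# Rows copied from the Llama-2-7B table above.
fa_speedup <<'EOF'
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  0 |           pp512 |      5587.67 ± 94.72 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  0 |           tg128 |        167.19 ± 0.16 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  1 |           pp512 |      6037.83 ± 55.82 |
| llama 7B Q4_K - Medium         |   3.80 GiB |     6.74 B | CUDA       |  99 |  1 |           tg128 |        174.48 ± 0.26 |
EOF
```

For these rows it prints `pp512: +8.1%` and `tg128: +4.4%`. The same function should work on any single-model llama-bench table with this layout, since it only looks at rows whose test name starts with `pp` or `tg`.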

Qwen3.5-27B (dense)

Link: unsloth/Qwen3.5-27B-GGUF:Q4_K_M

This one really made the fans spin like crazy.

$ llama-bench -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--unsloth--Qwen3.5-27B-GGUF/snapshots/3221f178a6b842d04f1fb42f1c413534adcc0a6a/Qwen3.5-27B-Q4_K_M.gguf
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.58 GiB |    26.90 B | CUDA       |  99 |  0 |           pp512 |      1448.69 ± 40.18 |
| qwen35 27B Q4_K - Medium       |  15.58 GiB |    26.90 B | CUDA       |  99 |  0 |           tg128 |         43.73 ± 0.06 |
| qwen35 27B Q4_K - Medium       |  15.58 GiB |    26.90 B | CUDA       |  99 |  1 |           pp512 |      1416.00 ± 10.01 |
| qwen35 27B Q4_K - Medium       |  15.58 GiB |    26.90 B | CUDA       |  99 |  1 |           tg128 |         43.87 ± 0.02 |

build: 2405d59cb (8577)

Qwen3.5-35B-A3B (sparse)

Link: unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_S

$ llama-bench -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_S -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-Q4_K_S.gguf
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small |  19.24 GiB |    34.66 B | CUDA       |  99 |  0 |           pp512 |      3246.69 ± 82.80 |
| qwen35moe 35B.A3B Q4_K - Small |  19.24 GiB |    34.66 B | CUDA       |  99 |  0 |           tg128 |        151.21 ± 0.90 |
| qwen35moe 35B.A3B Q4_K - Small |  19.24 GiB |    34.66 B | CUDA       |  99 |  1 |           pp512 |      3336.79 ± 13.95 |
| qwen35moe 35B.A3B Q4_K - Small |  19.24 GiB |    34.66 B | CUDA       |  99 |  1 |           tg128 |        151.21 ± 0.39 |

build: 2405d59cb (8577)

Nemotron-Cascade-2-30B-A3B (sparse)

Link: bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:IQ4_XS

$ llama-bench -hf bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:IQ4_XS -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--bartowski--nvidia_Nemotron-Cascade-2-30B-A3B-GGUF/snapshots/931b595fc71b7ca14fb9d935af011f69f7c0434c/nvidia_Nemotron-Cascade-2-30B-A3B-IQ4_XS.gguf
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw |  16.91 GiB |    31.58 B | CUDA       |  99 |  0 |           pp512 |      4174.95 ± 33.95 |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw |  16.91 GiB |    31.58 B | CUDA       |  99 |  0 |           tg128 |        221.24 ± 0.70 |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw |  16.91 GiB |    31.58 B | CUDA       |  99 |  1 |           pp512 |      4157.61 ± 33.44 |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw |  16.91 GiB |    31.58 B | CUDA       |  99 |  1 |           tg128 |        223.40 ± 0.40 |

build: 2405d59cb (8577)
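The four runs above use the same flags and differ only in the model reference, so they can be repeated in one pass. The sketch below assumes `llama-bench` is on `PATH`; `OUT_DIR`, `DRY_RUN`, and `run_all` are illustrative names introduced here, not llama.cpp options. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: rerun the four benchmarks above and keep one log file per model.
# Assumptions: llama-bench is on PATH; OUT_DIR, DRY_RUN and run_all are
# hypothetical names for this example, not part of llama.cpp.
set -euo pipefail

MODELS=(
  "TheBloke/Llama-2-7B-GGUF:Q4_K_M"
  "unsloth/Qwen3.5-27B-GGUF:Q4_K_M"
  "unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_S"
  "bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:IQ4_XS"
)
OUT_DIR="${OUT_DIR:-bench-results}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = actually run them

run_all() {
  local model log
  for model in "${MODELS[@]}"; do
    # Turn "org/repo:quant" into a filesystem-safe file name.
    log="$OUT_DIR/$(printf '%s' "$model" | tr '/:' '__').md"
    if [ "$DRY_RUN" = "1" ]; then
      echo "llama-bench -hf $model -ngl 99 -fa 0,1 > $log"
    else
      mkdir -p "$OUT_DIR"
      # Same flags as the individual runs above.
      llama-bench -hf "$model" -ngl 99 -fa 0,1 | tee "$log"
    fi
  done
}

run_all
```

Running it with `DRY_RUN=0` executes the benchmarks back to back. Since the GPU heats up over consecutive runs, later models may see slightly different clocks than they would in isolation, so a pause between runs could be worth adding.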

Matto
