Specs
$ lscpu | grep -E 'Model name|Socket|Core|Thread|CPU\(s\)|MHz|Cache|NUMA'
CPU(s): 10
On-line CPU(s) list: 0-9
Model name: Intel(R) Core(TM) Ultra 5 225F
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 1
CPU(s) scaling MHz: 25%
CPU max MHz: 5000.0000
CPU min MHz: 800.0000
NUMA node(s): 1
NUMA node0 CPU(s): 0-9
$ nvidia-smi
Sun Mar 29 14:11:28 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Ti Off | 00000000:01:00.0 On | Off |
| 0% 48C P8 27W / 480W | 17MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Build Steps
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_OPENSSL=ON
cmake --build build --config Release -j8
Llama-2-7B
Link: TheBloke/Llama-2-7B-GGUF:Q4_K_M
Included so the numbers can be compared against crowdsourced benchmarks.
$ llama-bench -hf TheBloke/Llama-2-7B-GGUF:Q4_K_M -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_K_M.gguf
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5587.67 ± 94.72 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 167.19 ± 0.16 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 6037.83 ± 55.82 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 174.48 ± 0.26 |
build: 2405d59cb (8577)
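To make the -fa 0 versus -fa 1 comparison easier to read, here is a minimal Python sketch that parses the table rows above and reports the relative change in throughput with flash attention enabled. The helper names (parse_rows, fa_change_percent) are made up for illustration, and the parser assumes the eight-column layout printed by this particular build; other llama-bench versions or flags may print different columns.

```python
# Rows copied verbatim from the llama-bench output above (Llama-2-7B, Q4_K_M).
TABLE = """\
| model | size | params | backend | ngl | fa | test | t/s |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 5587.67 ± 94.72 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 167.19 ± 0.16 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 6037.83 ± 55.82 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 174.48 ± 0.26 |
"""


def parse_rows(table: str) -> dict:
    """Map (test, fa) -> mean tokens/s for each data row of the table.

    Assumes the column order shown above: fa is column 6, test is column 7,
    and the last column holds "mean ± stddev".
    """
    results = {}
    for line in table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 8 or "±" not in cells[-1]:
            continue  # skip the header, separator, and any log lines
        fa, test = int(cells[5]), cells[6]
        mean = float(cells[-1].split("±")[0])
        results[(test, fa)] = mean
    return results


def fa_change_percent(results: dict, test: str) -> float:
    """Relative change in tokens/s going from -fa 0 to -fa 1, in percent."""
    off, on = results[(test, 0)], results[(test, 1)]
    return (on / off - 1.0) * 100.0


results = parse_rows(TABLE)
for test in ("pp512", "tg128"):
    print(f"{test}: {fa_change_percent(results, test):+.2f}% with -fa 1")
```

On the numbers above this prints +8.06% for pp512 and +4.36% for tg128. The same two functions can be pointed at the other tables by swapping the TABLE text.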
Qwen3.5-27B (dense)
Link: unsloth/Qwen3.5-27B-GGUF:Q4_K_M
This one really made the fans spin like crazy.
$ llama-bench -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--unsloth--Qwen3.5-27B-GGUF/snapshots/3221f178a6b842d04f1fb42f1c413534adcc0a6a/Qwen3.5-27B-Q4_K_M.gguf
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | CUDA | 99 | 0 | pp512 | 1448.69 ± 40.18 |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | CUDA | 99 | 0 | tg128 | 43.73 ± 0.06 |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | CUDA | 99 | 1 | pp512 | 1416.00 ± 10.01 |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | CUDA | 99 | 1 | tg128 | 43.87 ± 0.02 |
build: 2405d59cb (8577)
Qwen3.5-35B-A3B (sparse)
Link: unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_S
$ llama-bench -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_S -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-Q4_K_S.gguf
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small | 19.24 GiB | 34.66 B | CUDA | 99 | 0 | pp512 | 3246.69 ± 82.80 |
| qwen35moe 35B.A3B Q4_K - Small | 19.24 GiB | 34.66 B | CUDA | 99 | 0 | tg128 | 151.21 ± 0.90 |
| qwen35moe 35B.A3B Q4_K - Small | 19.24 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 3336.79 ± 13.95 |
| qwen35moe 35B.A3B Q4_K - Small | 19.24 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 151.21 ± 0.39 |
build: 2405d59cb (8577)
Nemotron-Cascade-2-30B-A3B (sparse)
Link: bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:IQ4_XS
$ llama-bench -hf bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:IQ4_XS -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24109 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24109 MiB
common_download_file_single_online: using cached file: /home/matto/.cache/huggingface/hub/models--bartowski--nvidia_Nemotron-Cascade-2-30B-A3B-GGUF/snapshots/931b595fc71b7ca14fb9d935af011f69f7c0434c/nvidia_Nemotron-Cascade-2-30B-A3B-IQ4_XS.gguf
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw | 16.91 GiB | 31.58 B | CUDA | 99 | 0 | pp512 | 4174.95 ± 33.95 |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw | 16.91 GiB | 31.58 B | CUDA | 99 | 0 | tg128 | 221.24 ± 0.70 |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw | 16.91 GiB | 31.58 B | CUDA | 99 | 1 | pp512 | 4157.61 ± 33.44 |
| nemotron_h_moe 31B.A3.5B IQ4_XS - 4.25 bpw | 16.91 GiB | 31.58 B | CUDA | 99 | 1 | tg128 | 223.40 ± 0.40 |
build: 2405d59cb (8577)
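For a quick side-by-side of the four runs, the sketch below takes the tg128 means with -fa 1 from the tables above and expresses each as a multiple of the dense Qwen3.5-27B result. The choice of baseline and the relative_to helper are my own for illustration. The quantizations differ between models (Q4_K_M, Q4_K_S, IQ4_XS), so treat the ratios as a rough guide rather than a controlled comparison.

```python
# tg128 throughput in tokens/s with -fa 1, copied from the four tables above.
TG128_FA1 = {
    "Llama-2-7B Q4_K_M (dense)": 174.48,
    "Qwen3.5-27B Q4_K_M (dense)": 43.87,
    "Qwen3.5-35B-A3B Q4_K_S (sparse)": 151.21,
    "Nemotron-Cascade-2-30B-A3B IQ4_XS (sparse)": 223.40,
}

# Arbitrary baseline chosen for this sketch: the slowest (dense 27B) model.
BASELINE = "Qwen3.5-27B Q4_K_M (dense)"


def relative_to(baseline: str, table: dict) -> dict:
    """Return each model's throughput divided by the baseline's throughput."""
    base = table[baseline]
    return {name: tps / base for name, tps in table.items()}


ratios = relative_to(BASELINE, TG128_FA1)

# Print fastest first, with the raw t/s next to the ratio.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<45} {TG128_FA1[name]:>7.2f} t/s  {ratio:.2f}x")
```

Both sparse models generate tokens several times faster than the dense 27B model on this card, even though their files are about the same size or larger.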