Compile ggml-org/llama.cpp
Compile and Install llama.cpp on the DGX Spark Founders Edition
This is a simple how-to article to explain how to compile and install llama.cpp on a DGX Spark Founders Edition.
No fluffy information, no fancy images. Just a set of shell commands that you might want to copy and paste on your terminal on a DGX Spark and get the llama.cpp inference server and tools installed.
Before we get to the commands, I’ll cover two things that are useful context: what LLM model formats exist and why llama.cpp is the right tool for this hardware.
If you just want the install steps, skip to section 3.
1. LLM Model Formats
When you download an LLM, you’re downloading billions of numerical weights. The format determines how those weights are stored, compressed, and what tools can read them. Here’s what’s out there.
PyTorch (.pt / .bin) — The original format. Uses Python’s pickle for serialization, which means arbitrary code execution risk on load. FP32 or FP16 weights. You’ll use this as a conversion source, never for direct inference on consumer hardware.
SafeTensors (.safetensors) — Hugging Face’s secure replacement for pickle. Memory-mapped, no code execution risk, lazy loading. The standard shipping format on Hugging Face. Still full-precision though.
GGUF (.gguf) — The llama.cpp format. Single file containing weights, tokenizer, metadata, config. Supports dozens of quantization schemes from Q2_K to Q8_0. This is the format you want for local inference. Not for training.
GPTQ — Post-training quantization for GPU inference. INT4/INT8 weights. Requires a calibration dataset. Good results but you’re locked to a specific bit-width and the quantization process is slow.
AWQ — Evolution of GPTQ. Protects “salient” weights during quantization. Generally better quality than GPTQ at the same bit-width. GPU-focused.
EXL2 — Variable bit-rate quantization from ExLlamaV2. Each layer gets a different quantization level. Very efficient on NVIDIA GPUs, but tied to the ExLlama ecosystem.
BitsAndBytes (NF4/INT8) — Not a format, it’s quantization-on-the-fly in Hugging Face Transformers. load_in_4bit=True and you’re done. Great for QLoRA fine-tuning. Inference performance is worse than dedicated formats.
Quick summary:
| Format | Bits | Target | Self-contained | Best for |
|---|---|---|---|---|
| PyTorch | FP16/32 | GPU | No | Training, conversion |
| SafeTensors | FP16/32 | GPU | No | Distribution |
| GGUF | 2-8 bit | CPU+GPU | Yes | Local inference |
| GPTQ | INT4/8 | GPU | No | GPU deployment |
| AWQ | INT4 | GPU | No | Quality GPU inference |
| EXL2 | Variable | GPU | No | Max quality/size ratio |
| BitsAndBytes | NF4/8 | GPU | No | QLoRA |
For the DGX Spark, GGUF wins. It gives you the most flexibility with quantization levels and the best tooling for this hardware.
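Since GGUF is the format we'll be working with, one handy property: every GGUF file starts with the four ASCII magic bytes `GGUF`, so you can sanity-check a download with no tooling at all. A minimal sketch — the `is_gguf` helper name and the `/tmp/demo.gguf` file are mine, the latter a synthetic stand-in for a real download:

```shell
# is_gguf: succeed if the file begins with the GGUF magic bytes.
is_gguf() { [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; }

# Demo against a synthetic header; a real downloaded .gguf checks the same way.
printf 'GGUF' > /tmp/demo.gguf
if is_gguf /tmp/demo.gguf; then echo "GGUF magic OK"; fi
```

A truncated or mislabeled download fails this check immediately, before you waste time loading it.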
2. Why llama.cpp
The DGX Spark Founders Edition runs the NVIDIA GB10 Grace Blackwell Superchip. That’s a 20-core ARM Grace CPU paired with a Blackwell GPU, sharing 128 GB of unified LPDDR5x memory via NVLink-C2C.
The key number is 273 GB/s of memory bandwidth. Sounds like a lot. It isn’t, for LLM inference. An RTX 5090 does 1.7 TB/s. An H100 SXM does 3.35 TB/s. LLM token generation is memory-bandwidth bound — every token requires reading the entire model weights from memory. Smaller weights = faster reads = faster inference. That’s why quantization matters so much here.
A 70B model in FP16 is ~140 GB. Doesn’t even fit. The same model in Q4_K_M is ~40 GB. Fits and runs at practical speeds.
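The arithmetic behind those claims is worth seeing once. Token generation reads roughly the full set of weights per token, so bandwidth divided by model size gives a hard ceiling on tok/s (real throughput is lower — KV-cache reads and compute overhead eat into it). A quick back-of-envelope using the numbers above:

```shell
# Ceiling estimate: tok/s <= memory bandwidth / bytes read per token.
awk 'BEGIN {
  bw = 273                                        # GB/s, GB10 unified memory
  printf "70B FP16   ~140 GB: exceeds 128 GB, does not fit\n"
  printf "70B Q4_K_M  ~40 GB: ceiling ~%.1f tok/s\n", bw / 40
}'
```

Roughly 7 tok/s is the best a 40 GB model can ever do on 273 GB/s — which is why every bit shaved off the weights shows up directly in generation speed.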
Why llama.cpp specifically:
Unified memory done right. On a normal desktop, offloading layers to CPU means crossing PCIe — massive bottleneck. On the Spark, CPU and GPU share the same physical memory. No transfer penalty. llama.cpp handles this transparently.
Quantization coverage. No other engine supports as many quantization schemes. The K-quant family uses different bit allocations for attention vs feed-forward layers. Every saved bit directly translates to faster tok/s on the Spark’s bandwidth-limited architecture.
Blackwell CUDA support. Compute capability sm_121 with optimized kernels. Experimental MXFP4 support for Blackwell’s 5th-gen Tensor Cores is landing in recent PRs.
No dependencies. C/C++ with optional CUDA backend. No PyTorch, no Python runtime. Less memory overhead, less complexity.
Built-in server. llama-server gives you an OpenAI-compatible HTTP API with continuous batching. No need for vLLM or TGI.
Ecosystem. Every new model gets GGUF quantizations on Hugging Face within hours of release.
3. Compile and Install
The DGX Spark ships with DGX OS (Ubuntu 24.04), CUDA 13.0, and the Blackwell drivers pre-installed. The compilation is straightforward.
I install llama.cpp to a versioned directory under /usr/local/ with a symlink, so I can keep multiple versions and roll back in one command. Old-school, works every time.
Check prerequisites
```shell
# GPU recognized?
nvidia-smi
# Look for: NVIDIA GB10, CUDA Version: 13.0

# CUDA compiler?
nvcc --version
# Need: release 13.0 or later

# Architecture?
uname -m
# Expected: aarch64
```

If nvcc reports anything below 13.0, update your CUDA toolkit. Blackwell requires CUDA 13.0+ for sm_121.
Install build tools
```shell
sudo apt update
sudo apt install -y git cmake build-essential nvtop htop
```

Clone and version
```shell
cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

VERSION=$(git rev-parse --short=8 HEAD)
echo "Building llama.cpp @ ${VERSION}"
```

Configure
```shell
mkdir -p build && cd build

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/usr/local/llama.cpp-${VERSION} \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_F16=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_CUDA_COMPILER=nvcc
```

The flags that matter:
-DCMAKE_INSTALL_PREFIX=/usr/local/llama.cpp-${VERSION} — Versioned install directory. Keep old builds, roll back anytime.

-DGGML_CUDA=ON — CUDA backend.

-DGGML_CUDA_F16=ON — FP16 CUDA kernels. Better throughput with quantized models.

-DCMAKE_CUDA_ARCHITECTURES=121 — Blackwell GPU, sm_121. Don't skip this.
Build and install
```shell
make -j$(nproc)
sudo make install
```

Takes 2-4 minutes with the Grace CPU's 20 cores.
Create the symlink
```shell
sudo ln -sfn /usr/local/llama.cpp-${VERSION} /usr/local/llama
```

Every system path points to /usr/local/llama. Upgrading later:
```shell
# Build new version, then:
sudo ln -sfn /usr/local/llama.cpp-${NEW_VERSION} /usr/local/llama
```

The old version stays on disk. Rolling back is one command.
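If you flip versions often, the symlink dance is worth a tiny helper. A sketch — the `llama_switch` name and the optional prefix argument are mine; under /usr/local you'd prepend sudo:

```shell
# llama_switch VERSION [PREFIX] — point the active symlink at an
# installed llama.cpp-VERSION directory. PREFIX defaults to /usr/local.
llama_switch() {
  prefix="${2:-/usr/local}"
  target="${prefix}/llama.cpp-$1"
  if [ ! -d "$target" ]; then
    echo "no such version: $target" >&2
    return 1
  fi
  ln -sfn "$target" "${prefix}/llama"
  echo "active: $(readlink "${prefix}/llama")"
}
```

Usage: `llama_switch a1b2c3d4`. It refuses to switch to a version that isn't installed, so a typo can't leave you with a dangling symlink.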
Configure system paths
Add binaries to PATH:
```shell
cat <<'EOF' | sudo tee /etc/profile.d/llama.sh
export PATH="/usr/local/llama/bin:/usr/local/llama/sbin:$PATH"
EOF
sudo chmod 0644 /etc/profile.d/llama.sh
```

Register shared libraries:
```shell
cat <<'EOF' | sudo tee /etc/ld.so.conf.d/llama.conf
/usr/local/llama/lib
EOF
sudo ldconfig
```

Log out and back in, or:
```shell
source /etc/profile.d/llama.sh
```

Verify
```shell
which llama-cli
# /usr/local/llama/bin/llama-cli

which llama-server
# /usr/local/llama/bin/llama-server

ldd $(which llama-cli) | grep cuda
# Should show libcudart.so.13, libcublas.so.13, libcuda.so.1

llama-server --version
# Look for: Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
```

compute capability 12.1 confirms you're targeting Blackwell.
Test it
```shell
mkdir -p ~/models && cd ~/models
uv pip install -U huggingface_hub
uv run hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
  --local-dir TinyLlama-1.1B
```

```shell
llama-cli \
  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
  -ngl 99 \
  -t 16 \
  -p "Explain the advantages of running LLMs locally."
```

-ngl 99 offloads all layers to the GPU. On the Spark's unified memory, the GPU reads weights directly from shared memory — no PCIe overhead.
Run nvtop in another terminal to confirm GPU utilization.
Run as a server
```shell
llama-server \
  -m ~/models/your-model.gguf \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```

OpenAI-compatible API at http://your-spark-ip:8080. Point any client library at it.
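If you want the server to survive reboots, a systemd unit in the style of the config files above works well. This is a sketch only — the model path and `User` value are placeholders you'll need to adjust:

```shell
cat <<'EOF' | sudo tee /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp inference server
After=network-online.target

[Service]
ExecStart=/usr/local/llama/bin/llama-server \
    -m /home/youruser/models/your-model.gguf \
    -ngl 99 -c 8192 --host 0.0.0.0 --port 8080
Restart=on-failure
User=youruser

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```

Because ExecStart goes through the /usr/local/llama symlink, flipping the symlink to a new build plus a `sudo systemctl restart llama-server` is a complete upgrade.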
What’s on disk
```
/usr/local/
├── llama.cpp-a1b2c3d4/          # Current version
│   ├── bin/
│   │   ├── llama-cli
│   │   ├── llama-server
│   │   └── ...
│   ├── lib/
│   │   ├── libggml.so
│   │   ├── libggml-cuda.so
│   │   └── ...
│   └── include/
├── llama.cpp-e5f6g7h8/          # Previous version (rollback ready)
│   └── ...
└── llama -> llama.cpp-a1b2c3d4/ # Symlink to active version

/etc/
├── profile.d/llama.sh           # PATH setup
└── ld.so.conf.d/llama.conf      # Library path
```

That's it. You have llama.cpp compiled for Blackwell, installed with versioning, and available system-wide. Go download a model and start generating tokens.