Mauro Medda

Compile ggml-org/llama.cpp

Compile and Install llama.cpp on the DGX Spark Founders Edition

This is a simple how-to article explaining how to compile and install llama.cpp on a DGX Spark Founders Edition. No fluffy information, no fancy images. Just a set of shell commands you can copy and paste into a terminal on a DGX Spark to get the llama.cpp inference server and tools installed.

Before we get to the commands, I’ll cover two things that are useful context: what LLM model formats exist and why llama.cpp is the right tool for this hardware.

If you just want the install steps, skip to section 3.


1. LLM Model Formats

When you download an LLM, you’re downloading billions of numerical weights. The format determines how those weights are stored, compressed, and what tools can read them. Here’s what’s out there.

PyTorch (.pt / .bin) — The original format. Uses Python’s pickle for serialization, which means arbitrary code execution risk on load. FP32 or FP16 weights. You’ll use this as a conversion source, never for direct inference on consumer hardware.

SafeTensors (.safetensors) — Hugging Face’s secure replacement for pickle. Memory-mapped, no code execution risk, lazy loading. The standard shipping format on Hugging Face. Still full-precision though.

GGUF (.gguf) — The llama.cpp format. Single file containing weights, tokenizer, metadata, config. Supports dozens of quantization schemes from Q2_K to Q8_0. This is the format you want for local inference. Not for training.

GPTQ — Post-training quantization for GPU inference. INT4/INT8 weights. Requires a calibration dataset. Good results but you’re locked to a specific bit-width and the quantization process is slow.

AWQ — Evolution of GPTQ. Protects “salient” weights during quantization. Generally better quality than GPTQ at the same bit-width. GPU-focused.

EXL2 — Variable bit-rate quantization from ExLlamaV2. Each layer gets a different quantization level. Very efficient on NVIDIA GPUs, but tied to the ExLlama ecosystem.

BitsAndBytes (NF4/INT8) — Not a file format but on-the-fly quantization in Hugging Face Transformers. load_in_4bit=True and you’re done. Great for QLoRA fine-tuning. Inference performance is worse than dedicated formats.

Quick summary:

Format        Bits      Target    Self-contained  Best for
PyTorch       FP16/32   GPU       No              Training, conversion
SafeTensors   FP16/32   GPU       No              Distribution
GGUF          2-8 bit   CPU+GPU   Yes             Local inference
GPTQ          INT4/8    GPU       No              GPU deployment
AWQ           INT4      GPU       No              Quality GPU inference
EXL2          Variable  GPU       No              Max quality/size ratio
BitsAndBytes  NF4/8     GPU       No              QLoRA

For the DGX Spark, GGUF wins. It gives you the most flexibility with quantization levels and the best tooling for this hardware.


2. Why llama.cpp

The DGX Spark Founders Edition runs the NVIDIA GB10 Grace Blackwell Superchip. That’s a 20-core ARM Grace CPU paired with a Blackwell GPU, sharing 128 GB of unified LPDDR5x memory via NVLink-C2C.

The key number is 273 GB/s of memory bandwidth. Sounds like a lot. It isn’t, for LLM inference. An RTX 5090 does 1.7 TB/s. An H100 SXM does 3.35 TB/s. LLM token generation is memory-bandwidth bound — every token requires reading the entire model weights from memory. Smaller weights = faster reads = faster inference. That’s why quantization matters so much here.

A 70B model in FP16 is ~140 GB. Doesn’t even fit. The same model in Q4_K_M is ~40 GB. Fits and runs at practical speeds.
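The back-of-the-envelope math is easy to script. A minimal sketch using integer shell arithmetic, assuming Q4_K_M averages roughly 4.5 bits per weight:

```shell
# Rough model sizing: size = parameters x bits-per-weight / 8.
PARAMS_B=70                          # billions of parameters
FP16_GB=$(( PARAMS_B * 16 / 8 ))     # FP16: 2 bytes per weight
Q4_GB=$(( PARAMS_B * 45 / 80 ))      # Q4_K_M: ~4.5 bits per weight (45/10)
echo "FP16:   ~${FP16_GB} GB"
echo "Q4_K_M: ~${Q4_GB} GB"

# Bandwidth-bound ceiling: every token reads all weights once,
# so tokens/s <= memory bandwidth / model size.
BW_GBS=273                           # DGX Spark memory bandwidth in GB/s
echo "Ceiling: ~$(( BW_GBS / Q4_GB )) tok/s for the Q4_K_M model"
```

That ceiling ignores KV-cache reads and compute, so real throughput lands below it, but it shows why shaving bits off the weights pays off directly on this machine.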

Why llama.cpp specifically:

Unified memory done right. On a normal desktop, offloading layers to CPU means crossing PCIe — massive bottleneck. On the Spark, CPU and GPU share the same physical memory. No transfer penalty. llama.cpp handles this transparently.

Quantization coverage. No other engine supports as many quantization schemes. The K-quant family uses different bit allocations for attention vs feed-forward layers. Every saved bit directly translates to faster tok/s on the Spark’s bandwidth-limited architecture.

Blackwell CUDA support. Compute capability sm_121 with optimized kernels. Experimental MXFP4 support for Blackwell’s 5th-gen Tensor Cores is landing in recent PRs.

No dependencies. C/C++ with optional CUDA backend. No PyTorch, no Python runtime. Less memory overhead, less complexity.

Built-in server. llama-server gives you an OpenAI-compatible HTTP API with continuous batching. No need for vLLM or TGI.

Ecosystem. Every new model gets GGUF quantizations on Hugging Face within hours of release.


3. Compile and Install

The DGX Spark ships with DGX OS (Ubuntu 24.04), CUDA 13.0, and the Blackwell drivers pre-installed. The compilation is straightforward.

I install llama.cpp to a versioned directory under /usr/local/ with a symlink, so I can keep multiple versions and roll back in one command. Old-school, works every time.

Check prerequisites

# GPU recognized?
nvidia-smi
# Look for: NVIDIA GB10, CUDA Version: 13.0

# CUDA compiler?
nvcc --version
# Need: release 13.0 or later

# Architecture?
uname -m
# Expected: aarch64

If nvcc reports anything below 13.0, update your CUDA toolkit. Blackwell requires CUDA 13.0+ for sm_121.

Install build tools

sudo apt update
sudo apt install -y git cmake build-essential nvtop htop

Clone and version

cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

VERSION=$(git rev-parse --short=8 HEAD)
echo "Building llama.cpp @ ${VERSION}"

Configure

mkdir -p build && cd build

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/usr/local/llama.cpp-${VERSION} \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_F16=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_CUDA_COMPILER=nvcc

The flags that matter: -DGGML_CUDA=ON builds the CUDA backend, -DGGML_CUDA_F16=ON enables FP16 arithmetic in the CUDA kernels, -DCMAKE_CUDA_ARCHITECTURES=121 targets the GB10’s sm_121 compute capability, and -DCMAKE_INSTALL_PREFIX keeps the install in its own versioned directory for the symlink scheme below.

Build and install

make -j$(nproc)
sudo make install

Takes 2-4 minutes with the Grace CPU’s 20 cores.

sudo ln -sfn /usr/local/llama.cpp-${VERSION} /usr/local/llama

Every system path points to /usr/local/llama. Upgrading later:

# Build new version, then:
sudo ln -sfn /usr/local/llama.cpp-${NEW_VERSION} /usr/local/llama

The old version stays on disk. Rollback is one command.
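You can sanity-check the upgrade/rollback dance without touching /usr/local. A throwaway demo of the same symlink pattern in a scratch directory (the version hashes here are made up):

```shell
# Simulate the versioned-install layout in a temp dir.
root=$(mktemp -d)
mkdir -p "$root/llama.cpp-aaaa1111/bin" "$root/llama.cpp-bbbb2222/bin"

ln -sfn "$root/llama.cpp-aaaa1111" "$root/llama"   # initial install
ln -sfn "$root/llama.cpp-bbbb2222" "$root/llama"   # upgrade
ln -sfn "$root/llama.cpp-aaaa1111" "$root/llama"   # rollback: one command
readlink "$root/llama"                             # back on the old version
```

The -n flag is the important part: without it, ln would create the new link inside the directory the old symlink points to instead of replacing the symlink itself.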

Configure system paths

Add binaries to PATH:

cat <<'EOF' | sudo tee /etc/profile.d/llama.sh
export PATH="/usr/local/llama/bin:/usr/local/llama/sbin:$PATH"
EOF
sudo chmod 0644 /etc/profile.d/llama.sh

Register shared libraries:

cat <<'EOF' | sudo tee /etc/ld.so.conf.d/llama.conf
/usr/local/llama/lib
EOF
sudo ldconfig

Log out and back in, or:

source /etc/profile.d/llama.sh

Verify

which llama-cli
# /usr/local/llama/bin/llama-cli

which llama-server
# /usr/local/llama/bin/llama-server

ldd $(which llama-cli) | grep cuda
# Should show libcudart.so.13, libcublas.so.13, libcuda.so.1

llama-server --version
# Look for: Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

Compute capability 12.1 confirms you’re targeting Blackwell.

Test it

mkdir -p ~/models && cd ~/models
uv pip install -U huggingface_hub
uv run hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
  --local-dir TinyLlama-1.1B

llama-cli \
  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
  -ngl 99 \
  -t 16 \
  -p "Explain the advantages of running LLMs locally."

-ngl 99 offloads all layers to GPU. On the Spark’s unified memory, the GPU reads weights directly from shared memory — no PCIe overhead.

Run nvtop in another terminal to confirm GPU utilization.

Run as a server

llama-server \
  -m ~/models/your-model.gguf \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

OpenAI-compatible API at http://your-spark-ip:8080. Point any client library at it.
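A quick smoke test of the API, sketched with curl. The prompt and max_tokens value are placeholders, and localhost:8080 assumes you’re running it on the Spark itself with the server options above:

```shell
# Minimal chat-completion request against llama-server's
# OpenAI-compatible endpoint.
payload='{"messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || true   # || true: don't abort a script if the server is down
```

Any OpenAI client SDK works the same way: point its base URL at http://your-spark-ip:8080/v1 and ignore the API key.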

What’s on disk

/usr/local/
├── llama.cpp-a1b2c3d4/          # Current version
│   ├── bin/
│   │   ├── llama-cli
│   │   ├── llama-server
│   │   └── ...
│   ├── lib/
│   │   ├── libggml.so
│   │   ├── libggml-cuda.so
│   │   └── ...
│   └── include/
├── llama.cpp-e5f6g7h8/          # Previous version (rollback ready)
│   └── ...
└── llama -> llama.cpp-a1b2c3d4/  # Symlink to active version

/etc/
├── profile.d/llama.sh            # PATH setup
└── ld.so.conf.d/llama.conf       # Library path

That’s it. You have llama.cpp compiled for Blackwell, installed with versioning, and available system-wide. Go download a model and start generating tokens.
