DGX Spark Qwen3-Coder-Next Quantization
This is a follow-up to Compile and Install llama.cpp on the DGX Spark Founders Edition, which covers building and installing llama.cpp with versioned paths. This post assumes you've done that, so llama-quantize, llama-imatrix, and llama-cli are in your PATH. The goal: take a model from Hugging Face in SafeTensors format, convert it to GGUF, quantize it, and run it on the DGX Spark with its 128 GB of unified memory.
The Pipeline
The process is always the same, regardless of the model:
- Download the HF model (SafeTensors + config + tokenizer)
- Convert to GGUF at FP16 or BF16
- Quantize to the target precision
- Optionally: compute an importance matrix first for better quality at low bit-widths
- Test the inference
Steps 1–3 are mandatory. Step 4 is optional and not covered in this post.
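The three mandatory steps chain together naturally, so they can be sketched as a small driver script. This is just an illustration of the flow, not a tool: the paths and the `plan`/`run_pipeline` helpers are mine, and it assumes the llama.cpp binaries from the previous post are on your PATH.

```python
import subprocess
from pathlib import Path

MODELS = Path.home() / "models"

def plan(repo_id: str, quant: str = "Q8_0") -> list[list[str]]:
    """Build the three mandatory steps (download, convert, quantize) as argv lists."""
    name = repo_id.split("/")[-1]
    src = MODELS / name
    bf16 = MODELS / f"{name}-BF16.gguf"
    out = MODELS / f"{name}-{quant}.gguf"
    return [
        ["hf", "download", repo_id, "--local-dir", str(src)],
        ["python", "convert_hf_to_gguf.py", str(src),
         "--outtype", "bf16", "--outfile", str(bf16)],
        ["llama-quantize", str(bf16), str(out), quant],
    ]

def run_pipeline(repo_id: str, quant: str) -> None:
    """Execute each step in order, failing fast if one errors."""
    for step in plan(repo_id, quant):
        subprocess.run(step, check=True)

# Example (not run here): run_pipeline("Qwen/Qwen3-Coder-Next", "MXFP4_MOE")
```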
Setup
These steps need a Python venv with the llama.cpp conversion dependencies. I use uv, but python3-venv or any other alternative works just as well.
```bash
cd ~/llama.cpp

uv venv --seed
uv pip install -r requirements.txt
uv pip install -U huggingface_hub
uv pip install transformers
uv pip install torch
```

Create a working directory:

```bash
mkdir -p ~/models
```

Step 1: Download the model
Download the model from Hugging Face:

```bash
uv run hf download Qwen/Qwen3-Coder-Next \
  --local-dir ~/models/Qwen3-Coder-Next \
  --include "*.safetensors" "*.json" "*.txt" "*.model"
```

The --include filters avoid downloading unnecessary files (optimizer states, training artifacts). You want the SafeTensors weights, the config, and the tokenizer files. That's it.
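Besides eyeballing the directory, you can verify that a sharded download is complete programmatically: every shard named in model.safetensors.index.json should exist on disk. A minimal sketch (the `missing_shards` helper is mine, but the `weight_map` key is standard in safetensors index files):

```python
import json
from pathlib import Path

def missing_shards(model_dir: str) -> list[str]:
    """Return shard filenames listed in the safetensors index but absent on disk."""
    root = Path(model_dir).expanduser()
    index = json.loads((root / "model.safetensors.index.json").read_text())
    shards = sorted(set(index["weight_map"].values()))
    return [s for s in shards if not (root / s).exists()]

# An empty list means all 40 shards of the download above are present:
# missing_shards("~/models/Qwen3-Coder-Next")
```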
Verify the download (I skipped --include in my own run, so a few extra repo files show up):
```
mauromedda@spark:~/hack/llama.cpp$ ls ~/models/Qwen3-Coder-Next
chat_template.jinja model-00006-of-00040.safetensors model-00015-of-00040.safetensors model-00024-of-00040.safetensors model-00033-of-00040.safetensors qwen3_coder_detector_sgl.py
config.json model-00007-of-00040.safetensors model-00016-of-00040.safetensors model-00025-of-00040.safetensors model-00034-of-00040.safetensors qwen3coder_tool_parser_vllm.py
generation_config.json model-00008-of-00040.safetensors model-00017-of-00040.safetensors model-00026-of-00040.safetensors model-00035-of-00040.safetensors README.md
merges.txt model-00009-of-00040.safetensors model-00018-of-00040.safetensors model-00027-of-00040.safetensors model-00036-of-00040.safetensors tokenizer_config.json
model-00001-of-00040.safetensors model-00010-of-00040.safetensors model-00019-of-00040.safetensors model-00028-of-00040.safetensors model-00037-of-00040.safetensors tokenizer.json
model-00002-of-00040.safetensors model-00011-of-00040.safetensors model-00020-of-00040.safetensors model-00029-of-00040.safetensors model-00038-of-00040.safetensors vocab.json
model-00003-of-00040.safetensors model-00012-of-00040.safetensors model-00021-of-00040.safetensors model-00030-of-00040.safetensors model-00039-of-00040.safetensors
model-00004-of-00040.safetensors model-00013-of-00040.safetensors model-00022-of-00040.safetensors model-00031-of-00040.safetensors model-00040-of-00040.safetensors
model-00005-of-00040.safetensors model-00014-of-00040.safetensors model-00023-of-00040.safetensors model-00032-of-00040.safetensors model.safetensors.index.json
```

Step 2: Convert from SafeTensors to GGUF
For the conversion I add an intermediate step: instead of going straight to quantization, I first generate a full-precision GGUF in BF16. Since the model was trained in bfloat16, converting to BF16 preserves the training precision exactly, while converting to FP16 would risk clipping large-magnitude weights.
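The reason BF16 is the safer intermediate format: bfloat16 keeps float32's 8 exponent bits (trading mantissa precision), so any weight magnitude the training run produced survives, while IEEE half precision tops out at 65504. A quick illustration, simulating both 16-bit formats in pure Python with the struct module:

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float32 to bfloat16 (drop the low 16 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; out-of-range raises OverflowError."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

big = 1.0e20                 # extreme, but representable in fp32 and bf16
print(to_bf16(big))          # finite: bf16 shares float32's exponent range
try:
    to_fp16(big)
except OverflowError:
    print("fp16 overflow")   # half precision maxes out at 65504
```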
```bash
uv run convert_hf_to_gguf.py ~/models/Qwen3-Coder-Next --outtype bf16 \
  --outfile ~/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf
-x-
INFO:gguf.gguf_writer:/home/mauromedda/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf: n_tensors = 843, total_size = 159.5G
Writing: 100%|██████████| 159G/159G [12:07<00:00, 219Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/mauromedda/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf
```

DGX Spark Decision Matrix
The Spark has 128 GB of unified memory shared between the GB10 Blackwell GPU and the Grace CPU. After accounting for the OS, KV cache, and llama.cpp runtime overhead, you have roughly 110–120 GB available for the model and its context window.
KV cache estimation
At FP16, each token in the KV cache uses 2 × n_layers × n_kv_heads × head_dim × 2 bytes (keys and values, 2 bytes per element). For a 70B Llama-class model with 80 layers, 8 KV heads, and 128-dim heads, that's 2 × 80 × 8 × 128 × 2 ≈ 320 KiB per token. So 32K context ≈ 10 GB, and 128K context ≈ 40 GB. Quantizing the cache (--cache-type-k q4_1, plus --cache-type-v q4_1 if you run with flash attention) cuts those numbers by roughly 3×.
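The formula is easy to script. A small helper, with the 80-layer / 8-KV-head / 128-dim shape of a 70B Llama-class model taken as the working assumption:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """FP16 KV cache size: K and V vectors, one per layer per KV head per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# 70B Llama-class (assumed shape): 80 layers, 8 KV heads (GQA), 128-dim heads
print(kv_cache_bytes(1, 80, 8, 128))                    # 327680 bytes ≈ 320 KiB/token
print(kv_cache_bytes(128 * 1024, 80, 8, 128) / 2**30)   # 40.0 GiB at 128K context
```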
Model Size → Recommended Quant
| Model Class | Parameters | BF16 Size | Recommended Quant | File Size | Headroom |
|---|---|---|---|---|---|
| Small | 1B–8B | 2–16 GB | Q8_0 | 1–8.5 GB | 110+ GB |
| Medium | 12B–32B | 24–64 GB | Q8_0 or Q6_K | 12–27 GB | 90+ GB |
| Large | 70B–72B | 140–144 GB | Q6_K or Q5_K_M | 57–49 GB | 60–70 GB |
| Very Large | 100B–120B | 200–240 GB | Q4_K_M | 58–70 GB | 50–60 GB |
| Massive | 200B+ | 400+ GB | Q3_K_M or IQ3_M + imatrix | 78–90 GB | 30–40 GB |
| MoE (80B total/3B active) | 80B | ~160 GB | Q8_0, MXFP4_MOE, or UD-Q8_K_XL | 41–85 GB | 35–79 GB |
| MoE (480B total/35B active) | 480B | ~960 GB | UD-Q2_K_XL | ~180 GB | Needs offload |
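The file sizes in the table follow from the approximate effective bits-per-weight of each llama.cpp quant type. A rough estimator; the BPW figures are community ballpark values (block scales included), not exact:

```python
# Approximate effective bits per weight, including scale/block overhead.
BPW = {
    "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.5,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "IQ3_M": 3.7,
}

def gguf_size_gb(n_params_b: float, quant: str) -> float:
    """Estimated GGUF file size in GB for n_params_b billion parameters."""
    return n_params_b * 1e9 * BPW[quant] / 8 / 1e9

print(round(gguf_size_gb(70, "Q6_K"), 1))   # ~57 GB, matching the table's 70B row
```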
The Golden Rules for the Spark
Rule 1: Use the highest quant that fits. Memory is your most abundant resource on the Spark. Don’t use Q4_K_M on an 8B model when Q8_0 fits with 110 GB to spare.
Rule 2: Account for KV cache. The model file size is not your total memory usage; long context windows eat memory. A 70B model at Q6_K (57 GB) with 128K context needs roughly 40 GB more for an FP16 KV cache, ~97 GB total: it still fits, but without much slack. A 120B model at Q4_K_M (70 GB) with the same context pushes past the limit. Use --cache-type-k q4_1 to quantize the KV cache when memory is tight.
Rule 3: Prefer UD variants when available. Check unsloth/{model}-GGUF on Hugging Face before quantizing yourself. The dynamic quantization is consistently better than what you’d produce with standard llama-quantize.
Rule 4: Use imatrix for anything below Q4. Computing the importance matrix takes 30–60 minutes but the quality improvement at Q3 and below is dramatic. It’s the difference between usable and unusable.
Rule 5: For MoE models, the math is different. Total parameter count determines file size, but active parameter count determines inference speed. A Qwen3-Coder-Next at Q8_0 (85 GB) runs like a 3B model in terms of speed because only 3B parameters are active per token. The remaining 77B of expert weights just sit in memory waiting to be selected.
Rule 6: MXFP4_MOE over Q8_0 for MoE when you need the headroom. MXFP4_MOE applies microscaling FP4 to the expert weights while keeping the shared attention layers at higher precision. For a model like Qwen3-Coder-Next, that cuts the file from ~85 GB (Q8_0) to ~41 GB: half the memory, and the quality loss concentrates on the sparsely-activated experts rather than the always-on attention path. Use Q8_0 when the model fits comfortably; use MXFP4_MOE when you want to reclaim memory for longer context windows or run a second model alongside.
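Rules 1, 2, and 6 boil down to one budget check: model file plus KV cache plus runtime overhead must stay under the Spark's usable memory. A sketch; the 110 GB usable figure and the 5 GB overhead are my assumptions (conservative ends of the estimates above):

```python
SPARK_USABLE_GB = 110.0    # assumption: low end of the 110-120 GB estimate
RUNTIME_OVERHEAD_GB = 5.0  # assumption: llama.cpp buffers, scratch, OS slack

def fits_on_spark(model_gb: float, kv_cache_gb: float) -> bool:
    """True if weights + KV cache + overhead fit in usable unified memory."""
    return model_gb + kv_cache_gb + RUNTIME_OVERHEAD_GB <= SPARK_USABLE_GB

# 70B @ Q6_K (~57 GB) with a 128K FP16 KV cache (~40 GB): tight but fits
print(fits_on_spark(57, 40))   # True
# 120B @ Q4_K_M (~70 GB) with the same cache: over budget
print(fits_on_spark(70, 40))   # False
```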
Step 3: Quantize
Simple quantization (No calibration)
On the Spark you rarely need to go below Q5_K_M, since almost anything up to ~120B parameters fits. A simple rule of thumb for modern models: Q8_0 as the default when it fits, MXFP4_MOE for MoE models.
```bash
llama-quantize ~/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf \
  ~/models/Qwen3-Coder-Next-80B-A3B-MXFP4_MOE.gguf MXFP4_MOE $(nproc)  # uses all CPU cores
-x-
llama_model_quantize_impl: model size = 152065.50 MiB
llama_model_quantize_impl: quant size = 41709.37 MiB

main: quantize time = 152259.71 ms
main: total time = 152259.71 ms
```

Step 4 (optional): Compute an importance matrix for better quality at low bit-widths
Not covered: I don't need it on my Spark right now, since I'm not quantizing below 4 bits.
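As a sanity check on the quantize output from Step 3: 41,709 MiB for a nominal 80B total parameters works out to about 4.4 effective bits per weight, which is plausible for FP4 experts plus higher-precision shared layers. The arithmetic (the 80B parameter count is the nominal figure, an assumption):

```python
quant_mib = 41709.37   # "quant size" reported by llama-quantize in Step 3
n_params = 80e9        # nominal total parameter count (assumption)

bits_per_weight = quant_mib * 1024**2 * 8 / n_params
print(round(bits_per_weight, 2))   # ~4.37 effective bits per weight
```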
Step 5: Test the inference
Now that the model is converted and quantized, let's run it in conversational mode with the parameters Qwen itself suggests.
From Qwen's best practices: for optimal performance, they recommend the sampling parameters temperature=1.0, top_p=0.95, top_k=40.
```bash
llama-cli -m ~/models/Qwen3-Coder-Next-80B-A3B-MXFP4_MOE.gguf \
  --fit on \
  --seed 3407 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --top-k 40 \
  --jinja
```

Enjoy your toy.