Mauro Medda

DGX Spark Qwen3-coder-next quantization

This is a follow-up to Compile and Install llama.cpp on the DGX Spark Founders Edition, which covers building and installing llama.cpp with versioned paths. This post assumes you’ve done that, and that llama-quantize, llama-imatrix, and llama-cli are in your PATH. The goal: take a model from Hugging Face in SafeTensors format, convert it to GGUF, quantize it, and run it on the DGX Spark with its 128 GB of unified memory.

The Pipeline

The process is always the same, regardless of the model:

  1. Download the HF model (SafeTensors + config + tokenizer)
  2. Convert to GGUF at FP16 or BF16
  3. Quantize to the target precision
  4. Optionally: compute an importance matrix first for better quality at low bit-widths
  5. Test the inference

Steps 1–3 and 5 are mandatory; step 4 is optional and not covered here.

Setup

To perform those steps you need a Python venv with the llama.cpp conversion dependencies. I use uv, but python3-venv or any other alternative works too.

cd ~/llama.cpp

uv venv --seed
uv pip install -r requirements.txt
uv pip install -U huggingface_hub
uv pip install transformers
uv pip install torch

Create a working directory

mkdir -p ~/models

Step 1: Download the model

Download the model from Hugging Face.

uv run hf download Qwen/Qwen3-Coder-Next \
  --local-dir ~/models/Qwen3-Coder-Next \
  --include "*.safetensors" "*.json" "*.txt" "*.model"

The --include filters avoid downloading unnecessary files (optimizer states, training artifacts). You want the SafeTensors weights, the config, and the tokenizer files. That’s it.

Verify the download (I skipped the --include filter in my run, which is why a few extra files came down):

mauromedda@spark:~/hack/llama.cpp$ ls ~/models/Qwen3-Coder-Next
chat_template.jinja               model-00006-of-00040.safetensors  model-00015-of-00040.safetensors  model-00024-of-00040.safetensors  model-00033-of-00040.safetensors  qwen3_coder_detector_sgl.py
config.json                       model-00007-of-00040.safetensors  model-00016-of-00040.safetensors  model-00025-of-00040.safetensors  model-00034-of-00040.safetensors  qwen3coder_tool_parser_vllm.py
generation_config.json            model-00008-of-00040.safetensors  model-00017-of-00040.safetensors  model-00026-of-00040.safetensors  model-00035-of-00040.safetensors  README.md
merges.txt                        model-00009-of-00040.safetensors  model-00018-of-00040.safetensors  model-00027-of-00040.safetensors  model-00036-of-00040.safetensors  tokenizer_config.json
model-00001-of-00040.safetensors  model-00010-of-00040.safetensors  model-00019-of-00040.safetensors  model-00028-of-00040.safetensors  model-00037-of-00040.safetensors  tokenizer.json
model-00002-of-00040.safetensors  model-00011-of-00040.safetensors  model-00020-of-00040.safetensors  model-00029-of-00040.safetensors  model-00038-of-00040.safetensors  vocab.json
model-00003-of-00040.safetensors  model-00012-of-00040.safetensors  model-00021-of-00040.safetensors  model-00030-of-00040.safetensors  model-00039-of-00040.safetensors
model-00004-of-00040.safetensors  model-00013-of-00040.safetensors  model-00022-of-00040.safetensors  model-00031-of-00040.safetensors  model-00040-of-00040.safetensors
model-00005-of-00040.safetensors  model-00014-of-00040.safetensors  model-00023-of-00040.safetensors  model-00032-of-00040.safetensors  model.safetensors.index.json

Step 2: Convert from SafeTensors to GGUF

For the conversion I add an intermediate step: rather than going straight to quantization, I first generate a BF16 GGUF. Since the model was trained in bfloat16, converting to BF16 retains the training precision.

uv run convert_hf_to_gguf.py ~/models/Qwen3-Coder-Next --outtype bf16 --outfile ~/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf
-x-
INFO:gguf.gguf_writer:/home/mauromedda/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf: n_tensors = 843, total_size = 159.5G
Writing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159G/159G [12:07<00:00, 219Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/mauromedda/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf
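As a sanity check, the reported 159.5 G lines up with a back-of-envelope estimate (assuming ~80B parameters at 2 bytes per weight for BF16):

```shell
# Rough expected BF16 file size: params * 2 bytes/weight (assumption: ~80B params)
awk 'BEGIN { printf "expected: ~%.0f GB\n", 80e9 * 2 / 1e9 }'
# prints: expected: ~160 GB
```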

DGX Spark Decision Matrix

The Spark has 128 GB of unified memory shared between the GB10 Blackwell GPU and the Grace CPU. After accounting for the OS, KV cache, and llama.cpp runtime overhead, you have roughly 110–120 GB available for the model and its context window.

KV cache estimation

At FP16, each token in the KV cache uses 2 × n_layers × n_kv_heads × head_dim × 2 bytes. For a 70B Llama-class model with 80 layers, 8 KV heads, and 128-dim heads, that’s ~320 KB per token, or ~320 MB per 1K tokens. So 32K context ≈ 10 GB, 128K context ≈ 40 GB. With KV cache quantization (--cache-type-k q4_1), the quantized part drops by roughly 3×.
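The formula is easy to sketch in shell. The geometry below is the 70B-class example from this section, not Qwen3-Coder-Next’s actual config:

```shell
# KV cache at FP16: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes per value
layers=80; kv_heads=8; head_dim=128; bytes_per_val=2
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_val ))
echo "per token: ${per_token} bytes"                          # 327680 ≈ 320 KB
echo "128K ctx:  $(( per_token * 131072 / 1073741824 )) GiB"  # ~40 GiB
```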

| Model Class                   | Parameters | BF16 Size  | Recommended Quant              | File Size  | Headroom      |
|-------------------------------|------------|------------|--------------------------------|------------|---------------|
| Small                         | 1B–8B      | 2–16 GB    | Q8_0                           | 1–8.5 GB   | 110+ GB       |
| Medium                        | 12B–32B    | 24–64 GB   | Q8_0 or Q6_K                   | 12–27 GB   | 90+ GB        |
| Large                         | 70B–72B    | 140–144 GB | Q6_K or Q5_K_M                 | 57–49 GB   | 60–70 GB      |
| Very Large                    | 100B–120B  | 200–240 GB | Q4_K_M                         | 58–70 GB   | 50–60 GB      |
| Massive                       | 200B+      | 400+ GB    | Q3_K_M or IQ3_M + imatrix      | 78–90 GB   | 30–40 GB      |
| MoE (80B total / 3B active)   | 80B        | ~160 GB    | Q8_0, MXFP4_MOE, or UD-Q8_K_XL | 41–85 GB   | 35–79 GB      |
| MoE (480B total / 35B active) | 480B       | ~960 GB    | UD-Q2_K_XL                     | ~180 GB    | Needs offload |

The Golden Rules for the Spark

Rule 1: Use the highest quant that fits. Memory is your most abundant resource on the Spark. Don’t use Q4_K_M on an 8B model when Q8_0 fits with 110 GB to spare.

Rule 2: Account for KV cache. The model file size is not your total memory usage. Long context windows eat memory: a 70B model at Q6_K (57 GB) with 128K context needs roughly 40 GB more for the FP16 KV cache, ~97 GB total, which still fits. But a 120B model at Q4_K_M (70 GB) with 128K context pushes past the limit. Use --cache-type-k q4_1 to quantize the KV cache when memory is tight.
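Rule 2 reduces to a quick budget check. The sizes below are the post’s ballpark figures (57 GB Q6_K file, ~40 GB FP16 KV cache at 128K, ~110 GB usable), not measured values:

```shell
# Does model file + KV cache fit in the usable memory budget?
model_gib=57; kv_gib=40; budget_gib=110
total=$(( model_gib + kv_gib ))
if [ "$total" -le "$budget_gib" ]; then
  echo "fits: ${total} GiB of ${budget_gib} GiB"
else
  echo "over budget: ${total} GiB"
fi
# prints: fits: 97 GiB of 110 GiB
```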

Rule 3: Prefer UD variants when available. Check unsloth/{model}-GGUF on Hugging Face before quantizing yourself. The dynamic quantization is consistently better than what you’d produce with standard llama-quantize.

Rule 4: Use imatrix for anything below Q4. Computing the importance matrix takes 30–60 minutes but the quality improvement at Q3 and below is dramatic. It’s the difference between usable and unusable.

Rule 5: For MoE models, the math is different. Total parameter count determines file size, but active parameter count determines inference speed. A Qwen3-Coder-Next at Q8_0 (85 GB) runs like a 3B model in terms of speed because only 3B parameters are active per token. The remaining 77B of expert weights just sit in memory waiting to be selected.
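A rough way to see why active parameters dominate speed: token generation is usually memory-bandwidth-bound, so decode speed is capped by how many bytes of weights must be read per token. Assuming the Spark’s published ~273 GB/s unified-memory bandwidth and ~3B active parameters at roughly 1 byte/weight (Q8_0), and ignoring KV cache reads and compute, a crude upper bound is:

```shell
# Roofline sketch: tokens/s <= bandwidth / bytes_read_per_token.
# All figures are rough assumptions, not measurements.
awk 'BEGIN { bw_gb_s = 273; active_gb = 3; printf "~%d tok/s upper bound\n", bw_gb_s / active_gb }'
# prints: ~91 tok/s upper bound
```

A dense 80B model at the same precision would need ~80 GB read per token instead of ~3 GB, which is the whole point of Rule 5.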

Rule 6: MXFP4_MOE over Q8_0 for MoE when you need the headroom. MXFP4_MOE applies microscaling FP4 to the expert weights while keeping the shared attention layers at higher precision. For a model like Qwen3-Coder-Next, that cuts the file from ~85 GB (Q8_0) to ~41 GB: half the memory, and the quality loss concentrates on the sparsely-activated experts rather than the always-on attention path. Use Q8_0 when the model fits comfortably; use MXFP4_MOE when you want to reclaim memory for longer context windows or run a second model alongside.
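The ~85 GB vs ~41 GB figures line up with a simple bits-per-weight estimate. Q8_0 is ~8.5 effective bits per weight and MXFP4 ~4.25; treating all 80B weights at the expert precision is an approximation, since the attention path stays at higher precision:

```shell
# Back-of-envelope: file size ≈ params * bits_per_weight / 8
awk 'BEGIN {
  p = 80e9
  printf "Q8_0:  ~%.1f GB\n", p * 8.5  / 8 / 1e9
  printf "MXFP4: ~%.1f GB\n", p * 4.25 / 8 / 1e9
}'
```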

Step 3: Quantize

Simple quantization (No calibration)

On the Spark you rarely need to go below Q5_K_M, since almost any model up to ~120B fits. As a simple rule of thumb for modern models: Q8_0 as the default when it fits, and MXFP4_MOE for MoE models.

llama-quantize ~/models/Qwen3-Coder-Next-80B-A3B-BF16.gguf ~/models/Qwen3-Coder-Next-80B-A3B-MXFP4_MOE.gguf MXFP4_MOE $(nproc) # uses all the CPU cores
-x-
llama_model_quantize_impl: model size  = 152065.50 MiB
llama_model_quantize_impl: quant size  = 41709.37 MiB

main: quantize time = 152259.71 ms
main:    total time = 152259.71 ms
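The two llama_model_quantize_impl lines give the effective compression ratio:

```shell
# Ratio of the BF16 source size to the MXFP4_MOE output size (values from the log)
awk 'BEGIN { printf "%.2fx smaller\n", 152065.50 / 41709.37 }'
# prints: 3.65x smaller
```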

Step 4 (optional): Compute an importance matrix for better quality at low bit-widths

Not covered here; I don’t need it on the Spark right now, since I’m not quantizing below Q4.

Step 5: Test the inference

Now that the model is converted, let’s run it in conversational mode with the parameters suggested by Qwen itself.

From Qwen’s Best Practices: “To achieve optimal performance, we recommend the following sampling parameters: temperature=1.0, top_p=0.95, top_k=40.”

llama-cli -m ~/models/Qwen3-Coder-Next-80B-A3B-MXFP4_MOE.gguf \
  --fit on \
  --seed 3407 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --top-k 40 \
  --jinja

– enjoy your toy
