datavorous

npus: what, and where they break

a middle ground between research papers and marketing fluff

jun 2026 · datavorous

there is a distinct lack of articles on the internet that explain why we don't get the desired performance from npus after model deployment. there are either research papers or marketing fluff.

this document serves as a middle ground, and explains the hardware underneath in a simplified way, to help diagnose bottlenecks faced during model inference on edge hardware.

most graph fragmentation problems are caused by a small set of well defined hardware constraints, and can be avoided or bypassed. hence, we will focus on the categories of operation the silicon refuses, why, and where in our model-export pipeline those refusals originate.

what an npu actually is

an npu is a chip block designed to run neural network math, mostly matrix multiplies and a small set of element-wise ops, at a very high arithmetic/watt rate but at the cost of losing generality. unlike a cpu or gpu, which fetch and decode instructions every cycle, an npu is configured once per subgraph. a specialized compiler emits a binary that wires the chip's compute blocks (matrix unit, vector unit, on-chip sram, dma engines) into a fixed dataflow for your specific computation, and then the data streams through it with almost no per-cycle control overhead. the cost of this efficiency, as stated earlier, is that anything outside the supported set falls off a cliff.

the architectural family is well described in the recent paper scaling llm test-time compute with mobile npu on smartphones (hao et al., eurosys '26), which calls out the standard pattern:

npu architectural diagram

qualcomm hexagon will be used repeatedly as the concrete example in this document (it's the most widely deployed mobile npu and has the most accessible sdk).

hexagon npu architecture - figure 3 from hao et al. eurosys 26
refer to figure 3 of the paper linked above

how an npu executes a model

we first need a working picture of what happens when we hand a model to an npu.

compile time

when we point a tool like qairt (for qualcomm), vitis ai (for amd xdna), or coreml's compiler at our model, it doesn't produce a stream of instructions in the cpu/gpu sense. instead, it produces a configuration binary: a description of how to wire the chip's compute blocks together for one specific computation. this is the single most important conceptual difference from a gpu compiler.

compile time pipeline: source model to configuration binary

we construct a computation graph from the given source (a .tflite flatbuffer, an onnx model, etc.), then walk its nodes. for every node, we check two gates: do i have a builder for this op? and does this specific config validate? the result is a boolean mask over the graph.
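a minimal sketch of the two gates, under loud assumptions: the op list, the `BUILDERS` table and its dtype sets are all made up for illustration and do not reflect any real compiler's support matrix.

```python
# hypothetical op descriptions; a real compiler reads these from the model file
ops = [
    {"name": "conv1", "type": "conv2d", "dtype": "int8"},
    {"name": "act1", "type": "relu", "dtype": "int8"},
    {"name": "topk", "type": "top_k", "dtype": "int8"},   # no builder at all
    {"name": "fc1", "type": "matmul", "dtype": "fp64"},   # builder exists, config is bad
    {"name": "fc2", "type": "matmul", "dtype": "fp16"},
]

# made-up table: op type -> dtypes its builder accepts
BUILDERS = {
    "conv2d": {"int8", "fp16"},
    "relu": {"int8", "fp16"},
    "matmul": {"int8", "fp16"},
}

def accepted(op):
    allowed = BUILDERS.get(op["type"])
    if allowed is None:
        return False                 # gate 1: no builder for this op type
    return op["dtype"] in allowed    # gate 2: does this specific config validate?

mask = [accepted(op) for op in ops]
print(mask)
```

the two rejections here come from different gates, which matters when debugging: a missing builder needs a different op, while a failed validation can often be fixed by changing a dtype or shape.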

then contiguous runs of accepted ops get extracted as subgraphs. everything else stays in the host graph as fallback ops. each accepted subgraph becomes its own independent compile target. now within a subgraph, the compiler finds sequences it can wire together without intermediate memory round trips. for example, matmul -> bias -> activation -> layernorm. the matrix unit's output can feed directly into the vector unit's input, the vector unit chains its own stages, and the whole sequence becomes a single configuration rather than four.
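the run extraction can be sketched as follows. this assumes the graph has already been flattened into a topologically ordered list, which real graphs (being dags with branches) do not always allow so cleanly; `partition` and the op names are hypothetical.

```python
from itertools import groupby

def partition(names, mask):
    """Split an ordered op list into (target, [op names]) runs."""
    parts = []
    for on_npu, run in groupby(zip(names, mask), key=lambda pair: pair[1]):
        target = "npu" if on_npu else "cpu"
        parts.append((target, [name for name, _ in run]))
    return parts

names = ["conv1", "act1", "topk", "fc1", "fc2"]
mask = [True, True, False, False, True]   # True = accepted by the npu compiler

parts = partition(names, mask)
for target, run in parts:
    print(target, run)
```

two rejected ops in the middle are enough to turn one npu subgraph into two, with a cpu hop in between.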

the on-chip sram is quite small (single digit megabytes on hexagon's vtcm, similar elsewhere), whereas a 4096x4096 fp16 weight tensor is 32 mb on its own, far too large to fit. hence the compiler has to slice the operation into tiles that fit in the scratchpad alongside their corresponding activation slices, schedule dmas to bring the next tile in while the current tile computes, and decide which intermediates stay on-chip and which spill to ddr.
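the arithmetic behind that paragraph, as a rough sketch. the 8 mib scratchpad, the 50% share reserved for weights, and the double buffering factor are assumptions picked for illustration, not hexagon specifications.

```python
import math

BYTES_FP16 = 2
MIB = 1024 * 1024

def weight_bytes(rows, cols, elem_bytes=BYTES_FP16):
    return rows * cols * elem_bytes

def tiles_needed(total_bytes, scratchpad_bytes, weight_share=0.5, buffers=2):
    """Number of weight tiles when only part of the scratchpad holds weights.

    weight_share: assumed fraction of the scratchpad reserved for weights
    buffers: 2 models double buffering (one tile computing, one arriving)
    """
    tile_budget = int(scratchpad_bytes * weight_share) // buffers
    return math.ceil(total_bytes / tile_budget), tile_budget

total = weight_bytes(4096, 4096)
n_tiles, tile_budget = tiles_needed(total, 8 * MIB)
print(total // MIB, "mib of weights,", n_tiles, "tiles of", tile_budget // MIB, "mib")
```

under these assumptions a single layer already needs 16 dma transfers, and every one of them has to be hidden behind compute to avoid stalls.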

now the compiler writes out the final binary which dictates the wiring map for the chip's compute fabric, the dma schedule, the scalar controller's sequencing program, weights pre-arranged into hardware native layouts (because the systolic array consumes data in a specific order; the eurosys paper has a nice illustration of hexagon's "every two rows are permuted" tile format), and the quantization parameters needed for runtime scaling.

runtime

when inference starts, the scalar core (a small risc style controller) reads the configuration binary and writes it into the chip's control registers. the fabric in turn contains thousands of multiplexers that determine which compute block's output feeds which block's input. loading the configuration sets every one of them. after this step, the chip is electrically wired to be your specific subgraph. it has become a circuit that computes your subgraph and is not running a program for your subgraph.

next the dma engines start moving the first weight tiles from ddr into the on-chip scratchpad, on the schedule the compiler baked into the configuration. the scalar core kicks these off and then goes back to idling. after this, the matrix unit starts consuming weight and activation tiles from the scratchpad. on the hexagon hmx, this is a 32x32 systolic array (a grid of multiply accumulate cells where data flows from cell to cell in a regular rhythm. the partial sums accumulate diagonally, and the whole grid advances one step per clock).
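to make the rhythm concrete, here is a small cycle-by-cycle simulation of a textbook output-stationary systolic array. this is an assumed dataflow chosen because it is easy to follow; it is not a claim about how the hmx moves data internally.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an M x N grid of multiply-accumulate cells."""
    M, K, N = len(A), len(B), len(B[0])
    a_reg = [[0] * N for _ in range(M)]   # value each cell passes to the right
    b_reg = [[0] * N for _ in range(M)]   # value each cell passes downward
    acc = [[0] * N for _ in range(M)]     # each cell owns one output element
    cycles = M + N + K - 2
    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:                 # left edge: row i enters, delayed by i
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:                      # interior: take the left neighbour's value
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                 # top edge: column j enters, delayed by j
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:                      # interior: take the upper neighbour's value
                    new_b[i][j] = b_reg[i - 1][j]
                acc[i][j] += new_a[i][j] * new_b[i][j]
        a_reg, b_reg = new_a, new_b
    return acc, cycles

C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)
```

no cell ever addresses memory: operands arrive from neighbours, which is why the array needs data fed in a fixed order and layout.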

a great visualization tool to simulate 4 macs working together: williampan systolic array demo

a diagram from systolic arrays (for vlsi) by h. t. kung and charles e. leiserson, which clearly shows the working of the AB + C op:

systolic array AB+C operation diagram - kung and leiserson

additionally, the qualcomm hexagon v81 hmx programmer's reference manual has a section dedicated to explaining the inner workings of the hmx:

qualcomm hexagon v81 hmx inner working diagram

now, in parallel the vector unit may be running its own pipeline stage on the matrix unit's outputs (activation functions, normalization, scaling etc) and dma engines are simultaneously bringing in the next weight tile while the current one finishes. the compiler scheduled all of this so the matrix unit rarely stalls for data. when it does stall, you've lost a lot of throughput, which is why dma scheduling is such a big deal at compile time.
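a toy timeline model shows why the overlap matters. the per-tile dma and compute times below are invented numbers, and the model assumes exactly one tile in flight while one computes.

```python
def serial_time(n_tiles, dma_us, compute_us):
    """No overlap: every tile is fetched, then computed."""
    return n_tiles * (dma_us + compute_us)

def overlapped_time(n_tiles, dma_us, compute_us):
    """Double buffered: fetch tile i+1 while tile i computes."""
    # the first fetch cannot be hidden, and the last tile has only compute left;
    # in between, each step costs whichever of the two is slower
    return dma_us + (n_tiles - 1) * max(dma_us, compute_us) + compute_us

n = 16
print(serial_time(n, 10, 12), overlapped_time(n, 10, 12))   # dma hidden behind compute
print(serial_time(n, 20, 12), overlapped_time(n, 20, 12))   # dma slower: matrix unit stalls
```

in the second case the matrix unit waits 8 us per tile for data, so the schedule is bound by memory traffic rather than by arithmetic.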

when the current subgraph finishes, results go back to ddr. if the next subgraph is also on the npu, the scalar core loads its configuration and the cycle repeats.

if the next op is a cpu fallback, the runtime hands control back to the host, which copies the boundary tensors across, runs the fallback op on the cpu, and copies the result back when it's time to enter the next npu partition.

this last bit of the entire process - the cost of switching between partitions - is why fragmentation hurts so much. each npu partition requires loading a fresh configuration, each cpu fallback requires crossing the soc coherency boundary in both directions, and each boundary often requires a layout conversion because the npu stored tensors in some hardware native permuted layout and the cpu expects them in plain row-major. none of this scales down with partition size; a tiny one-op partition pays the same switching cost as a giant one.
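a back-of-the-envelope cost model for that claim. all timings are invented, and the model lumps configuration load, boundary copies and layout conversion into one assumed fixed `switch_us` per boundary.

```python
def total_latency_us(partitions, switch_us):
    """partitions: list of (target, compute_us); switch_us: fixed cost per boundary."""
    compute = sum(cost for _, cost in partitions)
    boundaries = len(partitions) - 1
    return compute + boundaries * switch_us

whole = [("npu", 1000)]
fragmented = [("npu", 500), ("cpu", 20), ("npu", 480)]   # one tiny fallback op in the middle

print(total_latency_us(whole, switch_us=300))
print(total_latency_us(fragmented, switch_us=300))
```

the compute is the same 1000 us in both cases; the fallback op itself costs 20 us, but the two boundaries it creates cost 600 us.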

why this architecture is so much more efficient

worth understanding the why, briefly.

on a cpu, a significant fraction of the energy spent during inference goes into instruction fetch, decode, branch prediction, register renaming, and cache management (none of which produce any answers) - they just exist to support the general purpose execution model.

but on an npu, those decisions were made at compile time and baked into the configuration. qualcomm's developer documentation cites roughly 90% of dynamic power going to actual arithmetic on hexagon, versus a much smaller fraction on a general-purpose core. (anyway, take the vendor numbers with a pinch of salt.)

the trade-off, of course, is that the npu can only run computations the compiler chose to support, in shapes and dtypes the hardware accepts, on the configurations the compiler emitted. everything else falls off the cliff. that cliff is what the rest of this page will be about.


a hexagon hmx or an xdna aie tile is a systolic array of multipliers wired for a specific small set of dtype combinations. if your matmul's dtypes are in the set, it runs on the matrix unit at full throughput. if they are not, it doesn't, and it won't.

the vector unit (hvx) is far more flexible, as it's basically a wide simd engine. but it is, according to the same paper we referred to earlier, roughly 300x slower than the hmx. hence if our matmul falls back from hmx to hvx, we effectively lose the performance needed for any serious inference work.
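to see how quickly a partial fallback dominates, a simple amdahl-style estimate helps. it takes the roughly 300x figure at face value and assumes the rest of the work runs at full hmx speed; `relative_time` is a hypothetical helper.

```python
def relative_time(fallback_fraction, slowdown=300.0):
    """Runtime relative to all-hmx execution when part of the matmul work runs on hvx."""
    return (1.0 - fallback_fraction) + fallback_fraction * slowdown

for fraction in (0.0, 0.01, 0.1):
    print(fraction, round(relative_time(fraction), 2))
```

under these assumptions, moving just 1% of the matmul work to the vector unit makes the whole thing about 4x slower.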

additionally, the on chip memory is actually a scratchpad, and is fully managed by the compiler. the compiler chooses which tile lives there and when dma brings the next one in, and when activations get spilled back to ddr. this would explain why large layers sometimes get rejected for tiling reasons even when their dtypes are fine, and why batch sizes and shapes matter (because the compiler's tile size search space depends on what fits).

the problem(s) with quantization

the quantization story on mobile npus is the single biggest gap between what's optimal in software and what's executable in hardware right now. mobile npus were originally designed for coarse grained quantized models and lack native support for fine grained group quantization, which is essential for modern llms deployed at low bit widths:

coarse vs fine grained quantization comparison

observe that models quantized with conventional per channel methods suffer severe performance degradation on reasoning tasks (highlighted in yellow):

performance degradation with per-channel quantization on reasoning tasks

every popular modern llm quantization format (gptq, awq, llama.cpp's q4_k_m etc.) is block wise. so out of the box, none of them map onto the matrix unit on current hexagon silicon. re-quantizing to per channel works, but produces the accuracy collapse the paper already measured: from 32.6% to 3.4% on gsm8k for llama-3.2-1b.
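a toy illustration of why the granularity matters. this is a plain symmetric int4 round trip on a synthetic row with one outlier, not any of the formats named above, and it does not reproduce the paper's numbers.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric quantization with one shared scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for int4
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def per_channel(row):
    return quantize_dequantize(row)               # one scale for the entire row

def group_wise(row, group_size=32):
    out = []
    for start in range(0, len(row), group_size):  # one scale per group
        out.extend(quantize_dequantize(row[start:start + group_size]))
    return out

# synthetic weights: small values plus a single large outlier
row = [0.01 * ((i % 7) - 3) for i in range(64)]
row[5] = 1.0

err_channel = mse(row, per_channel(row))
err_group = mse(row, group_wise(row))
print(err_channel, err_group)
```

with one scale per row, the outlier stretches the scale so far that every small weight rounds to zero. with groups of 32, only the group containing the outlier pays that price.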

llm decode GEMV underutilization on npu matrix unit

hao et al. came up with the strategy to exploit "wasted" npu matrix capacity.

during typical llm text generation (decoding), each step produces a single token, so the operations shrink from large gemms to thin gemvs. because the npu's matrix units are rigidly built to process large, square blocks or "tiles" of data at once, this single-token generation leaves a massive portion of the physical hardware idle.

hence, they implement parallel test-time scaling (generating multiple candidate answers in parallel), thus increasing the processing batch size. this turns the thin vectors back into wide matrices, fully saturating the npu's hardwired matrix acceleration unit.
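a simplified occupancy model of that idea. it assumes activations are padded up to a multiple of the 32-row tile edge mentioned earlier and counts only row occupancy, ignoring everything else that affects real utilization.

```python
import math

TILE = 32  # tile edge, taken from the 32x32 array described in the text

def tile_utilization(batch_rows, tile=TILE):
    """Fraction of tile rows doing useful work for a given number of activation rows."""
    padded = math.ceil(batch_rows / tile) * tile
    return batch_rows / padded

for batch in (1, 8, 32, 33):
    print(batch, tile_utilization(batch))
```

a single decoded token fills about 3% of the rows, 32 parallel candidates fill all of them, and a 33rd candidate spills into a second, mostly empty tile.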

they also made block-wise quantization execute on the matrix unit, using 4 techniques:

  1. mixed precision: weights live in memory as 4-bit block-wise values; the matrix unit runs fp16. weights are dequantized to fp16 in flight just before entering the systolic array.
    mixed precision dequantization approach - 4bit to fp16 in flight
  2. tile aligned group quantization: permute the weights into the matrix unit's tile layout first, then quantize group-by-group in the new order. block boundaries now coincide with the 32x32 tile boundaries the hmx naturally produces, so rescales happen between tiles.
  3. weight rearrangement and group coalescing: done offline, and baked into the deployed weights. the rearrangement matches hexagon's permuted tile format; the coalescing packs 8 small quantization groups into a super-group sized to fill exactly one hvx vector register, so the dequantization path reads many groups' scales in one aligned load instead of several scattered ones.
    weight rearrangement and group coalescing for aligned npu execution
  4. lut-based replacements for vector-unit ops: replace the expensive math (softmax exp, int4 to fp16 conversion) with table lookups in on-chip sram.
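the first and last techniques can be sketched together: unpack 4-bit codes, convert them through a 16-entry table, and apply a per-group scale. the nibble order, the signed code mapping and the tiny group size are assumptions for illustration, and python floats stand in for fp16.

```python
# 16-entry lookup table: 4-bit code -> signed value in [-8, 7] (two's complement)
INT4_LUT = [float(code if code < 8 else code - 16) for code in range(16)]

def unpack_nibbles(packed):
    """Each byte holds two 4-bit codes, low nibble first (assumed packing)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes

def dequantize(packed, scales, group_size):
    """Table lookup replaces the int-to-float conversion; one scale per group."""
    codes = unpack_nibbles(packed)
    return [INT4_LUT[code] * scales[i // group_size] for i, code in enumerate(codes)]

packed = bytes([0x21, 0xF8])    # codes 1, 2, 8, 15
scales = [0.5, 0.25]            # one scale per group of 2 weights
weights = dequantize(packed, scales, group_size=2)
print(weights)
```

on real hardware the table would sit in on-chip memory and the lookup would cover a whole vector register at once; the point here is only that the conversion becomes an index operation instead of arithmetic.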

these approaches require reverse engineering the undocumented hmx instructions and writing custom kernels through the hexagon sdk rather than going through qnn.

qualcomm quietly released a reference manual for programmers (hmx) on march 16, 2026.