When we say “SNARK performance”, they usually mean prover time, not verification. Verification is almost always cheap (milliseconds), while proving is expensive and scales with computation size.
At a deep level, a SNARK prover converts a computation into polynomials, and then proves those polynomials satisfy certain constraints, using heavy algebra + cryptography
So performance is dominated by:
- polynomial arithmetic
- memory movement
- cryptographic commitments
Forget phases for a second. Think of a SNARK prover as doing this:
- Execute program → generate trace (witness)
- Convert trace → polynomials
- Transform polynomials (FFT/NTT)
- Commit to them (hashes or elliptic curves)
- Prove evaluations at random points (PCS opening)
Everything else is implementation detail
Now let’s go through your phases and there percentage of time taken in whole process
- Witness Generation (~10%): “Run the program and record everything”
- For example: If we’re proving a RISC-V program, we record register states, memory accesses, intermediate values
- This becomes a huge table (called a trace)
- It’s only ~10%, Because it’s mostly sequential computation, No heavy cryptography, No large polynomial operations
- But this can dominate if our program has heavy memory access or if we simulate non-native operations (like SHA inside a field). So the 10% is only true for well-optimized circuits
- Commit Phase (The Real Bottleneck ~60%)
- NTT - The Core Engine (~30%)
- NTT is not just “FFT over a field”. It is the mechanism that allows SNARKs to move between algebraic representations efficiently
- We need it because Constraints are easier in coefficient form and Commitments are easier in evaluation form, so we constantly switch between coefficients and evaluations
- NTT algo goes like this
- We’ve a polynomial: f(x) = a0 + a1x + a2x² + …
- We evaluate it at special points: ω⁰, ω¹, ω², …, ωⁿ
- This is done using butterfly operations: (a, b) → (a + ωb, a - ωb)
- NTT is expensive because
- Each stage depends on the previous
- Memory access is non-sequential
- Requires careful cache usage
- Time complexity: O(n log n), But in practice: It is memory-bound, not compute-bound
- The main bottleneck is at moving data efficiently through memory while doing structured arithmetic, it’s fast in the start, but as we go on the elements goe far apart and the cache goes poor
- The real optimization comes from “ensuring data stays in cache while computing” not by “reduce number of multiplications”, to address this, implementations reorganize the data layout rather than changing the mathematical algorithm
- So we reinterpret the one-dimensional vector as a two-dimensional matrix, such as reshaping a length-256 vector into a 16×16 matrix. This transformation allows the algorithm to operate on smaller, contiguous chunks of data that fit neatly into cache. By performing NTT operations row-wise, each working set remains localized in memory, maximizing cache reuse and minimizing expensive memory fetches. After processing rows, the matrix is transposed, effectively converting columns into rows
- This is crucial because elements that were originally far apart in the one-dimensional layout are now brought close together in memory, allowing subsequent stages of the transform to continue operating in a cache-efficient manner. This alternating pattern of row-wise computation and transposition ensures that all stages of the NTT maintain good locality, avoiding the performance collapse seen in naive implementations
- This restructuring also aligns well with SIMD (Single Instruction Multiple Data) capabilities of modern CPUs. SIMD instructions operate on multiple data elements simultaneously but require those elements to be contiguous and properly aligned in memory. In a flat vector with strided access, SIMD utilization is poor because the required elements are not adjacent. However, in the matrix layout, rows are stored contiguously, enabling efficient loading of multiple field elements into SIMD registers and performing several butterfly operations in parallel. This significantly increases throughput by reducing the number of instructions and memory accesses required per operation. In essence, the matrix-based approach converts an algorithm that is theoretically efficient but practically memory-bound into one that is both compute-efficient and hardware-friendly
- Merkle Tree (~30%): This is where SNARKs become cryptographic, not just algebraic.
- Here we hash all polynomial evaluations and build a Merkle tree
- This is expensive because Hashing is not SIMD-friendly (especially SNARK-friendly hashes) and it’s memory intensive
- Also Merkle trees require log(n) layers and each layer involves large data movement, one good insight is this phase is often bandwidth-bound not compute-bound
- NTT - The Core Engine (~30%)
- Constraint Evaluation (~10%): This step checks,
All constraints = 0, for examplea * b - c = 0. This is embarrassingly parallel each row independent - PCS Opening (~20%): This is where SNARKs become cryptographic proofs, here we prove
f(x) = y(without revealing f). This uses a Polynomial Commitment Scheme (PCS)- Depends on scheme:
- KZG (pairing-based): Multi-scalar multiplication (MSM) and Elliptic curve operations
- FRI (hash-based): More hashing and More queries
- The issue with this is extension felids as in many systems the base field is 32-bit and extension field is 64/128-bit (multiple limbs, So one “element” is actually
(a + b·u). this requires multiple multiplications and cross terms and this breaks SIMD efficiency
- Depends on scheme:
Coming to Hardware level optimizations
- SIMD(Single Instruction Multiple Data) - Exploiting Data Parallelism
- It is used because SNARK workloads are dense linear algebra over finite fields, for example 128-bit register is 4 × 32-bit field elements, So 4 multiplications per instruction
- The real challenge is Field multiplication requires multiplication and modular reduction and this is NOT native to CPUs, So compilers must emulate it using multiple instructions
- SNARK performance is often limited by how well you implement modular arithmetic
- Multi-Core CPUs: we can scale above approach by Split NTT across cores, Split Merkle hashing and Parallelize constraint evaluation
- But the bottleneck is Memory bandwidth again. Even if multiple cores exist
- So increasing cores can’t give us more speed as memory cannot feed them fast enough
- GPU Acceleration - GPUs can work across many simultaneous multi-processors that makes them perfect for NTT, hashing, MSM workloads
- They also have separate L1 cache and shared memory, but we can’t customize it as cpu just by using rust code, so we transfer the data from CPU to GPU using a bus, and the problem here is we should not let the processors in GPU to be ideal without doing any work
- One thing we need to care about is network bandwidth, assuming a data center, each rack will have bunch of boxes, each box will have a CPU, GPU and memory, we can exploit compute and bandwidth tradeoff by recursion, let’s say we wanna prove 1B RISC-V cycles, we can recursively compress this to verifiable proofs like 1000 segments each 1M cycle and distribute to whole rack of GPUs, the time complexity will be log2(#segments)
- what does it take to do field operation in an GPU or a CPU, for a prime field it looks like an integer multiplication with additional cycles to perform a modular reduction, one thing we can do is to a make the native data type of our processing unit to be field element multiplications, we could even handle extension field
- NVIDIA GPUs are optimized for floating point arithmetic operations, have flexible control flow (if statements), have very core memory hierarchy(data can be easily accessed from anywhere to anywhere)
- Distributed Proving + Recursion: Let’s say we need to prove 1 billion steps, which is impossible to be done on one machine efficiently
- We use recursive SNARKs for this we split it into many smaller proofs and combine them. This is widely used in zkVMs and rollups
- This recursion turns O(n) work → O(log n) verification and allows horizontal scaling across machines
- Field Arithmetic: Everything boils down to
(a * b) mod p- The cost breakdowns to integer multiply and modular reduction
- Reduction is expensive which is division-like operation
- but what if we had native field multiplication instructions, that would make SNARKs would be dramatically faster, here we can use ASICs and FPGAs
Note: Rather than using prime fields we can use Binary fields
- Prime fields like baby bear are arithmetic mod p, and requires modular reduction
- But Binary fields(F₂ⁿ) needs addition(XOR) and multiplication is just bit operations, this is way faster than prime fields as this maps directly to hardware gates and no modular reduction needed and this will be ~5× more efficient in hardware compare to prime fields
- But we don’t use them as
- many SNARK constructions assume prime fields
- algebra is simpler in prime fields