Assignment 1

Q The major areas where the prover spends its time?
A The total prove command takes 2,140 ms wall-clock, of which 570 ms goes to XZ-decompressing the prover key and 10 ms to writing the proof.

Q The most fundamental operations in the overall proving flow?
A

Dominant (>10% CPU each)
- BN254 Montgomery Multiplication (28.1%): The single hottest function (`ark_ff::mul_assign`). A 4-limb Montgomery multiply on a 254-bit prime field element (~15–20 ns per call on Apple Silicon). Used by the NTT, sumcheck, weight folding, and geometric accumulation, but NOT by Skyscraper (see below).
- Skyscraper Hash Compression (19.0%): The Merkle tree hash for polynomial commitments. It processes 64-byte blocks through a byte-level S-box (bitwise rotate + AND + XOR) followed by Montgomery squaring via `bn254_multiplier::montgomery_square_log_interleaved_4`, a separate ARM NEON SIMD implementation in the skyscraper/bn254-multiplier crate that is completely independent of `ark_ff::mul_assign`.
- NTT / Reed-Solomon Encoding (20.5% total): Cooley-Tukey FFT over BN254 Fr. The compute portion (butterfly operations) is 15.2%, and the data transpose (shuffling 32 MB between row-major and column-major layouts) adds 5.3%. The largest NTT operates on 1M-element polynomials.

Significant (3–10% each)
- Weight folding / `mixed_dot` (6.1%): dense multiply-accumulate over 1M-element vectors.
- Sumcheck prover (5.5%): cubic polynomial evaluation at 3 points per variable, parallelized via Rayon.
- XZ/LZMA decompression (5.5%): prover key I/O, not proving work.

Notable (<3%)
- Sparse matrix ops (2.4%), `evaluate_gamma_block` (1.4%), `geometric_accumulate` (1.3%).
- Witness generation is only 0.6%: the Noir ACVM is cheap.
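
To make the hottest operation in the list above concrete, here is a minimal Python sketch of Montgomery multiplication over the BN254 scalar field. It uses Python big integers and one 256-bit reduction step instead of a 4-limb word-by-word loop, so it only illustrates the arithmetic and is not the `ark_ff` implementation. The helper names (`redc`, `to_mont`, `from_mont`, `mont_mul`) are made up for this example.

```python
# BN254 scalar field modulus (Fr), a 254-bit prime.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617

R_BITS = 256                 # 4 limbs x 64 bits
R = 1 << R_BITS              # Montgomery radix
MASK = R - 1
# -P^{-1} mod R, the constant used by the reduction (needs Python 3.8+).
NEG_P_INV = (-pow(P, -1, R)) % R


def redc(t):
    """Montgomery reduction: return t * R^{-1} mod P for 0 <= t < P * R."""
    m = ((t & MASK) * NEG_P_INV) & MASK   # chosen so that t + m*P is divisible by R
    u = (t + m * P) >> R_BITS             # exact division by R
    return u - P if u >= P else u         # single conditional subtraction


def to_mont(x):
    """Map x to its Montgomery form x * R mod P."""
    return (x << R_BITS) % P


def from_mont(x):
    """Map a Montgomery-form value back to the ordinary representative."""
    return redc(x)


def mont_mul(a, b):
    """Multiply two Montgomery-form values; the result stays in Montgomery form."""
    return redc(a * b)


a, b = 123456789, 987654321
product = from_mont(mont_mul(to_mont(a), to_mont(b)))
```

The point of the representation is that `mont_mul` needs no division by `P`: the reduction is a mask, two multiplications, a shift, and one conditional subtraction, which is what makes it cheap enough to call billions of times.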

Note: WHIR round cost decay. Each WHIR proof runs 5 FRI-like rounds with geometrically shrinking domains. Round 1 (domain 1M) costs 75.6 ms for IRS Commit alone; by round 5 (input domain 32) it drops to 6.4 ms. The IRS Commit phase (NTT + Merkle) dominates every round.
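
The NTT inside IRS Commit is built from butterflies, each of which costs one field multiplication by a twiddle factor. A minimal recursive Cooley-Tukey sketch follows. It assumes a toy field (modulus 17, where 2 is a primitive 8th root of unity) rather than BN254 Fr, and it ignores the iterative layout, parallelism, and transposes a real prover would use.

```python
def ntt(values, root, modulus):
    """Radix-2 Cooley-Tukey NTT.

    `root` must be a primitive len(values)-th root of unity modulo `modulus`,
    and len(values) must be a power of two.
    """
    n = len(values)
    if n == 1:
        return list(values)
    root_sq = root * root % modulus            # primitive (n/2)-th root
    even = ntt(values[0::2], root_sq, modulus)
    odd = ntt(values[1::2], root_sq, modulus)
    out = [0] * n
    twiddle = 1
    for k in range(n // 2):
        t = twiddle * odd[k] % modulus         # the butterfly's field multiplication
        out[k] = (even[k] + t) % modulus
        out[k + n // 2] = (even[k] - t) % modulus
        twiddle = twiddle * root % modulus
    return out


def naive_dft(values, root, modulus):
    """Quadratic-time reference: out[k] = sum_j values[j] * root^(j*k)."""
    n = len(values)
    return [
        sum(values[j] * pow(root, j * k, modulus) for j in range(n)) % modulus
        for k in range(n)
    ]


TOY_MODULUS = 17   # toy prime, not BN254
TOY_ROOT = 2       # 2^8 = 256 = 1 (mod 17) and 2^4 = 16 = -1 (mod 17)
coeffs = [0, 1, 2, 3, 4, 5, 6, 7]
evaluations = ntt(coeffs, TOY_ROOT, TOY_MODULUS)
```

An n-point transform performs (n/2)·log2(n) butterflies, which is why the NTT is the largest consumer of field multiplications once n reaches 1M.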

Q By how much would proving time decrease if field multiplication took half as long as it does now? Would the improvement be minor or significant, and why?
A First, who calls `ark_ff::mul_assign`? A programmatic pass over all 7,600 flamegraph samples traced every sample whose leaf frame is `mul_assign` (2,138 samples, 28.1% of the total) to its direct caller:

| Caller | Samples | % of mul |
| --- | ---: | ---: |
| NTT butterflies | 1,546 | 72.3% |
| mixed_dot | 252 | 11.8% |
| geometric_accumulate | 184 | 8.6% |
| Sumcheck | 92 | 4.3% |
| evaluate_gamma_block | 25 | 1.2% |
| Other | 39 | 1.8% |
| Skyscraper | 0 | 0.0% |

Skyscraper contributes zero `ark_ff::mul_assign` calls. It uses a completely separate SIMD Montgomery implementation (`bn254_multiplier`). The 19% of CPU time in Skyscraper is unaffected by any `ark_ff` speedup.
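
A caller breakdown like this can be computed from folded stacks (one `frame;frame;...;leaf count` line per distinct stack, the format flamegraph tools consume). The sketch below is an assumption about how such a pass might look, not the script used for these notes, and the input lines are a made-up toy profile rather than the real samples.

```python
from collections import Counter


def callers_of_leaf(folded_lines, leaf_name):
    """Sum sample counts per direct caller for stacks whose leaf is `leaf_name`."""
    counts = Counter()
    for line in folded_lines:
        # The sample count is the last space-separated field; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        if frames[-1] != leaf_name:
            continue                              # leaf is some other function
        caller = frames[-2] if len(frames) > 1 else "<root>"
        counts[caller] += int(count)
    return counts


# Toy input (hypothetical frame names and counts, for illustration only).
TOY_FOLDED = [
    "main;prove;ntt;mul_assign 6",
    "main;prove;mixed_dot;mul_assign 3",
    "main;prove;ntt;transpose 2",
    "main;prove;skyscraper;compress 4",
    "main;prove;sumcheck;mul_assign 1",
]

counts = callers_of_leaf(TOY_FOLDED, "mul_assign")
total = sum(counts.values())
shares = {caller: 100.0 * n / total for caller, n in counts.items()}
```

Counting only leaf samples matters here: a stack that merely passes through a function does not mean the CPU was executing that function when the sample was taken.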

`ark_ff::mul_assign` accounts for 34.3% of proving time, so a 2× speedup removes half of that: ~240 ms off a 1,400 ms prove, a 1.21× proving speedup (17% reduction). That’s noticeable. But it’s not transformative because:

- Skyscraper (19% of CPU) is completely unaffected: it uses a separate SIMD Montgomery implementation.
- NTT transposes (5.3%) are memory-bound; halving the multiplication cost doesn’t help memory shuffling.
- OS overhead (4.9%) and I/O (5.5%) are fixed costs.

By Amdahl’s Law, even an infinitely fast `mul_assign` could only remove that 34.3% of proving time, capping the theoretical maximum at a 1.52× proving speedup.
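
The figures above follow directly from Amdahl's Law. A small sketch that recomputes them from the two numbers quoted in the text (the 34.3% share and the 1,400 ms prove time); the function and variable names are illustrative only.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


MUL_FRACTION = 0.343   # share of proving time spent in mul_assign (from the text)
PROVE_MS = 1400.0      # proving time in milliseconds (from the text)

speedup_2x = amdahl_speedup(MUL_FRACTION, 2.0)             # about 1.21
speedup_cap = amdahl_speedup(MUL_FRACTION, float("inf"))   # about 1.52
saved_ms = PROVE_MS * (1.0 - 1.0 / speedup_2x)             # about 240 ms
```

The gap between 1.21× and 1.52× is all that further multiplication tuning could ever recover; the remaining 65.7% of proving time has to be attacked elsewhere.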

Additional Data

Memory Profile

Peak memory reaches 908 MB during `prove_noir`, when the R1CS matrices, the full witness, the transposed matrices, and the eq(alpha) evaluations coexist. System-level RSS peaks at 1.36 GB.

Consistency (3 runs, `/usr/bin/time -l`)

| Metric | Run 1 | Run 2 | Run 3 | Variance |
| --- | ---: | ---: | ---: | ---: |
| Instructions | 188.589 B | 188.607 B | 188.627 B | <0.02% |
| Cycles | 47.157 B | 47.271 B | 47.368 B | <0.4% |
| Peak RSS | 1.363 GB | 1.357 GB | 1.362 GB | <0.4% |

IPC: 188.6 B instructions / 47.3 B cycles ≈ 3.99, a high sustained rate for Apple Silicon and consistent with a compute-bound workload.
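
The consistency figures can be rechecked from the three runs. The sketch below assumes the "Variance" column means the largest deviation of any run from the mean, as a percentage of the mean; under that reading the table's bounds hold. The variable names are illustrative.

```python
# Values copied from the table above (billions of instructions/cycles, GB of RSS).
runs = {
    "instructions_B": [188.589, 188.607, 188.627],
    "cycles_B": [47.157, 47.271, 47.368],
    "peak_rss_GB": [1.363, 1.357, 1.362],
}


def max_deviation_pct(values):
    """Largest absolute deviation from the mean, as a percentage of the mean."""
    mean = sum(values) / len(values)
    return 100.0 * max(abs(v - mean) for v in values) / mean


# Instructions per cycle for each run.
ipc_per_run = [
    instructions / cycles
    for instructions, cycles in zip(runs["instructions_B"], runs["cycles_B"])
]

spread = {name: max_deviation_pct(values) for name, values in runs.items()}
```

Each run's IPC lands between 3.98 and 4.00, so the ≈ 3.99 figure is not an artifact of averaging.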

Documents

Prompts - time for first two prompts is not recorded, my bad
Notes - my personal notes, while working on this

Sadiq's Knowledge Vaults
