Evidence-Ledger Synthesis of Post-Training Quantization for LLMs

paper_id	claim	claim_status	evidence_status	source_depth	source_quote	page_or_section	taxonomy_fit	audit_status
2606.03458v1	End-to-end evaluations of KV-cache quantization with Variance Normalization (KVarN) on generative benchmarks show substantial improvement over the current state of the art on AIME24, MATH500, HumanEval, and IFEval.	supported	has evidence row	full-text	At 2.3 average bits per element even with the second scale, KVarN outperforms or matches prior methods, see e.g.	Abstract	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2512.19206v1	We find that for effective low-bit KV cache quantization, the precision allocated to a key channel must be determined by two factors: its intrinsic quantization difficulty and its dynamic relevance to the query.	supported	has evidence row	full-text	Compared to BF16 and competitive baselines, MixKVQ pushes the effective bit-width down to 2.70 bits with negligible performance degradation.	1 Introduction	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2606.09864v1	Alignment collapse is real and silent.	preliminary-linked	has evidence row	full-text	Mistral-7B loses 15.2% of its refusals at only 1.03× perplexity, and no universal safe bit-width exists.	Abstract	in-scope: LLM extractor confirmed direction match	needs work; filled but source-depth unclear; report=audit_report.md
2605.17757v1	We propose OSCAR, an attention-aware calibration framework for ultra-low-bit KV-cache quantization.	supported	has evidence row	full-text	On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero.	Abstract	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md
2505.18610v1	We design progressive quantization and block-wise memory allocation techniques tailored for long-CoT scenarios to fully utilize the memory budget of the target hardware and effectively reduce the cumulative quantization error.	supported	has evidence row	full-text	Extensive experiments on 7B–70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget.	1	in-scope: LLM extractor confirmed direction match	needs work; full-text verified; report=audit_report.md