Evidence-ledger draft

Evidence-Ledger Synthesis of Post-Training Quantization for LLMs — Claim ledger

CSV-backed claim ledger tying paper claims to paper IDs and evidence status.

paper_id,claim,claim_status,evidence_status,source_depth,source_quote,page_or_section,taxonomy_fit,audit_status
2606.03458v1,"End-to-end evaluations of KV-Cache quantization with Variance Normalization (KVarN) on generative benchmarks with substantial improvement over current state-of-the-art in AIME24, MATH500, HumanEval and IFEval.",supported,has evidence row,full-text,"At 2.3 average bits per element even with the second scale, KVarN outperforms or matches prior methods, see e.g.",Abstract,in-scope: LLM extractor confirmed direction match,needs work; full-text verified; report=audit_report.md
2512.19206v1,"We find that for effective low-bit KV cache quantization, the precision allocated to a key channel must be determined by two factors: its intrinsic quantization difficulty and its dynamic relevance to the query.",supported,has evidence row,full-text,"Compared to BF16 and competitive baselines, MixKVQ pushes the effective bit-width down to 2.70 bits with negligible performance degradation.",1 Introduction,in-scope: LLM extractor confirmed direction match,needs work; full-text verified; report=audit_report.md
2606.09864v1,"Alignment collapse is real and silent.",preliminary-linked,has evidence row,full-text,"Mistral-7B loses 15.2% of its refusals at only 1.03× perplexity, and no universal safe bit-width exists.",Abstract,in-scope: LLM extractor confirmed direction match,needs work; filled but source-depth unclear; report=audit_report.md
2605.17757v1,"We propose OSCAR, an attention-aware calibration framework for ultra-low-bit KV-cache quantization.",supported,has evidence row,full-text,"On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero.",Abstract,in-scope: LLM extractor confirmed direction match,needs work; full-text verified; report=audit_report.md
2505.18610v1,"We design progressive quantization and block-wise memory allocation techniques tailored for long-CoT scenarios to fully utilize the memory budget of the target hardware and effectively reduce the cumulative quantization error.",supported,has evidence row,full-text,"Extensive experiments on 7B–70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget.",1,in-scope: LLM extractor confirmed direction match,needs work; full-text verified; report=audit_report.md
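
A ledger with these nine columns can be checked mechanically. The following is a minimal sketch using only Python's standard library, not the tooling that produced this ledger. The helper names (`load_ledger`, `audit_flags`, `unverified`) are hypothetical, the two sample rows are abbreviated with their claims and quotes elided, and splitting `audit_status` on semicolons is an assumption based on how the values appear in the rows.

```python
import csv
import io
from collections import Counter

# Column names as they appear in the ledger header.
COLUMNS = [
    "paper_id", "claim", "claim_status", "evidence_status", "source_depth",
    "source_quote", "page_or_section", "taxonomy_fit", "audit_status",
]

# Two abbreviated rows for illustration; claim and quote text is elided here.
SAMPLE = '''paper_id,claim,claim_status,evidence_status,source_depth,source_quote,page_or_section,taxonomy_fit,audit_status
2606.09864v1,"Alignment collapse is real and silent.",preliminary-linked,has evidence row,full-text,"(quote elided)",Abstract,in-scope: LLM extractor confirmed direction match,needs work; filled but source-depth unclear; report=audit_report.md
2605.17757v1,"(claim elided)",supported,has evidence row,full-text,"(quote elided)",Abstract,in-scope: LLM extractor confirmed direction match,needs work; full-text verified; report=audit_report.md
'''


def load_ledger(text):
    """Parse ledger CSV text into a list of dicts, checking the header first."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


def audit_flags(row):
    """Split audit_status into flags (assumes ';' separates the flags)."""
    return [part.strip() for part in row["audit_status"].split(";")]


def unverified(rows):
    """Return paper IDs whose audit_status lacks the 'full-text verified' flag."""
    return [r["paper_id"] for r in rows if "full-text verified" not in audit_flags(r)]


rows = load_ledger(SAMPLE)
print(Counter(r["claim_status"] for r in rows))  # tally of claim_status values
print(unverified(rows))                          # rows still needing a source-depth check
```

Reading the file with `csv.DictReader` rather than splitting on commas matters here because several claims and source quotes contain commas themselves and are only safe to parse when quoted.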