Evidence-ledger draft

Evidence-Ledger Synthesis of Transformer Normalization Variants — Claim ledger

CSV-backed claim ledger tying paper claims to paper IDs and evidence status.

paper_id,claim,claim_status,evidence_status,source_depth,source_quote,page_or_section,taxonomy_fit,audit_status
2305.14858v2,"While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers.",supported,has evidence row,full-text,"Specifically, the experiments in [26] show that RMSNorm improves the pre-training speed by 5% compared with the LayerNorm baseline. 2 The following discussion on expressivity is out of the scope of this paper and is only for reference.",p1,in-scope: taxonomy category match,pass; full-text verified; report=audit_report.md
2511.10566v1,"Gradients Explain LN’s Impact: We explain the divergent impacts of LN in Pre- and Post-LN models by comparing learning and memorization gradients, which reveal why LN parameter removal causes learning disruption and memorization suppression in Pre- and Post-LN models, respectively.",supported,has evidence row,full-text,"Consistent trends are observed for other Pre-LN (GPT2, Qwen2, ViT-B, DeiT, ViT-S) and Post-LN (RoBERTa, BERT, Longformer, ELECTRA) models, illustrated in Appendix G.3. 6.2 Why are LNs in Early Layers Important for Memorization and Learning?",p2,in-scope: taxonomy category match,pass; full-text verified; report=audit_report.md
2601.19895v2,"We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks.",supported,has evidence row,full-text,"On GSM-8K, the gap remains over 10 points (68.8 vs. 58.7), and on MMLU-Pro which requires nuanced understanding and robust instruction following, Keel achieves a score of 35.6 compared to the baseline’s 26.6.",p1,in-scope: taxonomy category match,pass; full-text verified; report=audit_report.md
2602.18849v1,"Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm’s $N^{-1/4}$ emerges from the quartic structure of attention’s four projection matrices.",supported,has evidence row,full-text,"(4) Pre-LN vs post-LN mechanism. We prove pre-LN preserves an additive identity gradient path (bypassing sublayer Jacobians) while post-LN forces all gradients through LayerNorm Jacobians, causing exponential decay with depth (Theorems 5.4, 5.5).",p1,in-scope: taxonomy category match,pass; full-text verified; report=audit_report.md
2511.13250v1,"Edge-aware baselines for ogbn-proteins in PyTorch Geometric Species-wise normalization, post-hoc calibration, and cost–accuracy trade-offs Aleksandar Stanković∗ Dejan Lisica† November 18, 2025 Abstract We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG).",supported,has evidence row,full-text,"Edge-aware baselines for ogbn-proteins in PyTorch Geometric Species-wise normalization, post-hoc calibration, and cost–accuracy trade-offs Aleksandar Stanković∗ Dejan Lisica† November 18, 2025 Abstract We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG).",p1,in-scope: taxonomy category match,pass; full-text verified; report=audit_report.md
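
Because the ledger is CSV-backed, its rows can be checked mechanically. The following is a minimal Python sketch, not part of the ledger tooling itself: the nine column names come from the ledger header, while the function names `load_ledger` and `check_row`, the `SAMPLE` text (two rows with shortened claims and quotes, for illustration only), and the two checks are assumptions. It flags empty fields and rows whose source quote merely repeats the claim, which would give the claim no independent support.

```python
import csv
import io

# Column names as they appear in the ledger header.
COLUMNS = [
    "paper_id", "claim", "claim_status", "evidence_status", "source_depth",
    "source_quote", "page_or_section", "taxonomy_fit", "audit_status",
]


def load_ledger(text):
    """Parse ledger CSV text into a list of dicts, one per claim row."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


def check_row(row):
    """Return a list of problems found in one row (hypothetical checks)."""
    problems = []
    if None in row:
        # DictReader stores surplus fields under the key None.
        problems.append("row has more fields than the header")
    for name in COLUMNS:
        if not (row.get(name) or "").strip():
            problems.append(f"empty field: {name}")
    claim = (row.get("claim") or "").strip()
    quote = (row.get("source_quote") or "").strip()
    if claim and claim == quote:
        problems.append("source_quote repeats the claim verbatim")
    return problems


# Illustrative sample only: claims and quotes are shortened, not verbatim.
SAMPLE = ",".join(COLUMNS) + "\n" + (
    '2602.18849v1,"We prove that pre-LN preserves identity gradient paths.",'
    'supported,has evidence row,full-text,'
    '"We prove pre-LN preserves an additive identity gradient path.",'
    'p1,in-scope: taxonomy category match,'
    'pass; full-text verified; report=audit_report.md\n'
    '2511.13250v1,"We present reproducible, edge-aware baselines.",'
    'supported,has evidence row,full-text,'
    '"We present reproducible, edge-aware baselines.",'
    'p1,in-scope: taxonomy category match,'
    'pass; full-text verified; report=audit_report.md\n'
)

if __name__ == "__main__":
    for row in load_ledger(SAMPLE):
        for problem in check_row(row):
            print(f"{row['paper_id']}: {problem}")
```

In a real run, the text would come from the ledger's CSV file rather than from `SAMPLE`. Reading through `csv.DictReader` rather than splitting on commas matters here, since most claims and quotes contain commas inside quoted fields.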