Evidence-Ledger Synthesis of Transformer Normalization Variants

paper_id	claim	claim_status	evidence_status	source_depth	source_quote	page_or_section	taxonomy_fit	audit_status
2305.14858v2	While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers.	supported	has evidence row	full-text	Specifically, the experiments in [26] show that RMSNorm improves the pre-training speed by 5% compared with the LayerNorm baseline. 2 The following discussion on expressivity is out of the scope of this paper and is only for reference.	p1	in-scope: taxonomy category match	pass; full-text verified; report=audit_report.md
2511.10566v1	Gradients Explain LN’s Impact: We explain the divergent impacts of LN in Pre- and Post-LN models by comparing learning and memorization gradients, which reveal why LN parameter removal causes learning disruption and memorization suppression in Pre- and Post-LN models, respectively.	supported	has evidence row	full-text	Consistent trends are observed for other Pre-LN (GPT2, Qwen2, ViT-B, DeiT, ViT-S) and Post-LN (RoBERTa, BERT, Longformer, ELECTRA) models, illustrated in Appendix G.3. 6.2 Why are LNs in Early Layers Important for Memorization and Learning?	p2	in-scope: taxonomy category match	pass; full-text verified; report=audit_report.md
2601.19895v2	We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks.	supported	has evidence row	full-text	On GSM-8K, the gap remains over 10 points (68.8 vs. 58.7), and on MMLU-Pro which requires nuanced understanding and robust instruction following, Keel achieves a score of 35.6 compared to the baseline’s 26.6.	p1	in-scope: taxonomy category match	pass; full-text verified; report=audit_report.md
2602.18849v1	Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm’s N^{-1/4} emerges from the quartic structure of attention’s four projection matrices.	supported	has evidence row	full-text	(4) Pre-LN vs post-LN mechanism. We prove pre-LN preserves an additive identity gradient path (bypassing sublayer Jacobians) while post-LN forces all gradients through LayerNorm Jacobians, causing exponential decay with depth (Theorems 5.4, 5.5).	p1	in-scope: taxonomy category match	pass; full-text verified; report=audit_report.md
2511.13250v1	Edge-aware baselines for ogbn-proteins in PyTorch Geometric Species-wise normalization, post-hoc calibration, and cost–accuracy trade-offs Aleksandar Stanković∗ Dejan Lisica† November 18, 2025 Abstract We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG).	supported	has evidence row	full-text	Edge-aware baselines for ogbn-proteins in PyTorch Geometric Species-wise normalization, post-hoc calibration, and cost–accuracy trade-offs Aleksandar Stanković∗ Dejan Lisica† November 18, 2025 Abstract We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG).	p1	in-scope: taxonomy category match	pass; full-text verified; report=audit_report.md