| 2305.14858v2 | While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. | supported | has evidence row | full-text | Specifically, the experiments in [26] show that RMSNorm improves the pre-training speed by 5% compared with the LayerNorm baseline. [Footnote 2] The following discussion on expressivity is out of the scope of this paper and is only for reference. | p1 | in-scope: taxonomy category match | pass; full-text verified; report=audit_report.md |
| 2511.10566v1 | Gradients Explain LN’s Impact: We explain the divergent impacts of LN in Pre- and Post-LN models by comparing learning and memorization gradients, which reveal why LN parameter removal causes learning disruption and memorization suppression in Pre- and Post-LN models, respectively. | supported | has evidence row | full-text | Consistent trends are observed for other Pre-LN (GPT2, Qwen2, ViT-B, DeiT, ViT-S) and Post-LN (RoBERTa, BERT, Longformer, ELECTRA) models, illustrated in Appendix G.3. 6.2 Why are LNs in Early Layers Important for Memorization and Learning? | p2 | in-scope: taxonomy category match | pass; full-text verified; report=audit_report.md |
| 2601.19895v2 | We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. | supported | has evidence row | full-text | On GSM-8K, the gap remains over 10 points (68.8 vs. 58.7), and on MMLU-Pro which requires nuanced understanding and robust instruction following, Keel achieves a score of 35.6 compared to the baseline’s 26.6. | p1 | in-scope: taxonomy category match | pass; full-text verified; report=audit_report.md |
| 2602.18849v1 | Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm’s $N^{-1/4}$ emerges from the quartic structure of attention’s four projection matrices. | supported | has evidence row | full-text | (4) Pre-LN vs post-LN mechanism. We prove pre-LN preserves an additive identity gradient path (bypassing sublayer Jacobians) while post-LN forces all gradients through LayerNorm Jacobians, causing exponential decay with depth (Theorems 5.4, 5.5). | p1 | in-scope: taxonomy category match | pass; full-text verified; report=audit_report.md |
| 2511.13250v1 | Edge-aware baselines for ogbn-proteins in PyTorch Geometric: Species-wise normalization, post-hoc calibration, and cost–accuracy trade-offs. Aleksandar Stanković, Dejan Lisica. November 18, 2025. Abstract: We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG). | supported | has evidence row | full-text | Edge-aware baselines for ogbn-proteins in PyTorch Geometric: Species-wise normalization, post-hoc calibration, and cost–accuracy trade-offs. Aleksandar Stanković, Dejan Lisica. November 18, 2025. Abstract: We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG). | p1 | in-scope: taxonomy category match | pass; full-text verified; report=audit_report.md |