Evidence-ledger draft

Evidence-Ledger Synthesis of Attention Variants and Linear Attention in LLMs — Claim ledger

CSV-backed claim ledger tying paper claims to paper IDs and evidence status.

paper_id,claim,claim_status,evidence_status,source_depth,source_quote,page_or_section,taxonomy_fit,audit_status
2502.18845v2,"To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.",supported,has evidence row,full-text,"During inference, SWAT maintains linear computational complexity through sliding window attention while preserving model performance, achieving state-of-the-art (SOTA) results on eight commonsense reasoning benchmarks compared to mainstream linear recurrent architectures.",p1,in-scope: explicit topic match,pass; full-text verified; report=audit_report.md
2406.14963v1,"To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers.",supported,has evidence row,full-text,"To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers.",p1,in-scope: explicit topic match,pass; full-text verified; report=audit_report.md
2511.14712v2,We identify the limitations of prior resolution extrapolation methods for image generation when applied to videos and propose an inward sliding-window attention mechanism complemented by a dual-path strategy to effectively overcome these issues.,supported,has evidence row,full-text,"Ablation Study Contribution of Proposed Strategy. To validate the effectiveness of our proposed method, we conduct detailed ablation studies on Wan2.1-1.3B [43] as shown in Tab. 3.",p1,in-scope: taxonomy category match,pass; full-text verified; report=audit_report.md
2512.10411v5,"To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug and play toolkit of recipes that adapts FA models to SWA without costly pretraining.",supported,has evidence row,full-text,"For instance, configuring odd numbered layers with FA vastly outperforms even numbered ones for Qwen3-4B (row 13 vs. row 11).",p1,in-scope: explicit topic match,pass; full-text verified; report=audit_report.md
2512.07011v1,"We propose Block-Sparse FlashAttention (BSFA), which takes a fundamentally different approach: compute all query-key scores exactly to determine importance, then use these scores to decide which value blocks to process.",supported,has evidence row,full-text,"We propose Block-Sparse FlashAttention (BSFA), which takes a fundamentally different approach: compute all query-key scores exactly to determine importance, then use these scores to decide which value blocks to process.",p2,in-scope: explicit topic match,pass; full-text verified; report=audit_report.md
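
Because the ledger is described as CSV-backed, a short script can load it and run simple consistency checks. The sketch below is illustrative only: it assumes the backing file is comma-separated with double-quoted text fields, and the names LEDGER_COLUMNS, SAMPLE_CSV, load_ledger, and quote_restates_claim are hypothetical, not part of any existing tooling. It parses two rows copied from the ledger above and reports whether each source quote merely repeats the claim sentence.

```python
import csv
import io

# Column names as they appear in the ledger header.
LEDGER_COLUMNS = [
    "paper_id", "claim", "claim_status", "evidence_status", "source_depth",
    "source_quote", "page_or_section", "taxonomy_fit", "audit_status",
]

# Two rows copied from the ledger. The comma-separated, double-quoted layout
# is an assumption; a real run would read the backing CSV file instead.
SAMPLE_CSV = '''paper_id,claim,claim_status,evidence_status,source_depth,source_quote,page_or_section,taxonomy_fit,audit_status
2406.14963v1,"To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers.",supported,has evidence row,full-text,"To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers.",p1,in-scope: explicit topic match,pass; full-text verified; report=audit_report.md
2512.10411v5,"To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug and play toolkit of recipes that adapts FA models to SWA without costly pretraining.",supported,has evidence row,full-text,"For instance, configuring odd numbered layers with FA vastly outperforms even numbered ones for Qwen3-4B (row 13 vs. row 11).",p1,in-scope: explicit topic match,pass; full-text verified; report=audit_report.md
'''


def load_ledger(text):
    """Parse ledger CSV text into a list of dicts, one per claim row."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != LEDGER_COLUMNS:
        raise ValueError(f"unexpected ledger columns: {reader.fieldnames}")
    return list(reader)


def quote_restates_claim(row):
    """Return True when the source quote is the same sentence as the claim."""
    return row["claim"].strip() == row["source_quote"].strip()


rows = load_ledger(SAMPLE_CSV)
for row in rows:
    if quote_restates_claim(row):
        note = "quote restates claim"
    else:
        note = "quote differs from claim"
    print(row["paper_id"], row["claim_status"], note)
```

Whether a quote identical to its claim counts as adequate evidence is a matter for the ledger's own audit rules; the check only surfaces such rows for review.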