Evidence-Ledger Synthesis of Attention Variants and Linear Attention in LLMs

paper_id	claim	claim_status	evidence_status	source_depth	source_quote	page_or_section	taxonomy_fit	audit_status
2502.18845v2	To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.	supported	has evidence row	full-text	During inference, SWAT maintains linear computational complexity through sliding window attention while preserving model performance, achieving state-of-the-art (SOTA) results on eight commonsense reasoning benchmarks compared to mainstream linear recurrent architectures.	p1	in-scope: explicit topic match	pass; full-text verified; report=audit_report.md
2406.14963v1	To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers.	supported	has evidence row	full-text	To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers.	p1	in-scope: explicit topic match	pass; full-text verified; report=audit_report.md
2511.14712v2	We identify the limitations of prior resolution extrapolation methods for image generation when applied to videos and propose an inward sliding-window attention mechanism complemented by a dual-path strategy to effectively overcome these issues.	supported	has evidence row	full-text	Ablation Study Contribution of Proposed Strategy. To validate the effectiveness of our proposed method, we conduct detailed ablation studies on Wan2.1-1.3B [43] as shown in Tab. 3.	p1	in-scope: taxonomy category match	pass; full-text verified; report=audit_report.md
2512.10411v5	To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug and play toolkit of recipes that adapts FA models to SWA without costly pretraining.	supported	has evidence row	full-text	For instance, configuring odd numbered layers with FA vastly outperforms even numbered ones for Qwen3-4B (row 13 vs. row 11).	p1	in-scope: explicit topic match	pass; full-text verified; report=audit_report.md
2512.07011v1	We propose Block-Sparse FlashAttention (BSFA), which takes a fundamentally different approach: compute all query-key scores exactly to determine importance, then use these scores to decide which value blocks to process.	supported	has evidence row	full-text	We propose Block-Sparse FlashAttention (BSFA), which takes a fundamentally different approach: compute all query-key scores exactly to determine importance, then use these scores to decide which value blocks to process.	p2	in-scope: explicit topic match	pass; full-text verified; report=audit_report.md