| 2502.18845v2 | To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. | supported | has evidence row | full-text | During inference, SWAT maintains linear computational complexity through sliding window attention while preserving model performance, achieving state-of-the-art (SOTA) results on eight commonsense reasoning benchmarks compared to mainstream linear recurrent architectures. | p1 | in-scope: explicit topic match | pass; full-text verified; report=audit_report.md |
| 2406.14963v1 | To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers. | supported | has evidence row | full-text | To mitigate the quality degradation, we propose asymmetric GQA (AsymGQA), an activation-informed merging approach that considers similarity between layers. | p1 | in-scope: explicit topic match | pass; full-text verified; report=audit_report.md |
| 2511.14712v2 | • We identify the limitations of prior resolution extrapolation methods for image generation when applied to videos and propose an inward sliding-window attention mechanism complemented by a dual-path strategy to effectively overcome these issues. | supported | has evidence row | full-text | Ablation Study Contribution of Proposed Strategy. To validate the effectiveness of our proposed method, we conduct detailed ablation studies on Wan2.1-1.3B [43] as shown in Tab. 3. | p1 | in-scope: taxonomy category match | pass; full-text verified; report=audit_report.md |
| 2512.10411v5 | To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug and play toolkit of recipes that adapts FA models to SWA without costly pretraining. | supported | has evidence row | full-text | For instance, configuring odd numbered layers with FA vastly outperforms even numbered ones for Qwen3-4B (row 13 vs. row 11). | p1 | in-scope: explicit topic match | pass; full-text verified; report=audit_report.md |
| 2512.07011v1 | We propose Block-Sparse FlashAttention (BSFA), which takes a fundamentally different approach: compute all query-key scores exactly to determine importance, then use these scores to decide which value blocks to process. | supported | has evidence row | full-text | We propose Block-Sparse FlashAttention (BSFA), which takes a fundamentally different approach: compute all query-key scores exactly to determine importance, then use these scores to decide which value blocks to process. | p2 | in-scope: explicit topic match | pass; full-text verified; report=audit_report.md |