linear-attention
RWKV (pronounced RwaKuv) is an RNN with great LLM performance that can also be trained directly like a GPT transformer (parallelizable). The current version is RWKV-7 "Goose". It combines the best of RNN and transformer: great performance, linear time, constant space (no KV cache), fast training, infinite ctx_len, and free sentence embedding. (A minimal sketch of the linear-attention recurrence behind these properties appears after the repository list below.)
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
[NeurIPS 2024] Official code of "LION: Linear Group RNN for 3D Object Detection in Point Clouds"
Explorations into the recently proposed Taylor Series Linear Attention
Implementation of Agent Attention in Pytorch
Semantic segmentation of remote sensing images
[NeurIPS 2025 Oral] Exploring Diffusion Transformer Designs via Grafting
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
CUDA implementation of autoregressive linear attention, with all the latest research findings
Official implementation of "MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map" (NeurIPS 2024 Oral)
Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)
Code for the paper "Cottention: Linear Transformers With Cosine Attention"
Implementation of: Hydra Attention: Efficient Attention with Many Heads (https://arxiv.org/abs/2209.07484)
RWKV Wiki website (archived, please visit official wiki)
[ICML 2024] Official implementation of "LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions."
Official Implementation of SEA: Sparse Linear Attention with Estimated Attention Mask (ICLR 2024)
LEAP: Linear Explainable Attention in Parallel for causal language modeling with O(1) path length and O(1) inference
🔍 Enhance your workflow with Houtini LM, an MCP server that offloads code analysis and documentation tasks to LM Studio, streamlining your development process.
SAUTE is a lightweight transformer-based architecture adapted for dialog modeling
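To illustrate why the linear-attention family above can run in linear time and constant memory at inference, here is a minimal sketch of the generic kernelized linear-attention recurrence, processed token by token with a fixed-size state. This is an illustrative toy, not RWKV's actual formulation or any listed repository's code; the `elu(x) + 1` feature map and the `linear_attention_recurrent` helper are assumptions chosen for the example.

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Generic kernelized linear attention run as an RNN (illustrative toy).

    The state is a single (d_k, d_v) matrix plus a (d_k,) normalizer, so
    memory stays constant in sequence length: no KV cache is needed.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1   # positive feature map (an assumption)
    d_k, d_v = q.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)   # running sum of outer products phi(k_t) v_t^T
    z = torch.zeros(d_k)        # running sum of phi(k_t) for normalization
    outputs = []
    for t in range(q.shape[0]):
        qt, kt, vt = phi(q[t]), phi(k[t]), v[t]
        S = S + torch.outer(kt, vt)                  # O(d_k * d_v) work per token
        z = z + kt
        outputs.append((qt @ S) / (qt @ z + 1e-6))   # causal attention output for token t
    return torch.stack(outputs)

# Toy usage: 8 tokens, head dimension 16; output has shape (8, 16).
T, d = 8, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
print(linear_attention_recurrent(q, k, v).shape)
```

The per-token cost and the state size are independent of sequence length, which is the constant-space, no-KV-cache behavior described for RWKV above.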