flash-attention

The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.

Python
17921
23 days ago
ymcui/Chinese-LLaMA-Alpaca-2

Phase 2 of the Chinese LLaMA-2 & Alpaca-2 large model project, including 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models).

Python
7159
7 months ago

Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

Python
6868
2 months ago
xlite-dev/Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, etc.

Python
3857
2 days ago
xlite-dev/CUDA-Learn-Notes

📚Modern CUDA learning notes: 200+ Tensor/CUDA Core kernels🎉, HGEMM, FA2 via MMA and CuTe, reaching 98~100% of cuBLAS/FA2 TFLOPS.

Cuda
3494
5 days ago

FlashInfer: Kernel Library for LLM Serving

Cuda
2690
2 days ago

MoBA: Mixture of Block Attention for Long-Context LLMs

Python
1748
17 days ago

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Python
380
2 days ago

[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme (a baseline contrastive-loss sketch follows this entry).

Python
240
3 months ago
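
For context, the loss Inf-CL makes memory-efficient is the standard symmetric CLIP contrastive (InfoNCE) objective over an image-text similarity matrix. Below is a minimal PyTorch sketch of that baseline only; it is not the repo's tiled, near-infinite-batch implementation, and the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Baseline symmetric InfoNCE over a full B x B similarity matrix.

    Inf-CL's contribution is computing this without materializing the full
    matrix; this sketch shows only the standard formulation.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Entry (i, j) compares image i with text j; matches lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 paired embeddings with dimension 512.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```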

📚FFPA (Split-D): yet another faster FlashAttention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA.

Cuda
168
14 days ago

Triton implementation of FlashAttention-2 that adds custom masks (a plain-PyTorch mask reference follows this entry).

Python
109
8 months ago
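
As a plain-PyTorch reference for what a custom attention mask means here: the sketch below builds an arbitrary boolean mask and passes it to torch's scaled_dot_product_attention. The shapes and the example mask are illustrative; the repo instead applies the mask inside a fused Triton FlashAttention-2 kernel.

```python
import torch
import torch.nn.functional as F

B, H, S, D = 1, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# An arbitrary boolean mask (True = attend, False = block): causal, but also
# limited to a 32-token sliding window behind each query position.
idx = torch.arange(S)
custom_mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < 32)

# Stock SDPA accepts such a mask, but with an arbitrary attn_mask it generally
# cannot use the fused FlashAttention backend -- that is the gap a Triton
# FlashAttention-2 kernel with built-in custom-mask support fills.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=custom_mask)
```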

Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.

Python
95
1 year ago

Decoding Attention is specially optimized for MHA, MQA, GQA and MLA, using CUDA cores for the decoding stage of LLM inference (a naive reference sketch follows this entry).

C++
36
18 days ago
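
A naive PyTorch reference for the decoding-stage problem such a kernel optimizes: one new query token per sequence attending over the cached keys and values, with grouped-query (GQA) head sharing. Function name and shapes are illustrative, not the repo's API.

```python
import torch

def gqa_decode_attention(q, k_cache, v_cache):
    """Single-token decode attention with grouped-query heads.

    q:       [batch, num_q_heads, head_dim]          (the one new token)
    k_cache: [batch, num_kv_heads, seq_len, head_dim]
    v_cache: [batch, num_kv_heads, seq_len, head_dim]
    MQA is the case num_kv_heads == 1; MHA is num_kv_heads == num_q_heads.
    """
    b, hq, d = q.shape
    hkv = k_cache.shape[1]
    group = hq // hkv  # query heads sharing each KV head

    # Expand each KV head across its group of query heads.
    k = k_cache.repeat_interleave(group, dim=1)  # [b, hq, s, d]
    v = v_cache.repeat_interleave(group, dim=1)

    scores = torch.einsum("bhd,bhsd->bhs", q, k) / d**0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v)  # [b, hq, d]

# Example: 32 query heads sharing 8 KV heads over a 1024-token cache.
out = gqa_decode_attention(
    torch.randn(2, 32, 128),
    torch.randn(2, 8, 1024, 128),
    torch.randn(2, 8, 1024, 128),
)
```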

Performance of the C++ interface of FlashAttention and FlashAttention v2 in large language model (LLM) inference scenarios.

C++
35
2 months ago

Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention (a minimal cross-attention sketch follows this entry).

Python
26
6 months ago
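
A minimal sketch of the Perceiver's core trick: a small set of learned latents cross-attends to a long input, so attention cost scales with the latent count rather than the input length. Here it is routed through torch's scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported hardware; module and parameter names are illustrative, not this repo's API.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PerceiverCrossAttention(nn.Module):
    """Latents (few) cross-attend to inputs (many): cost O(num_latents * seq_len)."""

    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: [batch, seq_len, dim]
        b, _, dim = x.shape
        h, dh = self.num_heads, dim // self.num_heads
        q = self.to_q(self.latents).expand(b, -1, -1)   # [b, num_latents, dim]
        k, v = self.to_kv(x).chunk(2, dim=-1)           # each [b, seq_len, dim]

        # Split heads: [b, heads, tokens, head_dim].
        q, k, v = (t.reshape(b, -1, h, dh).transpose(1, 2) for t in (q, k, v))

        # SDPA can use a fused FlashAttention-style kernel when available.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, -1, dim)
        return self.proj(out)                           # [b, num_latents, dim]

# Example: summarize a 4096-token input into 64 latent vectors.
latents = PerceiverCrossAttention()(torch.randn(2, 4096, 256))
```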

Python package for rematerialization-aware gradient checkpointing (a baseline checkpointing sketch follows this entry).

Python
24
1 year ago
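
For reference, the standard (non-rematerialization-aware) form of gradient checkpointing that such a package refines is already available in torch.utils.checkpoint: activations inside a wrapped segment are dropped during the forward pass and recomputed during backward. A minimal sketch, where the model and the four-segment split are illustrative choices; a rematerialization-aware policy would decide what to recompute automatically.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all be stored.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(24)]
)

x = torch.randn(8, 1024, requires_grad=True)

# Split the stack into 4 segments; only segment boundaries keep activations,
# and everything inside a segment is recomputed during the backward pass.
y = checkpoint_sequential(model, 4, x, use_reentrant=False)
y.sum().backward()
```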

A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).

Python
23
2 months ago

Utilities for efficient fine-tuning, inference and evaluation of code generation models

Python
21
2 years ago
Jupyter Notebook
21
1 year ago

🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations

Shell
11
5 months ago