flash-attention

The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.

Python
19092
15 days ago
ymcui/Chinese-LLaMA-Alpaca-2

Phase 2 of the Chinese LLaMA-2 & Alpaca-2 large language model project, including 64K ultra-long-context models.

Python
7175
1 month ago

Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

Python
7027
1 month ago
xlite-dev/LeetCUDA

📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, and FA-2 MMA. 🎉

Cuda
6311
17 hours ago
xlite-dev/Awesome-LLM-Inference

📚 A curated list of awesome LLM/VLM inference papers with code: FlashAttention, PagedAttention, WINT8/4, parallelism, etc. 🎉

Python
4403
17 hours ago

MoBA: Mixture of Block Attention for Long-Context LLMs

Python
1864
5 months ago

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Python
403
1 month ago

[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme (the tiling idea is sketched below).

Python
262
7 months ago
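
The memory saving in such schemes comes from never materializing the full N×N similarity matrix. Below is a minimal, forward-pass-only sketch of that general idea in plain PyTorch: the log-sum-exp over text candidates is accumulated tile by tile with a running max, so peak logit memory scales with the tile size rather than the batch size. This is an illustration of the technique only, not Inf-CL's implementation (which also handles the backward pass and distributes tiles across GPUs); `tiled_infonce` and `tile` are hypothetical names, and only the image-to-text direction is shown.

```python
import torch

def tiled_infonce(img, txt, tile=1024, scale=100.0):
    """Image-to-text InfoNCE computed tile by tile (forward pass only).

    img, txt: (N, D) L2-normalized embeddings; row i of img matches row i of txt.
    The full (N, N) logit matrix is never materialized: each (tile, tile) block
    is folded into a running max / running sum-exp, exactly like online softmax.
    """
    n = img.shape[0]
    loss = img.new_zeros(())
    for i0 in range(0, n, tile):
        q = img[i0:i0 + tile]                                  # (bi, D)
        m = q.new_full((q.shape[0],), float("-inf"))           # running row max
        s = q.new_zeros(q.shape[0])                            # running sum of exp
        pos = q.new_zeros(q.shape[0])                          # logit of the matching text
        for j0 in range(0, n, tile):
            k = txt[j0:j0 + tile]                              # (bj, D)
            logits = scale * (q @ k.T)                         # (bi, bj) tile
            m_new = torch.maximum(m, logits.max(dim=1).values)
            s = s * torch.exp(m - m_new) + torch.exp(logits - m_new[:, None]).sum(dim=1)
            m = m_new
            if j0 == i0:                                       # diagonal tile holds the positives
                pos = logits.diagonal().clone()
        loss = loss + ((m + s.log()) - pos).sum()              # -log softmax of the positive
    return loss / n

feats_i = torch.nn.functional.normalize(torch.randn(4096, 512), dim=-1)
feats_t = torch.nn.functional.normalize(torch.randn(4096, 512), dim=-1)
print(tiled_infonce(feats_i, feats_t, tile=512))
```

With `tile=512`, the peak activation cost of the logits is O(N·tile) per row block instead of O(N²) for the whole batch.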

🤖 FFPA: extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large head dimensions and a 1.8x~3x speedup 🎉 over SDPA EA.

Cuda
207
12 days ago

Triton implementation of FlashAttention-2 that adds support for custom masks (reference masking semantics are sketched below).

Python
132
1 year ago
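
The stock FlashAttention-2 kernels only expose causal masking; arbitrary masks are what this repo adds. As a point of reference for the intended semantics (not this repo's Triton API), PyTorch's built-in `scaled_dot_product_attention` accepts a boolean or additive `attn_mask`, for example a block-diagonal mask for packed sequences:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, S, D = 2, 8, 256, 64
q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))

# Custom boolean mask: True = may attend, False = blocked. Here, a block-diagonal
# mask that keeps two packed 128-token segments from attending to each other.
seg = torch.arange(S, device=device) // 128                    # segment id per position
mask = (seg[:, None] == seg[None, :]).view(1, 1, S, S)         # broadcast over batch & heads

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # (B, H, S, D)
```

Note that when an arbitrary `attn_mask` is supplied, PyTorch typically cannot dispatch to its FlashAttention backend and falls back to a slower path, which is exactly the gap a custom-mask Triton kernel fills.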

Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode; faster than ZeRO/ZeRO++/FSDP (a setup sketch follows below).

Python
98
2 years ago
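
Pipeline mode requires the model to be expressed as a flat sequence of layers so DeepSpeed can cut it into stages. A hedged sketch of that setup with DeepSpeed's `PipelineModule`, using a toy layer stack rather than this repo's BLOOM/LLaMA wiring; it assumes a multi-GPU run started with the `deepspeed` launcher:

```python
import deepspeed
import torch
import torch.nn as nn
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()                       # pipeline parallelism needs torch.distributed

# Express the network as a flat list of layers so DeepSpeed can split it into stages.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
model = PipelineModule(layers=layers, num_stages=2, loss_fn=nn.MSELoss())

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config={
        "train_batch_size": 32,
        "train_micro_batch_size_per_gpu": 4,       # 8 micro-batches keep the pipeline busy
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    },
)

# The iterator yields (input, label) micro-batches; train_batch runs the pipelined schedule.
def data_iter():
    while True:
        x = torch.randn(4, 1024)
        yield (x, x)

loss = engine.train_batch(data_iter=data_iter())   # run e.g. with: deepspeed --num_gpus 2 train.py
```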

Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.

C++
40
6 months ago

Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores in the decoding stage of LLM inference (a reference decode computation is sketched below).

C++
40
2 months ago
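
In the decoding stage there is a single query token per sequence attending over the whole KV cache, and for MQA/GQA several query heads share one KV head. A plain-PyTorch reference of that computation, i.e. what such CUDA kernels accelerate; `gqa_decode_attention` is an illustrative name, not this repo's API:

```python
import torch

def gqa_decode_attention(q, k_cache, v_cache):
    """One decoding step of grouped-query attention.

    q:        (B, Hq, D)     single new query token per sequence
    k_cache:  (B, Hkv, T, D) cached keys; Hq must be a multiple of Hkv
    v_cache:  (B, Hkv, T, D) cached values
    returns:  (B, Hq, D)
    """
    B, Hq, D = q.shape
    Hkv = k_cache.shape[1]
    group = Hq // Hkv
    # Expand each KV head to serve its group of query heads (MHA: group=1, MQA: Hkv=1).
    k = k_cache.repeat_interleave(group, dim=1)          # (B, Hq, T, D)
    v = v_cache.repeat_interleave(group, dim=1)
    scores = torch.einsum("bhd,bhtd->bht", q, k) / D ** 0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("bht,bhtd->bhd", probs, v)

# Example: 32 query heads sharing 8 KV heads over a 4096-token cache.
q = torch.randn(1, 32, 128)
k_cache = torch.randn(1, 8, 4096, 128)
v_cache = torch.randn(1, 8, 4096, 128)
out = gqa_decode_attention(q, k_cache, v_cache)          # (1, 32, 128)
```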

Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention (the core cross-attention call is sketched below).

Python
26
9 months ago
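
The Perceiver's core move is cross-attention from a small set of learned latents to a long input array, which is exactly the kind of call that benefits from a FlashAttention-backed kernel. A minimal sketch using PyTorch's `scaled_dot_product_attention` (which can dispatch to FlashAttention on supported GPUs); the module and dimension names are illustrative, not this repo's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCrossAttention(nn.Module):
    """Perceiver-style block: N learned latents attend to a long input sequence."""

    def __init__(self, dim=256, num_latents=64, heads=8):
        super().__init__()
        self.heads = heads
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                       # x: (B, T, dim), T can be very large
        B, T, dim = x.shape
        hd = dim // self.heads
        q = self.to_q(self.latents).unsqueeze(0).expand(B, -1, -1)   # (B, N, dim)
        k, v = self.to_kv(x).chunk(2, dim=-1)                        # (B, T, dim) each
        # split heads -> (B, heads, seq, head_dim)
        q, k, v = (t.reshape(B, -1, self.heads, hd).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)                # FlashAttention-eligible call
        out = out.transpose(1, 2).reshape(B, -1, dim)                # (B, N, dim)
        return self.proj(out)

block = LatentCrossAttention()
y = block(torch.randn(2, 8192, 256))            # (2, 64, 256): long input compressed to 64 latents
```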

Python package for rematerialization-aware gradient checkpointing (the baseline checkpointing technique is sketched below).

Python
25
2 years ago
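
For context, the baseline technique such a package refines: standard gradient checkpointing drops intermediate activations in the forward pass and rematerializes them by re-running the checkpointed segment during backward, trading compute for memory. A minimal sketch with PyTorch's built-in `torch.utils.checkpoint` (the generic mechanism only, not this package's rematerialization-aware planner):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList([nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(12)])

def forward(x):
    for block in blocks:
        # Activations inside `block` are not stored; they are recomputed during backward.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(8, 512, requires_grad=True)
forward(x).sum().backward()      # re-runs each block's forward once during the backward pass
```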

A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX); a backend-selection sketch with stock JAX follows below.

Python
23
6 months ago
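
For comparison with the multi-backend idea, recent stock JAX already exposes a backend switch on its built-in attention op; this repo generalizes that across Triton/Pallas/pure-JAX paths. A hedged sketch using `jax.nn.dot_product_attention` (plain JAX, not this repo's API; the `implementation` argument requires a reasonably recent JAX release):

```python
import jax
import jax.numpy as jnp

B, T, H, D = 2, 1024, 8, 64                       # layout is (batch, seq, num_heads, head_dim)
key = jax.random.PRNGKey(0)
q, k, v = (jax.random.normal(kk, (B, T, H, D), dtype=jnp.bfloat16)
           for kk in jax.random.split(key, 3))

# implementation=None lets JAX choose; "cudnn" requests the fused FlashAttention path
# on supported GPUs, while "xla" forces the portable fallback that also runs on CPU/TPU.
out = jax.nn.dot_product_attention(q, k, v, is_causal=True, implementation="xla")
print(out.shape)                                  # (2, 1024, 8, 64)
```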

Utilities for efficient fine-tuning, inference and evaluation of code generation models

Python
21
2 years ago

Jupyter Notebook
20
2 years ago

Flash Attention & friends in pure Julia (the core online-softmax recurrence is sketched below).

Julia
12
9 days ago
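
Several of the repos above reimplement the same core algorithm, so here is a minimal NumPy sketch of the tiled online-softmax recurrence at the heart of FlashAttention: keys and values are streamed in blocks, and a running max plus a running normalizer let the partial output be rescaled on the fly, so the full S×S score matrix never exists. Purely illustrative (single head, no masking, not any listed repo's actual kernel):

```python
import numpy as np

def flash_attention_forward(q, k, v, block=128):
    """Single-head attention computed block-by-block over keys/values.

    q, k, v: (S, D). Equivalent to softmax(q @ k.T / sqrt(D)) @ v, but only
    (S, block) score tiles are ever materialized.
    """
    S, D = q.shape
    scale = 1.0 / np.sqrt(D)
    out = np.zeros_like(q, dtype=np.float64)
    m = np.full(S, -np.inf)          # running row max of the scores
    l = np.zeros(S)                  # running row sum of exp(scores - m)
    for j in range(0, S, block):
        s = (q @ k[j:j + block].T) * scale                 # (S, block) score tile
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])                     # unnormalized tile probabilities
        alpha = np.exp(m - m_new)                          # rescale factor for earlier blocks
        l = alpha * l + p.sum(axis=1)
        out = alpha[:, None] * out + p @ v[j:j + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 512, 64))
s = q @ k.T / np.sqrt(64)                                  # dense reference for comparison
ref = np.exp(s - s.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(flash_attention_forward(q, k, v), ref)
```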