flash-attention

The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.

Python
17921
23 days ago
ymcui/Chinese-LLaMA-Alpaca-2

Phase 2 of the Chinese LLaMA-2 & Alpaca-2 large model project, including 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models).

Python
7159
7 months ago

Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

Python
6868
2 months ago
xlite-dev/Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, etc.

Python
3857
2 days ago
xlite-dev/CUDA-Learn-Notes

📚Modern CUDA learning notes: 200+ Tensor/CUDA Core kernels🎉, HGEMM, FA2 via MMA and CuTe, reaching 98~100% of cuBLAS/FA2 TFLOPS.

Cuda
3494
5 days ago

FlashInfer: Kernel Library for LLM Serving

Cuda
2690
2 days ago

MoBA: Mixture of Block Attention for Long-Context LLMs

Python
1748
17 days ago

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Python
380
2 days ago

[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme (a baseline contrastive-loss sketch follows this entry).

Python
240
3 months ago
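
For context, the loss Inf-CL makes memory-efficient is the standard symmetric CLIP contrastive (InfoNCE) objective over an image-text similarity matrix. Below is a minimal PyTorch sketch of that baseline only; it is not the repo's tiled, near-infinite-batch implementation, and the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Baseline symmetric InfoNCE over a full B x B similarity matrix.

    Inf-CL's contribution is computing this without materializing the full
    matrix; this sketch shows only the standard formulation.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Entry (i, j) compares image i with text j; matches lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 paired embeddings with dimension 512.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```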

📚FFPA (Split-D): yet another faster FlashAttention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA.

Cuda
168
14 days ago

Triton implementation of FlashAttention-2 that adds custom masks (a plain-PyTorch mask reference follows this entry).

Python
109
8 months ago
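
As a plain-PyTorch reference for what a custom attention mask means here: the sketch below builds an arbitrary boolean mask and passes it to torch's scaled_dot_product_attention. The shapes and the example mask are illustrative; the repo instead applies the mask inside a fused Triton FlashAttention-2 kernel.

```python
import torch
import torch.nn.functional as F

B, H, S, D = 1, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# An arbitrary boolean mask (True = attend, False = block): causal, but also
# limited to a 32-token sliding window behind each query position.
idx = torch.arange(S)
custom_mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < 32)

# Stock SDPA accepts such a mask, but with an arbitrary attn_mask it generally
# cannot use the fused FlashAttention backend -- that is the gap a Triton
# FlashAttention-2 kernel with built-in custom-mask support fills.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=custom_mask)
```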

Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.

Python
95
1 year ago

Decoding Attention is specially optimized for MHA, MQA, GQA and MLA, using CUDA cores for the decoding stage of LLM inference (a naive reference sketch follows this entry).

C++
36
18 days ago
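
A naive PyTorch reference for the decoding-stage problem such a kernel optimizes: one new query token per sequence attending over the cached keys and values, with grouped-query (GQA) head sharing. Function name and shapes are illustrative, not the repo's API.

```python
import torch

def gqa_decode_attention(q, k_cache, v_cache):
    """Single-token decode attention with grouped-query heads.

    q:       [batch, num_q_heads, head_dim]          (the one new token)
    k_cache: [batch, num_kv_heads, seq_len, head_dim]
    v_cache: [batch, num_kv_heads, seq_len, head_dim]
    MQA is the case num_kv_heads == 1; MHA is num_kv_heads == num_q_heads.
    """
    b, hq, d = q.shape
    hkv = k_cache.shape[1]
    group = hq // hkv  # query heads sharing each KV head

    # Expand each KV head across its group of query heads.
    k = k_cache.repeat_interleave(group, dim=1)  # [b, hq, s, d]
    v = v_cache.repeat_interleave(group, dim=1)

    scores = torch.einsum("bhd,bhsd->bhs", q, k) / d**0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v)  # [b, hq, d]

# Example: 32 query heads sharing 8 KV heads over a 1024-token cache.
out = gqa_decode_attention(
    torch.randn(2, 32, 128),
    torch.randn(2, 8, 1024, 128),
    torch.randn(2, 8, 1024, 128),
)
```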

Performance of the C++ interface of FlashAttention and FlashAttention v2 in large language model (LLM) inference scenarios.

C++
35
2 months ago

Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention (a minimal cross-attention sketch follows this entry).

Python
26
6 months ago
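
A minimal sketch of the Perceiver's core trick: a small set of learned latents cross-attends to a long input, so attention cost scales with the latent count rather than the input length. Here it is routed through torch's scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported hardware; module and parameter names are illustrative, not this repo's API.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PerceiverCrossAttention(nn.Module):
    """Latents (few) cross-attend to inputs (many): cost O(num_latents * seq_len)."""

    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: [batch, seq_len, dim]
        b, _, dim = x.shape
        h, dh = self.num_heads, dim // self.num_heads
        q = self.to_q(self.latents).expand(b, -1, -1)   # [b, num_latents, dim]
        k, v = self.to_kv(x).chunk(2, dim=-1)           # each [b, seq_len, dim]

        # Split heads: [b, heads, tokens, head_dim].
        q, k, v = (t.reshape(b, -1, h, dh).transpose(1, 2) for t in (q, k, v))

        # SDPA can use a fused FlashAttention-style kernel when available.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, -1, dim)
        return self.proj(out)                           # [b, num_latents, dim]

# Example: summarize a 4096-token input into 64 latent vectors.
latents = PerceiverCrossAttention()(torch.randn(2, 4096, 256))
```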

Python package for rematerialization-aware gradient checkpointing (a baseline checkpointing sketch follows this entry).

Python
24
1 year ago
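
For reference, the standard (non-rematerialization-aware) form of gradient checkpointing that such a package refines is already available in torch.utils.checkpoint: activations inside a wrapped segment are dropped during the forward pass and recomputed during backward. A minimal sketch, where the model and the four-segment split are illustrative choices; a rematerialization-aware policy would decide what to recompute automatically.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all be stored.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(24)]
)

x = torch.randn(8, 1024, requires_grad=True)

# Split the stack into 4 segments; only segment boundaries keep activations,
# and everything inside a segment is recomputed during the backward pass.
y = checkpoint_sequential(model, 4, x, use_reentrant=False)
y.sum().backward()
```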

A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).

Python
23
2 months ago

Utilities for efficient fine-tuning, inference and evaluation of code generation models

Python
21
2 years ago
Jupyter Notebook
21
1 year ago

🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations

Shell
11
5 months ago