kv-cache
A Redis server and distributed cluster implemented in Go.
MemOS (Preview) | Intelligence Begins with Memory
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
Implement Llama 3 inference step by step: grasp the core concepts, work through the process derivation, and write the code.
LLM KV cache compression made easy
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
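The heavy-hitter policy is easy to picture: keep the small set of cached tokens that have accumulated the most attention, plus a recent window, and evict everything else. A minimal NumPy sketch of that eviction rule (illustrative only, not the authors' code; the budget names are made up):

```python
import numpy as np

def h2o_evict(attn_scores, heavy_budget, recent_budget):
    """Pick which KV-cache positions to keep, H2O-style.

    attn_scores: (seq_len,) accumulated attention each cached token
                 has received from later queries.
    Returns a boolean mask over positions: True = keep.
    """
    seq_len = attn_scores.shape[0]
    keep = np.zeros(seq_len, dtype=bool)
    # Always keep the most recent tokens (the local window).
    keep[max(0, seq_len - recent_budget):] = True
    # Among the older tokens, keep the "heavy hitters": those with
    # the largest accumulated attention mass.
    older = np.where(~keep)[0]
    if older.size > 0 and heavy_budget > 0:
        heavy = older[np.argsort(attn_scores[older])[-heavy_budget:]]
        keep[heavy] = True
    return keep

# Toy usage: 16 cached tokens, keep 4 heavy hitters + 4 recent ones.
scores = np.random.rand(16)
mask = h2o_evict(scores, heavy_budget=4, recent_budget=4)
print(mask.sum(), "of", len(mask), "KV entries kept")
```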
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
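The underlying trick, different precision for keys versus values, can be shown with plain per-tensor affine quantization. A rough Python sketch (KVSplit itself is a Metal-optimized implementation; the helper below is an illustration, and real code would pack two 4-bit values per byte):

```python
import numpy as np

def quantize(x, bits):
    """Uniform affine quantization to the given bit width."""
    qmax = 2**bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)  # 4-bit values left unpacked here
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Keys are more sensitive to quantization error than values,
# so store them at 8 bits and values at 4 bits.
k = np.random.randn(128, 64).astype(np.float32)
v = np.random.randn(128, 64).astype(np.float32)
qk = quantize(k, bits=8)   # K8
qv = quantize(v, bits=4)   # V4

print("key error: ", np.abs(dequantize(*qk) - k).mean())
print("value error:", np.abs(dequantize(*qv) - v).mean())
```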
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on high-bandwidth memory (HBM) of GPUs and in host memory. It can also be used as a generic key-value store.
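The hierarchical idea, a small fast tier backed by a larger slow one with promotion and spill between them, can be mimicked in a toy Python class (HierarchicalKV is a CUDA/C++ library; the LRU spill policy below is an assumption for illustration):

```python
from collections import OrderedDict

class TwoTierKV:
    """Toy two-tier key-value store: a small 'hot' tier standing in
    for GPU HBM, spilling least-recently-used entries to a larger
    'cold' tier standing in for host memory. Illustrative only."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # fast tier (HBM)
        self.cold = {}             # slow tier (host memory)
        self.hot_capacity = hot_capacity

    def put(self, key, embedding):
        self.hot[key] = embedding
        self.hot.move_to_end(key)
        # Spill LRU entries when the fast tier is full.
        while len(self.hot) > self.hot_capacity:
            old_key, old_val = self.hot.popitem(last=False)
            self.cold[old_key] = old_val

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            # Promote back to the fast tier on access.
            self.put(key, self.cold.pop(key))
            return self.hot[key]
        return None

store = TwoTierKV(hot_capacity=2)
for i in range(4):
    store.put(i, [float(i)] * 4)   # feature embeddings
print(sorted(store.hot), sorted(store.cold))  # [2, 3] [0, 1]
```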
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
Completion After Prompt Probability: make your LLM make a choice.
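In other words: score each candidate completion by the probability the model assigns to it after the prompt, and pick the argmax. A hedged sketch with Hugging Face transformers (the gpt2 checkpoint and the helper name are illustrative, not the package's own API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(prompt, completion):
    """Sum of log P(completion tokens | prompt + preceding tokens).

    Assumes tokenizing prompt + completion yields the prompt's tokens
    followed by the completion's (usually true with a leading space).
    """
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by logits at position i-1.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

prompt = "The capital of France is"
choices = [" Paris", " Berlin", " Madrid"]
best = max(choices, key=lambda c: completion_logprob(prompt, c))
print(best)  # expected: " Paris"
```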
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to facilitate easy understanding of the key parts of the architecture.
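Central to that inference process is the KV cache itself: during autoregressive decoding, each step appends one key/value pair instead of recomputing attention over the whole prefix. A bare-bones single-head sketch (illustrative, not taken from the repository):

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """Single-head attention of one new query over all cached keys/values."""
    scores = k_cache @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

d = 8
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
for step in range(4):
    k, v, q = (np.random.randn(d) for _ in range(3))
    # Append this step's key/value; past entries are reused, not recomputed.
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])
    out = attend(q, k_cache, v_cache)
print(k_cache.shape)  # (4, 8): one cached key per decoded token
```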
Notes about the LLaMA 2 model.
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
KV Cache Steering for Inducing Reasoning in Small Language Models
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
Mistral and Mixtral (MoE) from scratch