
kv-cache

Supercharge Your LLM with the Fastest KV Cache Layer

Python · 4803 stars · updated 8 hours ago

A Redis server and distributed cluster implemented in Go.

Go · 3745 stars · updated 1 month ago

Unified KV Cache Compression Methods for Auto-Regressive Models

Python · 1222 stars · updated 8 months ago

LLM notes covering model inference, Transformer model structure, and LLM framework code analysis.

Python · 812 stars · updated 11 days ago

Implement Llama 3 inference step by step: grasp the core concepts, work through the process derivation, and write the code.

Jupyter Notebook · 605 stars · updated 6 months ago

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

Python · 466 stars · updated 1 year ago
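The heavy-hitter idea can be sketched in a few lines: accumulate each cached token's attention mass, then retain a recency window plus the highest-scoring tokens up to a fixed cache budget. This is a hypothetical illustration of the eviction policy, not the repository's actual API; the function and parameter names are invented.

```python
import numpy as np

def h2o_keep_indices(attn_scores, budget, recent):
    """Pick which cached tokens to keep: the `recent` most recent ones
    plus the accumulated-attention "heavy hitters", up to `budget` total.
    attn_scores[i] is the attention mass token i has received so far."""
    T = len(attn_scores)
    keep = set(range(max(0, T - recent), T))   # always keep the recency window
    for i in np.argsort(attn_scores)[::-1]:    # highest attention mass first
        if len(keep) >= budget:
            break
        keep.add(int(i))
    return sorted(keep)

# Tokens 0 and 2 dominate the attention mass, so with a budget of 4 and a
# recency window of 2 they survive alongside the two newest tokens.
kept = h2o_keep_indices(np.array([5.0, 0.1, 4.0, 0.2, 0.3, 0.1]),
                        budget=4, recent=2)
print(kept)  # [0, 2, 4, 5]
```

The evicted tokens' K/V rows are simply dropped from the cache, which is what makes the memory footprint independent of the full sequence length.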

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Python · 408 stars · updated 15 days ago

Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.

Python · 356 stars · updated 3 months ago
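The memory arithmetic behind that key/value split can be checked with a toy quantizer: at FP16 every K and V element costs 2 bytes, while 8-bit keys cost 1 byte and 4-bit values pack two per byte, an ideal reduction of 62.5% before quantization scales and other overhead bring it closer to the quoted 59%. A minimal sketch under those assumptions; `quantize` is illustrative, not KVSplit's implementation.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization of a float array to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((128, 64)).astype(np.float32)   # cached keys
v = rng.standard_normal((128, 64)).astype(np.float32)   # cached values

k_q, k_scale = quantize(k, 8)   # 8-bit keys: 1 byte per element
v_q, v_scale = quantize(v, 4)   # 4-bit values: packable two per byte

fp16_bytes = 2 * (k.size + v.size)        # FP16 baseline: 2 bytes/element
mixed_bytes = k_q.size + v_q.size // 2    # 1 B/key elem + 0.5 B/value elem
savings = 1 - mixed_bytes / fp16_bytes
print(savings)  # 0.625 ideal; real caches also store scales etc.
```

Keys get the wider format because attention scores are computed directly against them, so key quantization error perturbs the softmax more than value error does.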

Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.

351 stars · updated 6 months ago

HierarchicalKV is part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. Its key capability is storing key-value feature embeddings in the high-bandwidth memory (HBM) of GPUs and in host memory. It can also be used as a generic key-value store.

CUDA · 165 stars · updated 2 days ago
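The hierarchical idea — a small fast tier backed by a larger slow tier, with demotion on overflow and promotion on access — can be sketched with plain dictionaries. This toy `TwoTierKV` class is a hypothetical stand-in (the fast tier playing the role of HBM, the slow tier of host memory), not HierarchicalKV's actual API.

```python
class TwoTierKV:
    """Toy two-tier key-value store: a bounded 'fast' tier backed by an
    unbounded 'slow' tier. Illustrative sketch of hierarchical storage."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = {}   # insertion-ordered; the oldest entry is demoted first
        self.slow = {}

    def put(self, key, value):
        self.fast[key] = value
        if len(self.fast) > self.fast_capacity:
            old_key = next(iter(self.fast))        # oldest fast-tier entry
            self.slow[old_key] = self.fast.pop(old_key)  # demote to slow tier

    def get(self, key):
        if key in self.fast:
            return self.fast[key]
        if key in self.slow:                       # fast miss: promote
            self.put(key, self.slow.pop(key))
            return self.fast[key]
        return None

store = TwoTierKV(fast_capacity=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)          # "a" is demoted to the slow tier
print(store.get("a"))      # 1 — found in the slow tier and promoted
```

The real system makes the same trade at very different scales: hot embeddings stay in HBM for bandwidth, cold ones spill to host memory for capacity.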

Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)

Python · 160 stars · updated 4 months ago

Python · 80 stars · updated 10 months ago

This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to facilitate easy understanding of the key parts of the architecture.

Python · 70 stars · updated 2 years ago
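The inference process such implementations revolve around — autoregressive decoding where each step appends one K/V row to a cache and attends over everything cached so far — can be sketched for a single attention head. A toy sketch with illustrative names and shapes, not this repository's code.

```python
import numpy as np

def decode_step(q, k_new, v_new, cache):
    """One autoregressive decode step of single-head attention.
    q, k_new, v_new are d-dimensional vectors for the current token;
    `cache` holds the K and V rows of all previously decoded tokens."""
    cache["k"].append(k_new)              # cache grows by one row per token
    cache["v"].append(v_new)
    K = np.stack(cache["k"])              # [T, d]
    V = np.stack(cache["v"])              # [T, d]
    scores = K @ q / np.sqrt(len(q))      # attend over all cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax weights, shape [T]
    return w @ V                          # context vector, shape [d]

cache = {"k": [], "v": []}
out = decode_step(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                  np.array([2.0, 3.0]), cache)
print(out)  # [2. 3.] — with a single cached token its weight is 1
```

Because K and V for past tokens are reused from the cache rather than recomputed, each step costs O(T·d) instead of rerunning attention over the whole prefix from scratch — which is also why the cache itself becomes the memory bottleneck that the other projects on this list attack.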

Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)

Python · 63 stars · updated 2 years ago

KV Cache Steering for Inducing Reasoning in Small Language Models

Python · 39 stars · updated 1 month ago

PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]

Python · 29 stars · updated 2 days ago