kv-cache
A Redis server and distributed cluster implemented in Go.
MemOS (Preview) | Intelligence Begins with Memory
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
Implement Llama 3 inference step by step: grasp the core concepts, work through the process derivation, and write the code.
LLM KV cache compression made easy
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
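The heavy-hitter policy is easy to picture: keep the small set of cached tokens that have accumulated the most attention, plus a recent window, and evict everything else. A minimal NumPy sketch of that eviction rule (illustrative only, not the authors' code; the budget names are made up):

```python
import numpy as np

def h2o_evict(attn_scores, heavy_budget, recent_budget):
    """Pick which KV-cache positions to keep, H2O-style.

    attn_scores: (seq_len,) accumulated attention each cached token
                 has received from later queries.
    Returns a boolean mask over positions: True = keep.
    """
    seq_len = attn_scores.shape[0]
    keep = np.zeros(seq_len, dtype=bool)
    # Always keep the most recent tokens (the local window).
    keep[max(0, seq_len - recent_budget):] = True
    # Among the older tokens, keep the "heavy hitters": those with
    # the largest accumulated attention mass.
    older = np.where(~keep)[0]
    if older.size > 0 and heavy_budget > 0:
        heavy = older[np.argsort(attn_scores[older])[-heavy_budget:]]
        keep[heavy] = True
    return keep

# Toy usage: 16 cached tokens, keep 4 heavy hitters + 4 recent ones.
scores = np.random.rand(16)
mask = h2o_evict(scores, heavy_budget=4, recent_budget=4)
print(mask.sum(), "of", len(mask), "KV entries kept")
```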
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
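The underlying trick, different precision for keys versus values, can be shown with plain per-tensor affine quantization. A rough Python sketch (KVSplit itself is a Metal-optimized implementation; the helper below is an illustration, and real code would pack two 4-bit values per byte):

```python
import numpy as np

def quantize(x, bits):
    """Uniform affine quantization to the given bit width."""
    qmax = 2**bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)  # 4-bit values left unpacked here
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Keys are more sensitive to quantization error than values,
# so store them at 8 bits and values at 4 bits.
k = np.random.randn(128, 64).astype(np.float32)
v = np.random.randn(128, 64).astype(np.float32)
qk = quantize(k, bits=8)   # K8
qv = quantize(v, bits=4)   # V4

print("key error: ", np.abs(dequantize(*qk) - k).mean())
print("value error:", np.abs(dequantize(*qv) - v).mean())
```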
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on high-bandwidth memory (HBM) of GPUs and in host memory. It can also be used as a generic key-value store.
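The hierarchical idea, a small fast tier backed by a larger slow one with promotion and spill between them, can be mimicked in a toy Python class (HierarchicalKV is a CUDA/C++ library; the LRU spill policy below is an assumption for illustration):

```python
from collections import OrderedDict

class TwoTierKV:
    """Toy two-tier key-value store: a small 'hot' tier standing in
    for GPU HBM, spilling least-recently-used entries to a larger
    'cold' tier standing in for host memory. Illustrative only."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # fast tier (HBM)
        self.cold = {}             # slow tier (host memory)
        self.hot_capacity = hot_capacity

    def put(self, key, embedding):
        self.hot[key] = embedding
        self.hot.move_to_end(key)
        # Spill LRU entries when the fast tier is full.
        while len(self.hot) > self.hot_capacity:
            old_key, old_val = self.hot.popitem(last=False)
            self.cold[old_key] = old_val

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            # Promote back to the fast tier on access.
            self.put(key, self.cold.pop(key))
            return self.hot[key]
        return None

store = TwoTierKV(hot_capacity=2)
for i in range(4):
    store.put(i, [float(i)] * 4)   # feature embeddings
print(sorted(store.hot), sorted(store.cold))  # [2, 3] [0, 1]
```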
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
Completion After Prompt Probability: make your LLM make a choice.
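In other words: score each candidate completion by the probability the model assigns to it after the prompt, and pick the argmax. A hedged sketch with Hugging Face transformers (the gpt2 checkpoint and the helper name are illustrative, not the package's own API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(prompt, completion):
    """Sum of log P(completion tokens | prompt + preceding tokens).

    Assumes tokenizing prompt + completion yields the prompt's tokens
    followed by the completion's (usually true with a leading space).
    """
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by logits at position i-1.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

prompt = "The capital of France is"
choices = [" Paris", " Berlin", " Madrid"]
best = max(choices, key=lambda c: completion_logprob(prompt, c))
print(best)  # expected: " Paris"
```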
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to facilitate easy understanding of the key parts of the architecture.
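Central to that inference process is the KV cache itself: during autoregressive decoding, each step appends one key/value pair instead of recomputing attention over the whole prefix. A bare-bones single-head sketch (illustrative, not taken from the repository):

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """Single-head attention of one new query over all cached keys/values."""
    scores = k_cache @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

d = 8
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
for step in range(4):
    k, v, q = (np.random.randn(d) for _ in range(3))
    # Append this step's key/value; past entries are reused, not recomputed.
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])
    out = attend(q, k_cache, v_cache)
print(k_cache.shape)  # (4, 8): one cached key per decoded token
```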
Notes about the LLaMA 2 model.
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
KV Cache Steering for Inducing Reasoning in Small Language Models
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
Mistral and Mixtral (MoE) from scratch