evaluation

mlflow/mlflow

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

Python
21670
5 hours ago
langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

TypeScript
15152
4 hours ago

Supercharge Your LLM Application Evaluations 🚀

Python
10394
10 hours ago

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

Python
8418
2 hours ago

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

TypeScript
8050
5 minutes ago

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Python
5880
7 days ago

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

Go
4715
8 hours ago
Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Python
4191
2 months ago
Kiln-AI/Kiln
Python
4056
12 minutes ago
MichaelGrupp/evo
Python
3920
18 days ago

Arbitrary expression evaluation for golang

Go
3895
5 months ago

SuperCLUE: A Comprehensive Benchmark for General-Purpose Foundation Models in Chinese

3241
4 months ago

Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.

HTML
3132
1 year ago

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Python
2988
5 hours ago

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

Python
2921
6 days ago

An open-source visual programming environment for battle-testing prompts to LLMs.

TypeScript
2724
3 days ago