evaluation

langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

TypeScript · 10,496 stars · updated 10 hours ago
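
Langfuse is used through its SDKs; below is a minimal sketch of tracing a function with the Python SDK's @observe decorator (v2-style API; assumes Langfuse credentials are configured via environment variables, and the LLM call is a hypothetical stand-in).

```python
# Minimal Langfuse tracing sketch (v2-style decorator API; assumes
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are set in the environment).
from langfuse.decorators import observe

@observe()  # records this function call as a trace in Langfuse
def answer(question: str) -> str:
    # call your LLM of choice here; the return value becomes the trace output
    return "Paris"

answer("What is the capital of France?")
```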

explodinggradients/ragas

Supercharge Your LLM Application Evaluations 🚀

Python · 8,860 stars · updated 3 days ago
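
Ragas evaluations follow a build-dataset-then-evaluate pattern; the sketch below assumes the documented metric and column names and an OPENAI_API_KEY for the LLM judge.

```python
# Rough sketch of a Ragas run over a one-row dataset; assumes
# OPENAI_API_KEY is set, since these metrics use an LLM judge.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))
```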

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

TypeScript · 6,230 stars · updated 11 hours ago

open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Python · 5,206 stars · updated 1 day ago

Knetic/govaluate

Arbitrary expression evaluation for golang

Go · 3,866 stars · updated 25 days ago
Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Python · 3,833 stars · updated 2 months ago
MichaelGrupp/evo

Python package for the evaluation of odometry and SLAM.

Python · 3,745 stars · updated 1 month ago
Kiln-AI/Kiln

The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.

Python · 3,390 stars · updated 7 hours ago

CLUEbenchmark/SuperCLUE

SuperCLUE: A comprehensive benchmark for general-purpose foundation models in Chinese.

3,151 stars · updated 1 year ago

viebel/klipse

Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.

HTML · 3,127 stars · updated 7 months ago

ianarawjo/ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

TypeScript · 2,570 stars · updated 16 hours ago

EvolvingLMMs-Lab/lmms-eval

Accelerating the development of large multimodal models (LMMs) with the one-click evaluation module lmms-eval.

Python · 2,360 stars · updated 3 hours ago

uptrain-ai/uptrain

UpTrain is an open-source, unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.

Python · 2,257 stars · updated 8 months ago
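
A hedged sketch of running one of those preconfigured checks through UpTrain's EvalLLM interface; the class, check name, and argument shapes follow the project's quickstart examples, so verify them against the current API.

```python
# Sketch of an UpTrain check run; names follow the documented quickstart
# but should be verified against the current API.
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key
data = [{
    "question": "What is the capital of France?",
    "response": "Paris is the capital of France.",
}]
results = eval_llm.evaluate(data=data, checks=[Evals.RESPONSE_COMPLETENESS])
print(results)
```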

open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

Python · 2,232 stars · updated 3 hours ago

huggingface/evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Python · 2,184 stars · updated 3 months ago
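
The library's core pattern is load-then-compute; a minimal sketch with the built-in accuracy metric:

```python
# Minimal 🤗 Evaluate sketch: load a metric, then compute a score.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}
```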