evaluation
An open-source developer platform for building AI/LLM applications and models with confidence, offering end-to-end tracking, observability, and evaluation in one place.
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
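As a sketch of how this kind of tracing is wired in, Langfuse's Python SDK documents an `observe` decorator that records a function call as a trace (the import path varies by SDK version, and the function below is a stand-in for a real LLM call):

```python
# Minimal Langfuse tracing sketch; assumes LANGFUSE_PUBLIC_KEY and
# LANGFUSE_SECRET_KEY are set in the environment.
from langfuse import observe  # older v2 SDKs: from langfuse.decorators import observe

@observe()  # records this call as a trace, capturing inputs and outputs
def answer(question: str) -> str:
    # Stand-in for a real LLM call; nested calls would appear as spans.
    return f"echo: {question}"

print(answer("What does Langfuse trace?"))
```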
🤘 awesome-semantic-segmentation
Supercharge Your LLM Application Evaluations 🚀
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM/VLM!
Test your prompts, agents, and RAG pipelines. AI red-teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration (see the config sketch below).
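To illustrate the declarative style, a config of this kind pairs prompts and providers with per-test assertions; the sketch below follows promptfoo's documented `promptfooconfig.yaml` layout, with placeholder prompt text, model IDs, and assertion values:

```yaml
# Illustrative promptfooconfig.yaml sketch; models and values are placeholders.
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-latest
tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      - type: contains
        value: "fox"
```

Running `npx promptfoo eval` against such a file evaluates every provider on every test case, which is what makes the command-line and CI/CD integration straightforward.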
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Next-generation AI agent optimization platform: CozeLoop provides full-lifecycle management for AI agents, from development and debugging through evaluation to monitoring.
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
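This description matches Helicone's README. For proxy-style observability tools of this kind, the advertised one line is typically a base-URL swap on the existing OpenAI client; the gateway URL and auth header below follow Helicone's documented pattern but should be treated as an assumption:

```python
import os
from openai import OpenAI

# Route OpenAI traffic through the observability gateway. The base_url
# swap is the "one line"; the extra header authenticates your account.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```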
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
Python package for the evaluation of odometry and SLAM
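As a sketch of a typical evo workflow (based on its documented Python API; the trajectory file paths are placeholders): load two TUM-format trajectories, associate them by timestamp, and compute the absolute pose error (APE):

```python
# Sketch using evo's documented Python API; file paths are placeholders.
from evo.core import metrics, sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("ground_truth.txt")
traj_est = file_interface.read_tum_trajectory_file("estimate.txt")

# Match poses between the two trajectories by timestamp.
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)

# Absolute pose error on the translation component.
ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
print("APE RMSE:", ape.get_statistic(metrics.StatisticsType.rmse))
```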
Arbitrary expression evaluation for Go
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
SuperCLUE: A Comprehensive Benchmark for General-Purpose Chinese Foundation Models
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
End-to-end automatic speech recognition for Mandarin and English in TensorFlow
An open-source visual programming environment for battle-testing prompts to LLMs.