evals
The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.
AI Observability & Evaluation
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
Evaluation and Tracking for LLM Experiments and AI Agents
Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
Evaluate your LLM-powered apps with TypeScript
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.
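The LLM-based scoring in an entry like this one follows the LLM-as-judge pattern: show a judge model the tool call and its result, and ask for a numeric rating against a rubric. Below is a minimal Python sketch of that pattern, not the Node.js package's actual implementation; the judge model, rubric wording, and score parsing are illustrative assumptions.

```python
import re
from openai import OpenAI  # assumption: any chat-capable LLM could serve as the judge

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_tool_call(tool_name: str, arguments: dict, result: str) -> int:
    """Ask an LLM judge to rate an MCP tool invocation's result on a 1-5 scale."""
    prompt = (
        "You are grading the output of an MCP tool call.\n"
        f"Tool: {tool_name}\nArguments: {arguments}\nResult: {result}\n"
        "Rate how correct and useful the result is from 1 (bad) to 5 (excellent). "
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 1  # fall back to the lowest score if unparsable

score = judge_tool_call("get_weather", {"city": "Oslo"}, '{"temp_c": 4, "conditions": "rain"}')
print("judge score:", score)
```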
Evalica, your favourite evaluation toolkit
Benchmarking Large Language Models for FHIR
A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional methods.
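"Traditional" RAG evaluation usually means rank-based retrieval metrics such as recall@k and mean reciprocal rank (MRR), computed against labeled relevant documents. A minimal, library-agnostic sketch of those two metrics (the data shapes and function names are illustrative, not this library's API):

```python
# Classic retrieval metrics for RAG evaluation: recall@k and MRR.
# Inputs per query: the ranked list of retrieved document IDs and the
# set of gold (relevant) document IDs.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Aggregate over a small illustrative dataset of (retrieved, relevant) pairs.
dataset = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d4", "d8"}),
]
print("recall@3:", sum(recall_at_k(r, g, 3) for r, g in dataset) / len(dataset))
print("MRR:     ", sum(reciprocal_rank(r, g) for r, g in dataset) / len(dataset))
```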
Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.
Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025
An implementation of Anthropic's paper and essay, "A statistical approach to model evaluations".
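The central idea of that statistical approach is to treat each eval question as a sample and report an error bar (standard error of the mean) alongside the headline score, and to compare two models via paired per-question differences rather than their raw means. A minimal sketch of those two steps, assuming binary per-question scores (the paper goes further, e.g. clustered standard errors):

```python
import math

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean score with a z-based 95% confidence interval via the standard error of the mean."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    sem = math.sqrt(variance / n)                              # standard error of the mean
    return mean, mean - z * sem, mean + z * sem

# Per-question correctness (1 = correct, 0 = incorrect) for two models on the same questions.
model_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
model_b = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]

print("Model A (mean, lo, hi):", mean_with_ci(model_a))

# Compare models on the paired per-question differences, not the difference of the two means.
diffs = [a - b for a, b in zip(model_a, model_b)]
print("A - B   (mean, lo, hi):", mean_with_ci(diffs))
```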