# evals

The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.

TypeScript · 15942 stars · updated 1 minute ago

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.

Python · 4795 stars · updated 14 hours ago

Kiln-AI/Kiln
Python · 4056 stars · updated 14 minutes ago

Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

TypeScript · 2232 stars · updated 7 hours ago

Evaluate your LLM-powered apps with TypeScript

TypeScript · 788 stars · updated 25 days ago

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

Jupyter Notebook · 153 stars · updated 1 month ago

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

TypeScript · 107 stars · updated 2 days ago

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

TypeScript · 80 stars · updated 2 months ago

An MCP Evaluation Library

TypeScript · 42 stars · updated 15 hours ago

Benchmarking Large Language Models for FHIR

TypeScript · 39 stars · updated 1 month ago

A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional methods.

Python · 38 stars · updated 1 year ago

Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.

Go · 27 stars · updated 1 day ago

Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025

Jupyter Notebook · 26 stars · updated 4 months ago

An implementation of Anthropic's paper and essay "A statistical approach to model evaluations".

Python · 16 stars · updated 8 days ago
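
The core idea of that paper is to treat each eval question as an independent sample and report a standard-error-based confidence interval around accuracy rather than a bare point estimate. A minimal, generic sketch of that calculation follows (illustrative only; the function name and inputs are not taken from this repository's API):

```python
# Illustrative sketch, not the repo's API: a 95% confidence interval on
# eval accuracy via the standard error of the mean.
import math

def accuracy_with_ci(scores, z=1.96):
    """scores: per-question results (1 = correct, 0 = incorrect)."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance, then standard error of the mean.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(var / n)
    return mean, (mean - z * sem, mean + z * sem)

acc, (lo, hi) = accuracy_with_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
print(f"accuracy = {acc:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```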