evals

The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.

TypeScript
17049
2 hours ago

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

Python
4933
19 days ago

Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

TypeScript
2320
18 hours ago

Evaluate your LLM-powered apps with TypeScript

TypeScript
906
1 month ago

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

Jupyter Notebook
156
3 months ago

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

TypeScript
111
1 day ago

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

TypeScript
108
3 months ago

An MCP Evaluation Library

TypeScript
44
15 days ago

Benchmarking Large Language Models for FHIR

TypeScript
41
8 days ago

A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).

Python
39
1 year ago
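
For context, here is a minimal sketch of the kind of "traditional" retrieval metrics such a RAG evaluation library reports, recall@k and mean reciprocal rank, assuming ranked document IDs and gold relevance labels per query. The function names are illustrative and not taken from the repository above.

```python
# Hedged sketch of classic retrieval metrics; not this repository's API.
from typing import List, Sequence

def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for doc in relevant if doc in top_k) / len(relevant)

def mean_reciprocal_rank(all_retrieved: List[Sequence[str]],
                         all_relevant: List[Sequence[str]]) -> float:
    """Average of 1/rank of the first relevant document per query (0 if none found)."""
    scores = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in set(relevant):
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores) if scores else 0.0

# Example: two queries with ranked retrieval results and gold (relevant) doc IDs.
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant  = [["d1"],             ["d5"]]
print(recall_at_k(retrieved[0], relevant[0], k=3))   # 1.0
print(mean_reciprocal_rank(retrieved, relevant))     # (1/2 + 0) / 2 = 0.25
```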

Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.

Go
33
1 month ago

Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025

Jupyter Notebook
27
5 months ago

An implementation of Anthropic's paper/essay "A statistical approach to model evaluations"

Python
15
1 month ago
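
For context, here is a minimal sketch of the essay's central recommendation, reporting a CLT-based standard error and 95% confidence interval alongside the mean eval score rather than a bare accuracy number. This is illustrative only and is not code from the repository above.

```python
# Hedged sketch: error bars on an eval score via the central limit theorem.
import numpy as np

def mean_with_error_bars(scores: np.ndarray, z: float = 1.96):
    """Return (mean, standard error, 95% CI) for a vector of per-question scores."""
    n = len(scores)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    return mean, sem, (mean - z * sem, mean + z * sem)

# Example: 0/1 correctness scores on a hypothetical 200-question eval.
rng = np.random.default_rng(0)
scores = rng.binomial(1, 0.8, size=200).astype(float)
mean, sem, ci = mean_with_error_bars(scores)
print(f"accuracy = {mean:.3f} +/- {1.96 * sem:.3f} (95% CI {ci[0]:.3f} to {ci[1]:.3f})")
```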