evals
The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.
AI Observability & Evaluation
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
Evaluation and Tracking for LLM Experiments and AI Agents
Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
Evaluate your LLM-powered apps with TypeScript
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.
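The LLM-based scoring in an entry like this one follows the LLM-as-judge pattern: show a judge model the tool call and its result, and ask for a numeric rating against a rubric. Below is a minimal Python sketch of that pattern, not the Node.js package's actual implementation; the judge model, rubric wording, and score parsing are illustrative assumptions.

```python
import re
from openai import OpenAI  # assumption: any chat-capable LLM could serve as the judge

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_tool_call(tool_name: str, arguments: dict, result: str) -> int:
    """Ask an LLM judge to rate an MCP tool invocation's result on a 1-5 scale."""
    prompt = (
        "You are grading the output of an MCP tool call.\n"
        f"Tool: {tool_name}\nArguments: {arguments}\nResult: {result}\n"
        "Rate how correct and useful the result is from 1 (bad) to 5 (excellent). "
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 1  # fall back to the lowest score if unparsable

score = judge_tool_call("get_weather", {"city": "Oslo"}, '{"temp_c": 4, "conditions": "rain"}')
print("judge score:", score)
```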
Evalica, your favourite evaluation toolkit
Benchmarking Large Language Models for FHIR
A library for evaluating Retrieval-Augmented Generation (RAG) systems using traditional methods.
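"Traditional" RAG evaluation usually means rank-based retrieval metrics such as recall@k and mean reciprocal rank (MRR), computed against labeled relevant documents. A minimal, library-agnostic sketch of those two metrics (the data shapes and function names are illustrative, not this library's API):

```python
# Classic retrieval metrics for RAG evaluation: recall@k and MRR.
# Inputs per query: the ranked list of retrieved document IDs and the
# set of gold (relevant) document IDs.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Aggregate over a small illustrative dataset of (retrieved, relevant) pairs.
dataset = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d4", "d8"}),
]
print("recall@3:", sum(recall_at_k(r, g, 3) for r, g in dataset) / len(dataset))
print("MRR:     ", sum(reciprocal_rank(r, g) for r, g in dataset) / len(dataset))
```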
Go Artificial Intelligence (GAI) helps you work with foundational models, large language models, and other AI models.
Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025
An implementation of Anthropic's paper and essay, "A statistical approach to model evaluations".
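The central idea of that statistical approach is to treat each eval question as a sample and report an error bar (standard error of the mean) alongside the headline score, and to compare two models via paired per-question differences rather than their raw means. A minimal sketch of those two steps, assuming binary per-question scores (the paper goes further, e.g. clustered standard errors):

```python
import math

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean score with a z-based 95% confidence interval via the standard error of the mean."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    sem = math.sqrt(variance / n)                              # standard error of the mean
    return mean, mean - z * sem, mean + z * sem

# Per-question correctness (1 = correct, 0 = incorrect) for two models on the same questions.
model_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
model_b = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]

print("Model A (mean, lo, hi):", mean_with_ci(model_a))

# Compare models on the paired per-question differences, not the difference of the two means.
diffs = [a - b for a, b in zip(model_a, model_b)]
print("A - B   (mean, lo, hi):", mean_with_ci(diffs))
```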