
# evaluation-framework

A framework for few-shot evaluation of language models.

Python · 9863 stars · updated 4 days ago
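
Judging by the description, language, and star count, this entry appears to be EleutherAI's lm-evaluation-harness. As a purely illustrative sketch, a five-shot run through that library's Python entry point could look roughly like the code below; the `simple_evaluate` call, its argument names, and the model and task choices are assumptions drawn from that project's public API, not from anything stated in this listing.

```python
# Minimal sketch of a few-shot evaluation run, assuming this entry is
# EleutherAI's lm-evaluation-harness; names below follow its public API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF checkpoint
    tasks=["hellaswag"],                             # any registered task name
    num_fewshot=5,                                   # few-shot examples per prompt
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, ...) are keyed by task name.
print(results["results"]["hellaswag"])
```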

Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.

TypeScript · 8051 stars · updated 3 minutes ago

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Python · 1826 stars · updated 9 hours ago

The repository for our RecSys 2019 article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and several follow-up studies.

Python · 986 stars · updated 2 years ago

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

Python · 385 stars · updated 41 minutes ago

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

Python · 317 stars · updated 1 month ago
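
The repository behind this entry is not named in the listing, so rather than guess at its SDK, here is a self-contained, purely hypothetical sketch of the kind of metric such toolkits compute: a naive faithfulness-style score that measures how much of a generated answer is supported by the retrieved context. Every name below is made up for illustration; real libraries typically use more robust, often LLM-judged, metrics.

```python
# Hypothetical example: a naive "faithfulness" heuristic for RAG responses.
# It scores the fraction of answer tokens that also occur in the retrieved
# context; it is illustrative only and not taken from any particular library.
import re


def _tokens(text: str) -> set[str]:
    """Lower-case alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap_faithfulness(answer: str, contexts: list[str]) -> float:
    """Share of answer tokens found anywhere in the retrieved contexts."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(c) for c in contexts)) if contexts else set()
    return len(answer_tokens & context_tokens) / len(answer_tokens)


if __name__ == "__main__":
    contexts = ["The Eiffel Tower was completed in 1889 in Paris."]
    print(token_overlap_faithfulness("It was completed in 1889.", contexts))  # high
    print(token_overlap_faithfulness("It opened in 1920 in Rome.", contexts))  # low
```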

Python SDK for running evaluations on LLM-generated responses.

Python · 292 stars · updated 2 months ago

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

Python · 265 stars · updated 15 days ago

A research library for automating experiments on Deep Graph Networks

Python · 223 stars · updated 13 days ago

Svelte · 216 stars · updated 2 years ago

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the code-generation quality of LLMs.

Go · 180 stars · updated 3 months ago

Python · 179 stars · updated 1 year ago

Expressive is a cross-platform expression parsing and evaluation framework. Cross-platform support comes from targeting .NET Standard, so it runs on practically any platform.

C# · 172 stars · updated 1 year ago

A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

Python · 168 stars · updated 4 days ago

Test and evaluate LLMs and model configurations across all the scenarios that matter for your application.

TypeScript · 160 stars · updated 1 year ago

MedEvalKit: A Unified Medical Evaluation Framework

Python · 128 stars · updated 2 days ago