evaluation-framework

A framework for few-shot evaluation of language models (a minimal sketch of the few-shot setup follows this entry).

Python · 8664 stars · updated 1 day ago
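As a rough illustration of what "few-shot evaluation" involves (not this framework's actual API, which the listing does not show), the sketch below builds a k-shot prompt from solved examples and scores a stand-in model by exact match; the toy model and tasks are hypothetical.

```python
# Illustrative only: the shape of a k-shot evaluation loop. `model` is a
# stand-in callable, not any particular framework's API.
from typing import Callable, List, Tuple

def build_prompt(shots: List[Tuple[str, str]], question: str) -> str:
    """Prepend k solved examples to the test question."""
    lines = [f"Q: {q}\nA: {a}" for q, a in shots]
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

def evaluate(model: Callable[[str], str],
             shots: List[Tuple[str, str]],
             test_set: List[Tuple[str, str]]) -> float:
    """Exact-match accuracy of the model's completions on the test set."""
    correct = 0
    for question, gold in test_set:
        prediction = model(build_prompt(shots, question)).strip()
        correct += prediction == gold
    return correct / len(test_set)

# Dummy "model" so the sketch runs end to end.
toy_model = lambda prompt: "4" if "2 + 2" in prompt else "?"
shots = [("1 + 1", "2"), ("3 + 5", "8")]
print(evaluate(toy_model, shots, [("2 + 2", "4"), ("6 + 7", "13")]))  # 0.5
```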

Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.

TypeScript · 6232 stars · updated 17 hours ago

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Python · 1436 stars · updated 2 days ago

This is the repository for our RecSys 2019 article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and for several follow-up studies.

Python · 987 stars · updated 2 years ago

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

Python · 301 stars · updated 1 day ago

Metrics to evaluate the quality of responses from your Retrieval-Augmented Generation (RAG) applications (a toy illustration follows this entry).

Python · 294 stars · updated 5 months ago
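For context on what RAG metrics of this kind consume (question, answer, retrieved contexts), here is a deliberately naive token-overlap "faithfulness" proxy. It is purely illustrative and not this library's actual metric; tools in this space typically use LLM-judged scoring.

```python
# Illustrative only: fraction of answer tokens that also appear in the
# retrieved context, as a crude stand-in for a faithfulness metric.
import re

def token_overlap_faithfulness(answer: str, contexts: list) -> float:
    answer_tokens = set(re.findall(r"\w+", answer.lower()))
    context_tokens = set(re.findall(r"\w+", " ".join(contexts).lower()))
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# Hypothetical evaluation record.
sample = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "contexts": ["Paris is the capital and largest city of France."],
}
print(token_overlap_faithfulness(sample["answer"], sample["contexts"]))  # 1.0
```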

Python SDK for running evaluations on LLM-generated responses.

Python · 277 stars · updated 4 days ago

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

Python · 226 stars · updated 3 days ago

A research library for automating experiments on Deep Graph Networks

Python · 221 stars · updated 7 months ago
Svelte · 215 stars · updated 2 years ago (no description)

Python · 177 stars · updated 7 months ago (no description)

Expressive is a cross-platform expression parsing and evaluation framework. Cross-platform support comes from targeting .NET Standard, so it runs on practically any platform (a concept sketch follows this entry).

C# · 168 stars · updated 7 months ago
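To illustrate the general idea of expression parsing and evaluation (a Python concept sketch, not Expressive's C# API), the snippet below walks a parsed AST and evaluates basic arithmetic with variables instead of calling eval().

```python
# Concept sketch: safe arithmetic expression evaluation via the Python AST.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def evaluate(expression, variables=None):
    variables = variables or {}

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval"))

print(evaluate("1 + 2 * x", {"x": 10}))  # 21
```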

DevQualityEval: an evaluation benchmark 📈 and framework to compare and evolve the quality of LLM code generation (a simplified example follows this entry).

Go · 167 stars · updated 11 days ago
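The general shape of a code-generation quality check is easy to sketch: run the model's candidate solution against a unit test and count passes. The example below is a simplified stand-in written in Python, not DevQualityEval's actual Go harness; the fizzbuzz task and its test are hypothetical.

```python
# Illustrative only: execute a candidate solution plus its test in a subprocess
# and treat a zero exit code as a pass.
import os
import subprocess
import sys
import tempfile
import textwrap

def passes_unit_test(candidate_source: str, test_source: str) -> bool:
    program = candidate_source + "\n\n" + test_source
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
        return result.returncode == 0
    finally:
        os.unlink(path)

# Hypothetical task: the model was asked to implement fizzbuzz.
candidate = textwrap.dedent("""
    def fizzbuzz(n):
        if n % 15 == 0: return "FizzBuzz"
        if n % 3 == 0: return "Fizz"
        if n % 5 == 0: return "Buzz"
        return str(n)
""")
test = textwrap.dedent("""
    assert fizzbuzz(3) == "Fizz"
    assert fizzbuzz(10) == "Buzz"
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(7) == "7"
""")

print("pass" if passes_unit_test(candidate, test) else "fail")
```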

Test and evaluate LLMs and model configurations across all the scenarios that matter for your application.

TypeScript · 156 stars · updated 8 months ago

Evaluation suite for large-scale language models.

Python · 125 stars · updated 4 years ago

A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

Python · 122 stars · updated 9 hours ago