#llm-as-a-judge

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

TypeScript
2606
2 days ago

Evaluate your LLM's response with Prometheus and GPT4 💯

Python
908
1 month ago

🤠 Agent-as-a-Judge and DevAI dataset

Python
400
3 months ago

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Python
165
2 months ago

CodeUltraFeedback: aligning large language models to coding preferences

Python
71
10 months ago

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Python
62
3 days ago

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

Jupyter Notebook
43
2 months ago

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

Python
21
6 months ago

The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

Python
20
2 months ago

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

Jupyter Notebook
20
1 year ago

Harnessing Large Language Models for Curated Code Reviews

Python
12
1 month ago

A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.

Python
8
7 months ago

A set of examples demonstrating how to evaluate generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques

Jupyter Notebook
8
7 months ago

The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?

Python
7
2 months ago

A set of tools to create synthetically-generated data from documents

Python
6
5 days ago

LLM-as-judge evals as Semantic Kernel Plugins

C#
6
3 months ago

MCP for Root Signals Evaluation Platform

Python
5
3 days ago

Controversial Questions for Argumentation and Retrieval

Python
4
4 months ago