llm-as-a-judge

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Python
3091
12 days ago

Evaluate your LLM's response with Prometheus and GPT4 💯

Python
979
4 months ago

⚖️ The First Coding Agent-as-a-Judge

Python
601
3 months ago

Inference-time scaling for LLMs-as-a-judge.

Jupyter Notebook
276
1 month ago

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Python
176
6 months ago

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Python
127
4 months ago

CodeUltraFeedback: aligning large language models to coding preferences

Python
71
1 year ago

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

Jupyter Notebook
46
3 months ago

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

Python
40
22 days ago

Solving Inequality Proofs with Large Language Models.

Python
38
2 days ago

A set of tools to create synthetically-generated data from documents

Python
25
5 days ago

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

Python
23
10 months ago

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

Jupyter Notebook
21
2 years ago

The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

Python
20
6 months ago

Harnessing Large Language Models for Curated Code Reviews

Python
14
5 months ago

A set of examples demonstrating how to evaluate Generative AI augmented systems using traditional information retrieval and LLM-As-A-Judge validation techniques

Jupyter Notebook
9
1 year ago

LLM-as-judge evals as Semantic Kernel Plugins

C#
8
3 months ago

A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.

Python
8
1 year ago