
llm-as-a-judge


The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

llm-tools prompt-engineering prompt-management llm-evaluation llm-framework rag-evaluation llm-observability llm-as-a-judge llm-monitoring llm-platform llm-playground llmops-platform

Python

3091

360

12 days ago

prometheus-eval / prometheus-eval

Evaluate your LLM's response with Prometheus and GPT4 💯

evaluation litellm llm llmops Python vllm gpt4 llm-as-a-judge

Python

979

4 months ago

metauto-ai / agent-as-a-judge

⚖️ The First Coding Agent-as-a-Judge

llm-as-a-judge llm

Python

601

3 months ago

haizelabs / verdict

Inference-time scaling for LLMs-as-a-judge.

llm llm-as-a-judge

Jupyter Notebook

276

1 month ago

IAAR-Shanghai / xFinder

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

evaluation gpt llm large-language-models Regular expression reliability benchmark dataset chatglm phi qwen llm-as-a-judge

Python

176

6 months ago

IAAR-Shanghai / xVerify

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

llm-as-a-judge benchmark evaluation Regular expression reliability ChatGPT llm open-r1

Python

127

4 months ago

martin-wey / CodeUltraFeedback

CodeUltraFeedback: aligning large language models to coding preferences

alignment code-generation dpo large-language-models llm-as-a-judge

Python

1 year ago

KID-22 / LLM-IR-Bias-Fairness-Survey

This is the repo for the survey of Bias and Fairness in IR with LLMs.

bias fairness information-retrieval large-language-models recommender-systems ChatGPT llm llm-as-a-judge

4 months ago

MJ-Bench / MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

llm-as-a-judge

Jupyter Notebook

3 months ago

whitecircle-ai / circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

ai benchmark llm large-language-models llm-eval llm-evaluation guardrails benchmarking guardrail jailbreak llm-as-a-judge llm-security

Python

22 days ago

lupantech / ineqmath

Solving Inequality Proofs with Large Language Models.

llm-as-a-judge llm theorem-proving

Python

2 days ago

docling-project / docling-sdg

A set of tools to create synthetically-generated data from documents

ai documents llm-as-a-judge question-answering sdg

Python

5 days ago

zhaochen0110 / Timo

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

llm-as-a-judge llm rlhf

Python

10 months ago

minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

bias evaluation llm nlp bias-detection llm-as-a-judge llm-evaluation

Jupyter Notebook

2 years ago

PKU-ONELab / Themis

The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

evaluation llm-as-a-judge nlg

Python

6 months ago

OussamaSghaier / CuREV

Harnessing Large Language Models for Curated Code Reviews

code-review large-language-models llm-as-a-judge

Python

5 months ago

root-signals / rs-sdk

Root Signals SDK

evaluation llm llm-as-a-judge observability evals

Python

2 days ago

aws-samples / genai-system-evaluation

A set of examples demonstrating how to evaluate Generative AI augmented systems using traditional information retrieval and LLM-As-A-Judge validation techniques

genai generative-ai information-retrieval llm-as-a-judge llm-evaluation

Jupyter Notebook

1 year ago

HillPhelmuth / LlmAsJudgeEvalPlugins

LLM-as-judge evals as Semantic Kernel Plugins

llm-as-a-judge llm-evaluation semantickernel

3 months ago

UMass-Meta-LLM-Eval / llm_eval

A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.

large-language-models llm-as-a-judge nlp

Python

1 year ago