Repository navigation

llm-evaluation-framework

Website
Wikipedia

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python

10237

883

3 小时前

promptfoo / promptfoo

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

大语言模型 prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework 持续集成 CI/CD pentesting red-teaming vulnerability-scanners

TypeScript

8050

658

几秒前

msoedov / agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

llm-security ai-red-team llm-evaluation llm-evaluation-framework prompt-testing agent-framework

Python

1619

251

5 天前

JinjieNi / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation evaluation-framework foundation-models 大语言模型 large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference

Python

244

9 个月前

cvs-health / langfair

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

人工智能 bias bias-detection fairness fairness-ai fairness-ml fairness-testing large-language-models 大语言模型 responsible-ai Python ai-safety llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python

226

8 小时前

Addepto / contextcheck

MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via YAML and integrable into CI pipelines for automated testing.

大语言模型 llm-evaluation rag Testing chatbot-framework Open Source ai-chat ai-testing-tool large-language-models 持续集成 llm-evaluation-framework

Python

8 个月前

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

大语言模型 llm-evaluation llm-tools llmops llm-eval llm-evaluation-framework prompt-engineering generative-ai good-first-issue 监控

Python

6 个月前

zli12321 / qa_metrics

An easy python package to run quick basic QA evaluations. This package includes standardized QA evaluation metrics and semantic evaluation metrics: Black-box and Open-Source large language model prompting and evaluation, exact match, F1 Score, PEDANT semantic match, transformer match. Our package also supports prompting OPENAI and Anthropic API.

大语言模型 llm-evaluation llm-evaluation-framework

Python

1 个月前