# llm-evaluation-framework

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

TypeScript · 6,232 stars · updated 17 hours ago
msoedov/agentic_security
Python · 1,296 stars · updated 4 days ago

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Python · 76 stars · updated 2 months ago

MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

Python · 64 stars · updated 4 months ago

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Python · 36 stars · updated 9 months ago

Benchmarking Large Language Models for FHIR

29 stars · updated 5 months ago

FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.

Python · 18 stars · updated 6 months ago

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

Jupyter Notebook · 9 stars · updated 6 months ago

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

Jupyter Notebook · 7 stars · updated 3 months ago

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

TypeScript · 4 stars · updated 3 months ago

Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.

Python · 3 stars · updated 7 months ago

Multilingual Evaluation Toolkits

Python · 3 stars · updated 5 months ago

Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

Jupyter Notebook · 2 stars · updated 9 months ago

Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models.

1 star · updated 7 months ago
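
Several entries above (the first listing and the Promptfoo sample project) center on declarative test suites run from the command line or CI. As a rough illustration of the pattern, here is a minimal sketch using promptfoo's Node API. The `evaluate()` entry point, provider id, and `icontains` assertion type follow promptfoo's documented usage, but treat the exact option names and result shape as assumptions and verify them against the project's docs before relying on them.

```typescript
import promptfoo from 'promptfoo';

async function main() {
  // Declarative test suite: prompts x providers x test cases.
  const summary = await promptfoo.evaluate(
    {
      prompts: ['Answer in one sentence: what is {{topic}}?'],
      providers: ['openai:gpt-4o-mini'], // assumed provider id; any supported provider works
      tests: [
        {
          vars: { topic: 'retrieval-augmented generation' },
          assert: [
            // Simple string assertion; promptfoo also supports model-graded checks.
            { type: 'icontains', value: 'retrieval' },
          ],
        },
      ],
    },
    { maxConcurrency: 2 }, // assumed option; limits parallel provider calls
  );

  // Each result records prompt, provider, output, and pass/fail status
  // (field names assumed from the documented summary object), so a CI job
  // can exit non-zero when any assertion fails.
  const failed = summary.results.filter((r) => !r.success);
  if (failed.length > 0) {
    console.error(`${failed.length} eval test(s) failed`);
    process.exit(1);
  }
  console.log('All LLM eval tests passed');
}

main();
```

The same suite can equally be written as a `promptfooconfig.yaml` and run with `npx promptfoo eval`, which is how the CLI and CI/CD integration mentioned in the first entry is typically wired up.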