# llm-evaluation-framework

Test your prompts, agents, and RAGs. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.

TypeScript · 8050 stars · updated a few seconds ago
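
The declarative-config idea in the entry above boils down to: describe prompts, providers, and assertions as data, and let a runner turn them into pass/fail results. Below is a minimal sketch of that shape, assuming a hypothetical config dict and a stubbed `call_provider`; none of it is any listed tool's actual schema or API.

```python
# Minimal sketch of a declarative prompt-eval runner.
# The config keys and the call_provider stub are hypothetical.
from typing import Dict, List

CONFIG = {
    "prompts": ["Summarize in one sentence: {text}"],
    "providers": ["echo"],  # stand-in for real model backends
    "tests": [
        {"vars": {"text": "LLMs need systematic evaluation."},
         "assert": {"contains": "evaluation"}},
    ],
}

def call_provider(provider: str, prompt: str) -> str:
    """Stub: replace with a real API call (OpenAI, Anthropic, a local model, ...)."""
    return prompt  # the 'echo' provider just returns the rendered prompt

def run_eval(config: Dict) -> List[Dict]:
    """Expand every prompt x provider x test case into a pass/fail record."""
    results = []
    for prompt in config["prompts"]:
        for provider in config["providers"]:
            for test in config["tests"]:
                rendered = prompt.format(**test["vars"])
                output = call_provider(provider, rendered)
                passed = test["assert"]["contains"] in output
                results.append({"provider": provider, "prompt": rendered, "pass": passed})
    return results

if __name__ == "__main__":
    for r in run_eval(CONFIG):
        print("PASS" if r["pass"] else "FAIL", r["provider"], "-", r["prompt"])
```
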
msoedov/agentic_security
Python · 1619 stars · updated 5 days ago

MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

Python · 81 stars · updated 8 months ago
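
This framework, like several others in the list, is meant to sit inside a CI pipeline. One common pattern, sketched here with assumed names (`run_suite` is a stand-in and the 0.75 threshold is arbitrary, not this framework's API), is a test that fails the build when the eval pass rate regresses:

```python
# Sketch: gate a CI pipeline on an eval pass rate.
# run_suite and the threshold are illustrative, not any framework's API.
def run_suite():
    """Stub for 'run all configured eval cases'; returns per-case pass/fail."""
    return [True, True, True, False, True]

def test_eval_pass_rate():
    results = run_suite()
    pass_rate = sum(results) / len(results)
    # A test runner such as pytest turns this assertion into a failed build.
    assert pass_rate >= 0.75, f"eval pass rate {pass_rate:.2f} below threshold"

if __name__ == "__main__":
    test_eval_pass_rate()
    print("eval gate passed")
```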

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Python · 78 stars · updated 6 months ago

An easy Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.

Python · 52 stars · updated 1 month ago
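
Exact match and F1 are standard QA metrics, so they can be illustrated generically. The sketch below reimplements SQuAD-style normalized exact match and token-level F1; it is not this package's API, just the usual definitions.

```python
# Generic QA metrics: normalized exact match and token-level F1
# (SQuAD-style definitions; a plain reimplementation, not any package's API).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1_score("Paris, France", "Paris"), 2))     # 0.67
```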

Benchmarking Large Language Models for FHIR

TypeScript · 39 stars · updated 1 month ago

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Python · 37 stars · updated 1 year ago

FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your own data, tasks, and prompts.

Python · 18 stars · updated 10 months ago
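
The core of any such leaderboard is aggregating per-task scores and ranking models by the aggregate. A minimal sketch with made-up model names and scores (none of this is FM-Leaderboard-er's actual interface):

```python
# Minimal leaderboard: average per-task scores and rank the models.
# Model names and numbers are made up, purely for illustration.
scores = {
    "model-a": {"task1": 0.81, "task2": 0.62},
    "model-b": {"task1": 0.77, "task2": 0.70},
    "model-c": {"task1": 0.69, "task2": 0.58},
}

leaderboard = sorted(
    ((name, sum(tasks.values()) / len(tasks)) for name, tasks in scores.items()),
    key=lambda row: row[1],
    reverse=True,
)

for rank, (name, avg) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {avg:.3f}")
```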

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

Jupyter Notebook · 9 stars · updated 10 months ago
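
Prediction-powered ranking builds on prediction-powered inference: plentiful model-predicted labels give a cheap estimate, and a small human-labelled subset supplies a bias correction. The point estimate is easy to sketch on synthetic data; the full method also yields confidence intervals and pairwise rankings, both omitted here, and this is not the authors' code.

```python
# Prediction-powered point estimate of a win rate: many cheap judge labels,
# bias-corrected by a small human-labelled subset. Synthetic data only; in
# practice the human and judge labels on the audited subset are paired on
# the same comparisons, while here they are drawn independently.
import random

random.seed(0)

# Judge (e.g., an LLM) labels on a large unlabeled pool: True = "model wins".
judge_unlabeled = [random.random() < 0.60 for _ in range(5000)]

# Human and judge labels on a small audited subset.
human_labeled = [random.random() < 0.55 for _ in range(200)]
judge_on_labeled = [random.random() < 0.60 for _ in range(200)]

# Cheap estimate from the judge, plus a rectifier from the human subset.
judge_mean = sum(judge_unlabeled) / len(judge_unlabeled)
rectifier = sum(h - j for h, j in zip(human_labeled, judge_on_labeled)) / len(human_labeled)

print(f"prediction-powered win-rate estimate: {judge_mean + rectifier:.3f}")
```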

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

Jupyter Notebook · 7 stars · updated 3 months ago

Estimates a confidence measure that outputs generated by large language models are not hallucinations.

Python · 5 stars · updated 13 days ago

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

TypeScript · 4 stars · updated 7 months ago

Multilingual Evaluation Toolkits

Python · 4 stars · updated 9 months ago

Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

Jupyter Notebook · 3 stars · updated 12 days ago
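
The multi-problem paradigm named in this last entry packs several questions into a single prompt and parses the answers back out. Below is a toy sketch of that packing and parsing, with an assumed numbered-answer format and a stubbed `ask_model`; it is not the paper's actual protocol.

```python
# Toy sketch of multi-problem prompting: bundle several questions into one
# prompt and split the numbered answers back out. The prompt format and the
# ask_model stub are assumptions, not the cited paper's protocol.
import re

def ask_model(prompt: str) -> str:
    """Stub for a real LLM call; returns canned numbered answers."""
    return "1. 4\n2. Paris\n3. 9"

def evaluate_batch(questions, references):
    # Pack all questions into one numbered prompt.
    prompt = "Answer each question, numbering your answers:\n" + "\n".join(
        f"{i}. {q}" for i, q in enumerate(questions, start=1)
    )
    reply = ask_model(prompt)
    # Parse "N. answer" lines back into a number -> answer map.
    answers = {int(m.group(1)): m.group(2).strip()
               for m in re.finditer(r"^(\d+)\.\s*(.+)$", reply, flags=re.M)}
    correct = sum(answers.get(i, "") == ref for i, ref in enumerate(references, start=1))
    return correct / len(references)

print(evaluate_batch(
    ["What is 2 + 2?", "What is the capital of France?", "What is 3 * 3?"],
    ["4", "Paris", "9"],
))  # 1.0
```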