llm-eval

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

TypeScript
6232
17 hours ago

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

Python
2257
8 months ago

Python SDK for running evaluations on LLM generated responses

Python
277
4 days ago

Generate ideal question-answers for testing RAG

Python
126
2 months ago

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

Python
85
1 year ago

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Python
76
2 months ago

A benchmark comparing Russian ChatGPT counterparts: Saiga, YandexGPT, Gigachat

Jupyter Notebook
61
2 years ago

🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.

Python
36
3 months ago

An open-source project for comparing two LLMs head to head on a given prompt. This section covers the project's backend, which lets LLM APIs be integrated and used by the front end.

Python
20
3 days ago

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

Jupyter Notebook
9
6 months ago

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

Jupyter Notebook
7
3 months ago

The prompt engineering, prompt management, and prompt evaluation tool for Python

Python
7
7 months ago

The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.

TypeScript
6
7 months ago