Repository navigation

llm-eval

Website
Wikipedia

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

大语言模型 prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework 持续集成 CI/CD pentesting red-teaming vulnerability-scanners

TypeScript

8050

658

几秒前

Arize-ai / phoenix

AI Observability & Evaluation

llmops ai-monitoring ai-observability llm-eval datasets agents 大语言模型 prompt-engineering anthropic evals llm-evaluation openai langchain llamaindex smolagents

Jupyter Notebook

6693

534

16 分钟前

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

mlops ml-validation ml-testing llmops responsible-ai fairness-ai llm-eval llm-evaluation rag-evaluation ai-security llm-security ai-red-team red-team-tools 大语言模型

Python

4812

349

2 天前

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

机器学习 neural-networks explainable-ml llmops ai-monitoring ai-observability evals llm-evaluation 大语言模型 ai-agents llm-eval agentops

Python

2716

220

7 分钟前

iterative / datachain

ETL, Analytics, Versioning for Unstructured Data

人工智能 cv data-wrangling 大语言模型 llm-eval multimodal data-analytics embeddings mlops 机器学习

Python

2621

117

1 小时前

uptrain-ai / uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

机器学习 experimentation llm-prompting llmops 监控 prompt-engineering evaluation llm-eval

Python

2312

198

1 年前

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-framework evaluation-metrics llm-eval llm-evaluation llm-ops llmops

Python

292

2 个月前

Re-Align / just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

evaluation gpt4 大语言模型 llm-eval llm-evaluation

Python

2 年前

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

大语言模型 llm-evaluation llm-tools llmops llm-eval llm-evaluation-framework prompt-engineering generative-ai good-first-issue 监控

Python

6 个月前

kuk / rulm-sbs2

Бенчмарк сравнивает русские аналоги ChatGPT: Saiga, YandexGPT, Gigachat

llm-eval

Jupyter Notebook

2 年前

multinear / multinear

Develop reliable AI apps

evaluation 大语言模型 reliability llm-eval llm-evaluation llm-evaluation-framework

Python

2 个月前

whitecircle-ai / circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

人工智能 benchmark 大语言模型 large-language-models llm-eval llm-evaluation guardrails benchmarking guardrail jailbreak llm-as-a-judge llm-security

Python

22 天前

Auto-Playground / ragrank

🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it understands context, its tone, and more. This helps you see how good your LLM applications are.

evaluation language-model 大语言模型 llm-eval llmops 机器学习 prompt-engineering rag

Python

2 个月前

alan-turing-institute / prompto

An open source library for asynchronous querying of LLM endpoints

hut23 large-language-models llm-eval llm-evaluation 大语言模型 transformers 深度学习机器学习自然语言处理 Python transformer

Python

1 个月前

Supahands / llm-comparison-backend

This is an opensource project allowing you to compare two LLM's head to head with a given prompt, this section will be regarding the backend of this project, allowing for llm api's to be incorporated and used in the front-end

人工智能 ChatGPT 大语言模型 llm-eval

Python

1 个月前

genia-dev / vibraniumdome

LLM Security Platform.

adversarial-attacks ChatGPT 大语言模型 openai prompt-injection 安全 llm-agent llm-security llmops prompt-engineering prompts llm-framework llm-inference llm-serving llm-evaluation llm-eval

Python

10 个月前

honeyhiveai / realign

Realign is a testing and simulation framework for AI applications.

人工智能 alignment evaluation 大语言模型 prompt-engineering red-teaming Simulation llm-eval llm-evaluation llm-evaluation-framework llmops rag

Python

9 个月前

attogram / ollama-multirun

Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance statistics and model info. All in a single Bash shell script.

人工智能 ollama llm-evaluation ollama-interface Bash ollama-app llm-evaluation-metrics llm-eval static-site-generator

Shell

1 天前