llm-evaluation

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

analytics LLM llmops large-language-models openai Self-hosted ycombinator Monitoring observability Open Source langchain llama-index evaluation prompt-engineering prompt-management playground llm-evaluation llm-observability autogen

TypeScript

10501

955

2 hours ago

comet-ml / opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Open Source langchain openai playground prompt-engineering llama-index LLM llm-evaluation llm-observability llmops

Python

6587

475

15 hours ago

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

LLM prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework Continuous Integration CI/CD pentesting red-teaming vulnerability-scanners

TypeScript

6232

511

17 hours ago

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python

6008

523

5 hours ago

Arize-ai / phoenix

AI Observability & Evaluation

llmops ai-monitoring ai-observability llm-eval datasets agents llms prompt-engineering anthropic evals llm-evaluation openai langchain llamaindex

Jupyter Notebook

5417

398

1 hour ago

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for AI & LLM systems

mlops ml-validation ml-testing llmops responsible-ai fairness-ai llm-eval llm-evaluation rag-evaluation ai-security llm-security ai-red-team red-team-tools LLM

Python

4477

317

1 day ago

NVIDIA / garak

the LLM vulnerability scanner

AI llm-evaluation llm-security security-scanners vulnerability-assessment

Python

4303

421

9 hours ago

Marker-Inc-Korea / AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation LLM llm-evaluation llm-ops Open Source ops optimization pipeline Python qa rag rag-evaluation retrieval-augmented-generation

Python

3833

305

2 months ago

Helicone / helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

large-language-models prompt-engineering agent-monitoring analytics evaluation gpt langchain llama-index LLM llm-cost llm-evaluation llm-observability llmops Monitoring Open Source openai playground prompt-management ycombinator

TypeScript

3621

364

29 minutes ago

PacktPublishing / LLM-Engineers-Handbook

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

genai LLM llmops mlops rag Amazon Web Services fine-tuning-llm llm-evaluation ml-system-design

Python

3136

651

1 month ago

Agenta-AI / agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

llm-tools prompt-engineering prompt-management llm-evaluation llm-framework rag-evaluation llm-observability llm-as-a-judge llm-monitoring llm-platform llm-playground llmops-platform

TypeScript

2599

305

1 day ago

lmnr-ai / lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

aiops developer-tools observability agents AI rag Rust analytics llm-evaluation llm-observability Monitoring Open Source Self-hosted ai-observability llmops evals evaluation

TypeScript

1861

113

6 hours ago

msoedov / agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

llm-security ai-red-team llm-evaluation llm-evaluation-framework prompt-testing agent-framework

Python

1296

204

4 days ago

microsoft / prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

generative-ai llm-evaluation llms promptengineering

Python

859

2 days ago

cyberark / FuzzyAI

A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

jailbreak jailbreaking LLM llms AI Security Fuzzing/Fuzz testing llm-evaluation llm-security

Jupyter Notebook

517

17 days ago

onejune2018 / Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of foundation LLMs, aiming to explore the technical boundaries of generative AI.

benchmark bert chatglm ChatGPT dataset evaluation gpt3 large-language-model leaderboard LLM Machine Learning NLP openai llama llm-evaluation qwen rag

513

6 months ago