evaluation
An open-source developer platform for building AI/LLM applications and models with confidence, offering end-to-end tracking, observability, and evaluation in one place.
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
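As a sketch of how this kind of tracing is wired in, Langfuse's Python SDK documents an `observe` decorator that records a function call as a trace (the import path varies by SDK version, and the function below is a stand-in for a real LLM call):

```python
# Minimal Langfuse tracing sketch; assumes LANGFUSE_PUBLIC_KEY and
# LANGFUSE_SECRET_KEY are set in the environment.
from langfuse import observe  # older v2 SDKs: from langfuse.decorators import observe

@observe()  # records this call as a trace, capturing inputs and outputs
def answer(question: str) -> str:
    # Stand-in for a real LLM call; nested calls would appear as spans.
    return f"echo: {question}"

print(answer("What does Langfuse trace?"))
```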
🤘 awesome-semantic-segmentation
Supercharge Your LLM Application Evaluations 🚀
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM/VLM!
Test your prompts, agents, and RAG pipelines. AI red-teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration (see the config sketch below).
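To illustrate the declarative style, a config of this kind pairs prompts and providers with per-test assertions; the sketch below follows promptfoo's documented `promptfooconfig.yaml` layout, with placeholder prompt text, model IDs, and assertion values:

```yaml
# Illustrative promptfooconfig.yaml sketch; models and values are placeholders.
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-latest
tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      - type: contains
        value: "fox"
```

Running `npx promptfoo eval` against such a file evaluates every provider on every test case, which is what makes the command-line and CI/CD integration straightforward.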
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Next-generation AI agent optimization platform: CozeLoop provides full-lifecycle management for AI agents, from development and debugging through evaluation to monitoring.
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
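This description matches Helicone's README. For proxy-style observability tools of this kind, the advertised one line is typically a base-URL swap on the existing OpenAI client; the gateway URL and auth header below follow Helicone's documented pattern but should be treated as an assumption:

```python
import os
from openai import OpenAI

# Route OpenAI traffic through the observability gateway. The base_url
# swap is the "one line"; the extra header authenticates your account.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```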
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
Python package for the evaluation of odometry and SLAM
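As a sketch of a typical evo workflow (based on its documented Python API; the trajectory file paths are placeholders): load two TUM-format trajectories, associate them by timestamp, and compute the absolute pose error (APE):

```python
# Sketch using evo's documented Python API; file paths are placeholders.
from evo.core import metrics, sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("ground_truth.txt")
traj_est = file_interface.read_tum_trajectory_file("estimate.txt")

# Match poses between the two trajectories by timestamp.
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)

# Absolute pose error on the translation component.
ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
print("APE RMSE:", ape.get_statistic(metrics.StatisticsType.rmse))
```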
Arbitrary expression evaluation for Go
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
SuperCLUE: A Comprehensive Benchmark for General-Purpose Chinese Foundation Models
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
End-to-end automatic speech recognition for Mandarin and English in TensorFlow
An open-source visual programming environment for battle-testing prompts to LLMs.