# evaluation-metrics

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI.

Python
4795
2 hours ago

"A White-Box Guide to Building Large Models" (《大模型白盒子构建指南》): Tiny-Universe, built entirely from scratch by hand

Jupyter Notebook
3564
2 days ago

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Python
1826
10 hours ago

(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

Python
1773
1 year ago

Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!

Jupyter Notebook
1540
7 months ago

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

Jupyter Notebook
836
1 year ago

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

Python
782
1 year ago

Evaluate your speech-to-text system with similarity measures such as word error rate (WER)

Python
775
6 months ago
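As a reference point for what such a toolkit computes, word error rate is the word-level Levenshtein distance between reference and hypothesis, normalized by reference length. A minimal sketch in plain Python (the `wer` function name is illustrative, not this library's API):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of a 6-word reference ≈ 0.167
score = wer("the cat sat on the mat", "the cat sat on mat")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than a bounded similarity score.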

📈 Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.

Python
622
1 year ago
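Several of these metrics are short closed-form expressions; PSNR, for example, is just a log-scaled inverse of the mean squared error. A standalone NumPy sketch (illustrative, not this package's API):

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Constant error of 10 intensity levels against an 8-bit range ≈ 28.1 dB
clean = np.zeros((8, 8))
noisy = np.full((8, 8), 10.0)
score = psnr(clean, noisy)
```

Higher PSNR means the images are closer; identical images yield infinity, which is why libraries often report it alongside bounded metrics such as SSIM.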

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

Python
582
13 days ago
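Ranking evaluation typically centers on metrics such as NDCG, which discounts graded relevance by rank position and normalizes against the ideal ordering. A minimal pure-Python sketch (illustrative, not this library's interface):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_rels):
    """NDCG: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Relevance grades of results in the order the system returned them
score = ndcg([3, 2, 3, 0, 1, 2])
```

A perfectly ordered result list scores 1.0; any inversion of a more-relevant item below a less-relevant one pulls the score below 1.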

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build a simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

Python
476
2 years ago
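The n-gram extraction and frequency-list tasks mentioned above are compact to illustrate. A hypothetical helper in plain Python (not PyNLPl's actual API):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of successive n-grams from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)      # e.g. ('the', 'quick'), ('quick', 'brown'), ...
freq = Counter(bigrams)          # frequency list over the extracted bigrams
```

Counting n-grams this way is the building block for simple count-based language models: conditional probabilities fall out of dividing n-gram counts by their (n-1)-gram prefix counts.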

[RA-L '25 & IROS '25] MapEval: Towards Unified, Robust and Efficient SLAM Map Evaluation Framework.

C++
396
1 month ago
Jupyter Notebook
366
1 year ago

A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems

350
19 days ago

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

Python
317
1 month ago

Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper

Python
304
4 months ago

Python SDK for running evaluations on LLM generated responses

Python
292
2 months ago