68 results found

492
5.8k
apache-2.0
27
The LLM Evaluation Framework
Created 2023-08-10
4,638 commits to main branch, last one 10 hours ago
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including OpenAI Agents SDK, CrewAI, Langchain, Autogen, AG2, and CamelAI
Created 2023-08-15
639 commits to main branch, last one 5 days ago
"A Guide to Building White-Box Large Models": a fully hand-built Tiny-Universe
Created 2024-04-06
138 commits to main branch, last one about a month ago
405
1.7k
other
50
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Created 2019-06-19
221 commits to master branch, last one about a year ago
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Created 2024-01-26
351 commits to main branch, last one 11 hours ago
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
Created 2024-10-09
73 commits to main branch, last one 2 months ago
48
816
apache-2.0
9
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
Created 2021-08-20
72 commits to master branch, last one 7 months ago
110
753
mit
13
OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)
Created 2020-03-13
1,150 commits to master branch, last one 8 months ago
101
704
apache-2.0
15
Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
Created 2018-06-19
107 commits to master branch, last one about a month ago
📈 Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
Created 2020-04-06
98 commits to master branch, last one 7 months ago
88
562
apache-2.0
19
A Neural Framework for MT Evaluation
Created 2020-05-28
561 commits to master branch, last one 5 days ago
26
534
mit
10
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
Created 2020-06-02
284 commits to master branch, last one 9 months ago
33
484
apache-2.0
4
Data-Driven Evaluation for LLM-Powered Applications
Created 2023-12-08
106 commits to main branch, last one 2 months ago
67
477
gpl-3.0
30
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks su...
Created 2010-07-06
2,160 commits to master branch, last one about a year ago
Source code for "Taming Visually Guided Sound Generation" (Oral at BMVC 2021)
Created 2021-10-17
24 commits to main branch, last one 8 months ago
[RAL'2025] MapEval: Towards Unified, Robust and Efficient SLAM Map Evaluation Framework.
Created 2022-12-30
69 commits to main branch, last one 9 days ago
30
292
bsd-3-clause
8
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
Created 2019-10-23
5 commits to master branch, last one 3 years ago
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Created 2023-10-23
343 commits to main branch, last one 4 months ago
Python SDK for running evaluations on LLM generated responses
Created 2023-11-22
784 commits to main branch, last one 4 days ago
13
272
bsd-3-clause
12
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Created 2023-06-15
366 commits to main branch, last one about a year ago
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Created 2024-03-16
93 commits to master branch, last one about a month ago
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
Created 2020-02-21
6 commits to master branch, last one 4 years ago
A Python wrapper for the ROUGE summarization evaluation package
Created 2014-01-14
38 commits to master branch, last one 5 years ago
A natural language processing problem in which sentiment analysis is performed by separating positive tweets from negative tweets using machine learning models for classification, text mining, text a...
Created 2019-03-31
10 commits to master branch, last one 5 years ago
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.
Created 2024-01-09
1,491 commits to main branch, last one 25 days ago
An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9, computed not at the tag/token level but over all the tokens that are part of the named entity
Created 2018-05-17
65 commits to master branch, last one 9 months ago
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, plus video editing benchmark code.
Created 2024-06-15
22 commits to main branch, last one 2 months ago
28
185
mit
9
CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks
Created 2020-05-27
15 commits to master branch, last one about a year ago
52
182
apache-2.0
19
🦄 Unitxt: a Python library for getting data fired up and set for training and evaluation
Created 2023-06-15
2,723 commits to main branch, last one 5 days ago
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Created 2020-07-08
66 commits to master branch, last one 6 months ago