61 results found

291
3.6k
apache-2.0
21
The LLM Evaluation Framework
Created 2023-08-10
3,764 commits to main branch, last one 17 hours ago
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks like CrewAI, Langchain, and Autogen
Created 2023-08-15
518 commits to main branch, last one 2 days ago
403
1.7k
other
49
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Created 2019-06-19
221 commits to master branch, last one about a year ago
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Created 2024-01-26
227 commits to main branch, last one a day ago
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
Created 2024-10-09
67 commits to main branch, last one 9 days ago
48
769
apache-2.0
11
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
Created 2021-08-20
72 commits to master branch, last one 3 months ago
105
731
mit
15
OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)
Created 2020-03-13
1,150 commits to master branch, last one 3 months ago
97
636
apache-2.0
15
Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
Created 2018-06-19
90 commits to master branch, last one 13 days ago
📈 Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
Created 2020-04-06
98 commits to master branch, last one 3 months ago
78
506
apache-2.0
20
A Neural Framework for MT Evaluation
Created 2020-05-28
547 commits to master branch, last one 4 months ago
67
479
gpl-3.0
31
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks su...
Created 2010-07-06
2,160 commits to master branch, last one about a year ago
26
477
mit
11
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
Created 2020-06-02
284 commits to master branch, last one 4 months ago
29
446
apache-2.0
4
Data-Driven Evaluation for LLM-Powered Applications
Created 2023-12-08
96 commits to main branch, last one 2 months ago
Source code for "Taming Visually Guided Sound Generation" (Oral at BMVC 2021)
Created 2021-10-17
24 commits to main branch, last one 4 months ago
31
285
bsd-3-clause
10
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
Created 2019-10-23
5 commits to master branch, last one 3 years ago
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Created 2023-10-23
342 commits to main branch, last one 4 months ago
13
255
bsd-3-clause
11
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Created 2023-06-15
366 commits to main branch, last one 8 months ago
A Python wrapper for the ROUGE summarization evaluation package
Created 2014-01-14
38 commits to master branch, last one 5 years ago
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
Created 2020-02-21
6 commits to master branch, last one 4 years ago
A natural language processing project in which sentiment analysis is performed by classifying positive versus negative tweets using machine learning models for classification, text mining, text a...
Created 2019-03-31
10 commits to master branch, last one 5 years ago
Python SDK for running evaluations on LLM generated responses
Created 2023-11-22
568 commits to main branch, last one a day ago
An implementation of full named-entity evaluation metrics based on SemEval'13 Task 9, computed not at the tag/token level but over all the tokens that make up a named entity
Created 2018-05-17
65 commits to master branch, last one 4 months ago
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.
Created 2024-01-09
1,145 commits to main branch, last one 19 hours ago
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Created 2024-03-16
80 commits to master branch, last one about a month ago
28
184
mit
10
CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks
Created 2020-05-27
15 commits to master branch, last one about a year ago
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Created 2020-07-08
66 commits to master branch, last one about a month ago
44
160
apache-2.0
17
🦄 Unitxt: a Python library for getting data fired up and set for training and evaluation
Created 2023-06-15
2,435 commits to main branch, last one 21 hours ago
Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13
Created 2019-02-22
255 commits to main branch, last one 9 days ago
36
158
gpl-3.0
6
Easier Automatic Sentence Simplification Evaluation
Created 2019-03-04
353 commits to master branch, last one about a year ago
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, plus video editing benchmark code.
Created 2024-06-15
15 commits to main branch, last one about a month ago