Search Results - RepositoryStats

lm-evaluation-harness EleutherAI

2.3k

8.5k

mit

38

A framework for few-shot evaluation of language models.

transformer language-model evaluation-framework

Created 2020-08-28

3,716 commits to main branch, last one 3 days ago

promptfoo promptfoo

505

6.1k

mit

21

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command ...

ci llm rag cicd ci-cd llmops prompts testing llm-eval evaluation pentesting red-teaming llm-evaluation prompt-testing prompt-engineering evaluation-framework vulnerability-scanners llm-evaluation-framework

Created 2023-04-28

4,103 commits to main branch, last one 5 hours ago

deepeval confident-ai

504

5.9k

apache-2.0

27

The LLM Evaluation Framework

llm-evaluation evaluation-metrics evaluation-framework llm-evaluation-metrics llm-evaluation-framework

Created 2023-08-10

4,685 commits to main branch, last one 14 hours ago

lighteval huggingface

216

1.4k

mit

28

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation huggingface evaluation-metrics evaluation-framework

Created 2024-01-26

353 commits to main branch, last one 3 days ago

RecSys2019_DeepLearning_Evaluation MaurizioFD

249

987

agpl-3.0

37

This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.

Created 2019-04-02

63 commits to master branch, last one 3 years ago

continuous-eval relari-ai

33

486

apache-2.0

4

Data-Driven Evaluation for LLM-Powered Applications

rag llmops llm-evaluation evaluation-metrics evaluation-framework information-retrieval retrieval-augmented-generation

Created 2023-12-08

106 commits to main branch, last one 2 months ago

tonic_validate TonicAI

30

293

mit

14

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

llm rag llms llmops evaluation-metrics evaluation-framework large-language-models retrieval-augmented-generation

Created 2023-10-23

343 commits to main branch, last one 4 months ago

AgentLab ServiceNow

59

291

other

5

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

lab llm agent agents benchmark prompting llm-agents web-agents evaluation-framework

Created 2024-05-21

542 commits to main branch, last one 7 days ago

athina-evals athina-ai

17

276

unknown

5

Python SDK for running evaluations on LLM generated responses

llmops llm-ops llm-eval evaluation llm-evaluation evaluation-metrics evaluation-framework llm-evaluation-toolkit

Created 2023-11-22

787 commits to main branch, last one 2 days ago

MixEval JinjieNi

40

234

unknown

1

The official evaluation suite and dynamic data release for MixEval.

mixeval benchmark evaluation llm-inference llm-evaluation benchmark-mixture foundation-models benchmarking-suite evaluation-framework large-language-model large-language-models benchmarking-framework large-multimodal-models llm-evaluation-framework

Created 2024-06-01

120 commits to main branch, last one 4 months ago

moonshot aiverify-foundation

48

223

apache-2.0

8

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

llm red-teaming benchmarking trustworthy-ai evaluation-framework

Created 2023-12-14

2,033 commits to main branch, last one 2 months ago

PyDGN diningphil

13

221

bsd-3-clause

5

A research library for automating experiments on Deep Graph Networks

deep-graph-networks evaluation-framework deep-learning-for-graphs

Created 2020-03-21

461 commits to main branch, last one 6 months ago

zeno zeno-ml

11

215

mit

7

AI Data Management & Evaluation Platform

ai python evaluation data-science machine-learning evaluation-framework

This repository has been archived (exclude archived)

Created 2022-02-03

1,044 commits to main branch, last one about a year ago

PySODEvalToolkit lartpang

21

176

mit

1

PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection

Created 2020-07-08

66 commits to master branch, last one 6 months ago

expressive bijington

26

168

mit

10

Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.

parsing xamarin evaluation netstandard hacktoberfest cross-platform expression-parser evaluation-framework expression-evaluator

Created 2016-06-13

291 commits to main branch, last one 6 months ago

eval-dev-quality symflower

8

164

mit

5

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

llms evaluation software-quality evaluation-framework software-development

Created 2024-03-28

823 commits to main branch, last one 7 days ago

empirical empirical-run

13

156

mit

6

Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application

llm llmops testing llm-inference test-automation testing-framework evaluation-framework

Created 2024-03-14

218 commits to main branch, last one 10 months ago

lm-evaluation AI21Labs

15

124

apache-2.0

5

Evaluation suite for large-scale language models.

language-model evaluation-framework

Created 2021-08-05

7 commits to main branch, last one 3 years ago

mlmm-evaluation nlp-uoregon

17

121

apache-2.0

4

Multilingual Large Language Models Evaluation Benchmark

nlp datasets evaluation multilingual language-model evaluation-datasets evaluation-framework large-language-models natural-language-processing

Created 2023-08-07

18 commits to main branch, last one about a year ago

eureka-ml-insights microsoft

18

117

apache-2.0

15

A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

ai llm mllm machine-learning evaluation-framework artificial-intelligence

Created 2024-07-18

86 commits to main branch, last one 2 days ago

CrowdFlow tsenst

22

113

gpl-3.0

6

Optical Flow Dataset and Benchmark for Visual Crowd Analysis

dataset tracking optical-flow trajectories crowd-analysis crowd-counting benchmark-suite computer-vision video-analytics synthetic-images video-processing motion-estimation video-surveillance evaluation-framework multi-object-tracking tracking-by-detection tub-crowdflow-dataset

Created 2018-09-10

40 commits to master branch, last one about a year ago

NL2SQL360 HKUSTDial

10

105

mit

1

Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?” (VLDB'24)

awesome text2sql text-to-sql awesome-nl2sql awesome-text2sql nl2sql-evaluation evaluation-metrics text2sql-evaluation evaluation-framework

Created 2024-01-29

66 commits to master branch, last one 12 days ago

eva kaiko-ai

9

97

apache-2.0

5

Evaluation framework for oncology foundation models (FMs)

oncology machine-learning foundation-models evaluation-framework

Created 2024-01-16

340 commits to main branch, last one 6 days ago

EuroEval EuroEval

24

94

mit

7

The robust European language model benchmark.

llms european dutch-language danish-language french-language german-language english-language faroese-language italian-language spanish-language swedish-language icelandic-language norwegian-language evaluation-framework nlp-machine-learning

Created 2021-07-17

3,117 commits to main branch, last one 2 days ago

lidar_slam_evaluator haeyeoni

16

94

unknown

2

LiDAR SLAM comparison and evaluation framework

slam lidar-slam evaluation-framework

Created 2021-07-26

17 commits to main branch, last one 3 years ago

codefuse-evaluation codefuse-ai

13

92

other

3

Industrial-level evaluation benchmarks for Coding LLMs in the full life-cycle of AI native software developing.企业级代码大模型评测体系,持续开放中

lcc codefuse codetranseval code-evaluation codecommenteval repository-eval evaluation-framework

Created 2023-09-28

28 commits to master branch, last one 7 days ago

pyRDDLGym pyrddlgym-project

21

80

mit

7

A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.

Created 2022-07-10

2,123 commits to main branch, last one 15 days ago

SORDI-AI-Evaluation-GUI BMW-InnovationLab

5

79

apache-2.0

2

This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.

ai bmw sordi docker python dataset no-code rest-api evaluation tensorflow deeplearning synthetic-data computer-vision evaluation-framework

Created 2022-10-28

10 commits to main branch, last one about a year ago

DialogEntailment nouhadziri

5

74

mit

6

The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"

bert dialogue-evaluation evaluation-framework natural-language-inference

Created 2019-04-13

10 commits to master branch, last one 6 months ago

Diffusion-MU-Attack OPTML-Group

3

72

mit

0

The official implementation of ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces one fast and effe...

robustness unlearning stable-diffusion adversarial-attacks evaluation-framework attack-unlearned-diffusion-model

Created 2023-10-17

70 commits to main branch, last one about a month ago