Trending repositories for topic benchmark
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Ohayou (おはよう), an HTTP load generator inspired by rakyll/hey, with TUI animation.
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Cista is a simple, high-performance, zero-copy C++ serialization & reflection library.
YABS - a simple bash script to estimate Linux server performance using fio, iperf3, & Geekbench
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Benchmarking Knowledge Transfer in Lifelong Robot Learning
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
An elegant RTSS Overlay to showcase your benchmark stats in style.
A straightforward JavaScript benchmarking tool and REPL with support for ES modules and libraries.
High-precision and consistent benchmarking framework/harness for Rust
BigCodeBench: Benchmarking Code Generation Towards AGI
A large-scale benchmark for machine learning methods in fluid dynamics
An open-source toolbox for fast sampling of diffusion models. Official implementations of our works published in ICML, NeurIPS, CVPR.
Learning how to write "Less Slow" code in C++20, C99, and Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking, and user-space I/O
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
Benchmark dataset and deep learning method (Hierarchical Interaction Network, HINT) for clinical trial approval probability prediction, published in Cell Patterns 2022.
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to explore the technical boundaries of generative AI.
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
An MNIST-like fashion product database and benchmark.
An agent benchmark with tasks in a simulated software company.
RAID is the largest and most challenging benchmark for machine-generated text detectors. (ACL 2024)
VPS test scripts | VPS performance tests (basic VPS info, I/O performance, global speed tests, ping, return-route tests), a BBR acceleration script (a TCP congestion-control technique for faster connections), three-ISP speed-test scripts (speed tests across China's three major ISPs, streaming-media checks), and route tests (a one-click return-route test script for Linux VPS)
A benchmark dataset collection for bird sound classification
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
Benchmark for quadratic programming solvers available in Python
VPS Fusion Monster Server Test Script (aims to be the most comprehensive all-in-one server-testing script)
The fastest and most memory-efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
A Contamination-free Multi-task Language Understanding Benchmark
TSB-AD: Towards A Reliable Time-Series Anomaly Detection Benchmark
ColdRec: An Open-Source Benchmark Toolbox for Cold-Start Recommendation.
[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test generation
The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
[NeurIPS'24 Spotlight] A comprehensive benchmark & codebase for image manipulation detection/localization.
🔥🔥🔥 Latest Advances on Large Recommendation Models
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
A mini-framework for evaluating LLM performance on the Bulls and Cows number guessing game, supporting multiple LLM providers.
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient LLM GPU selections and cost-effective AI models. LLM provider p...
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion-model-based video editing, a.k.a. video-to-video (V2V) translation, plus video-editing benchmark code.
A self-alignment method for role-play and a role-play benchmark. Resources for "Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment".
Awesome Deep Learning for Time-Series Imputation: a must-read paper list on applying neural networks to impute incomplete time series with missing (NaN) values
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
🚀 A comprehensive performance comparison benchmark between different .NET collections.
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
A repository of papers, code, and datasets on domain generalization-based fault diagnosis and prognosis.
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.