Statistics for topic benchmark
RepositoryStats tracks 638,211 GitHub repositories; of these, 831 are tagged with the benchmark topic. The most common primary language for repositories using this topic is Python (354). Other languages include C++ (54), Jupyter Notebook (53), Go (52), JavaScript (27), Rust (24), TypeScript (24), C (24), Shell (23), and Java (21).
Stargazers over time for topic benchmark
Most starred repositories for topic benchmark
Trending repositories for topic benchmark
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Ohayou (おはよう), an HTTP load generator inspired by rakyll/hey, with TUI animation.
SWE-bench [Multimodal]: Can Language Models Resolve Real-world GitHub Issues?
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
The fastest and most memory-efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
Overview of pipelines related to PDF document processing.
🚀 Spiko is a fast, Rust-based load testing tool with a beautiful TUI for real-time insights.
Official implementation for WorldScore: A Unified Evaluation Benchmark for World Generation
[NeurIPS'24 Spotlight] A comprehensive benchmark & codebase for Image manipulation detection/localization.
A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other
Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions
Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing economi...
WritingBench: A Comprehensive Benchmark for Generative Writing
Financial Time Series Benchmark (FinTSB): A Comprehensive and Practical Benchmark for Financial Time Series Forecasting
[CVPR 25 (Highlight)] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
VPS Fusion Monster Server Test Script, Go version (aims to be the most comprehensive server-testing project; no additional environment dependencies)
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
VPS Fusion Monster Server Test Script. The version without environment dependencies is recommended instead: https://github.com/oneclickvirt/ecs
Official code repository of "CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph"
Learning how to write "Less Slow" code in C++20, C99, CUDA, PTX, and Assembly, from numerics and SIMD to coroutines, ranges, exception handling, networking, and user-space IO.
A benchmark list for evaluating large language models.