Trending repositories for topic benchmark
Ohayou (おはよう), an HTTP load generator inspired by rakyll/hey, with TUI animation.
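oha drives a target URL with many concurrent requests and reports the latency distribution in the terminal. As a rough illustration of what such a tool measures (not oha's actual implementation, which is Rust; the URL and request counts below are placeholders), a minimal asyncio sketch:

```python
# Minimal sketch of an HTTP load generator: fire `total` requests with
# bounded concurrency and report mean and p95 latency.
import asyncio
import time

import aiohttp


async def timed_get(session: aiohttp.ClientSession, url: str,
                    sem: asyncio.Semaphore, latencies: list[float]) -> None:
    async with sem:
        start = time.perf_counter()
        async with session.get(url) as resp:
            await resp.read()
        latencies.append(time.perf_counter() - start)


async def run(url: str, total: int = 200, concurrency: int = 50) -> None:
    sem = asyncio.Semaphore(concurrency)
    latencies: list[float] = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(timed_get(session, url, sem, latencies)
                               for _ in range(total)))
    latencies.sort()
    mean = sum(latencies) / len(latencies)
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"requests: {len(latencies)}  mean: {mean:.4f}s  p95: {p95:.4f}s")


if __name__ == "__main__":
    asyncio.run(run("http://localhost:8080/"))  # placeholder target
```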
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
YABS - a simple bash script to estimate Linux server performance using fio, iperf3, & Geekbench
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
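SWE-bench pairs real GitHub issues with the repository state they were filed against and checks whether a model's patch passes the project's tests. A quick way to inspect the task instances, assuming the dataset is the one published on Hugging Face as princeton-nlp/SWE-bench (id and field names may differ across releases, e.g. SWE-bench_Lite):

```python
# Browse SWE-bench task instances via Hugging Face datasets.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
task = swebench[0]
print(task["repo"], task["instance_id"])
print(task["problem_statement"][:300])  # the GitHub issue text
```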
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
An MNIST-like fashion product database and benchmark.
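This blurb matches Fashion-MNIST: 70,000 grayscale 28x28 images in 10 clothing classes, split 60k/10k, designed as a drop-in replacement for MNIST. One well-known loader ships with Keras (assuming TensorFlow is installed):

```python
# Load Fashion-MNIST through Keras; shapes follow the standard split.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```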
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
JMLR: OmniSafe is an infrastructural framework for accelerating SafeRL research.
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
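This is EvalPlus (HumanEval+/MBPP+), which re-scores LLM-generated code against much larger test suites. A sketch of producing a sample file for its evaluator, with API names taken from the project's README (treat them as assumptions if your version differs):

```python
# Generate one solution per HumanEval+ task and write them in the
# JSONL format EvalPlus's evaluator consumes.
from evalplus.data import get_human_eval_plus, write_jsonl


def complete(prompt: str) -> str:
    # Placeholder: call your model here and return its code completion.
    return "    pass\n"


samples = [
    {"task_id": task_id, "solution": problem["prompt"] + complete(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```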
Learning how to write "Less Slow" code in C++20, from numerical micro-kernels and SIMD to coroutines, ranges, and polymorphic state machines
A large-scale benchmark for machine learning methods in fluid dynamics
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
An agent benchmark with tasks in a simulated software company.
High-precision and consistent benchmarking framework/harness for Rust
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
VPS Fusion Monster Server Test Script (VPS融合怪服务器测评项目), which aims to be the most comprehensive all-in-one server-testing script.
Official code for CVPR 2022 (Oral) paper "Deep Visual Geo-localization Benchmark"
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Human Benchmark is a Flutter app for Android with many tests that measure your abilities.
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
RAID is the largest and most challenging benchmark for machine-generated text detectors. (ACL 2024)
Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models (ACL 2024)
A Python tool to evaluate the performance of VLMs in the medical domain.
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
🔥🔥🔥 Latest Advances on Large Recommendation Models
CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
[NeurIPS 2024] A benchmark suite for autoregressive neural emulation of PDEs. (≥46 PDEs in 1D, 2D, 3D; Differentiable Physics; Unrolled Training; Rollout Metrics)
The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
TSB-AD: Towards A Reliable Time-Series Anomaly Detection Benchmark
🌈 Visualizes your BenchmarkDotNet benchmarks as colorful images and feature-rich HTML (and maybe powerful charts in the future!)
This repository contains the implementation for the paper "Revisiting Few Shot Object Detection with Vision-Language Models"
This repo contains the code and data for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks"
[EMNLP 2024 Findings] To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
A mini-framework for evaluating LLM performance on the Bulls and Cows number guessing game, supporting multiple LLM providers.
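Whatever this framework's own API looks like, the game it scores is fixed: bulls are correct digits in the correct position, cows are correct digits in the wrong position. A self-contained scorer for reference (not the repo's code):

```python
# Score a Bulls and Cows guess against the secret.
from collections import Counter


def score(secret: str, guess: str) -> tuple[int, int]:
    # Bulls: exact positional matches.
    bulls = sum(s == g for s, g in zip(secret, guess))
    # Multiset overlap counts every shared digit regardless of position;
    # subtracting bulls leaves the cows.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return bulls, overlap - bulls


assert score("1807", "7810") == (1, 3)  # one bull (the 8), three cows
```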
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark performance across instance types and serving-stack options.
This project collects GPU benchmarks from various cloud providers and compares them to fixed per-token costs. Use our tool for efficient LLM GPU selection and cost-effective AI models. LLM provider p...
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
Environments, tools, and benchmarks for general computer agents
A self-alignment method and benchmark for role-play. Resources for "Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment".
A straightforward JavaScript benchmarking tool and REPL with support for ES modules and libraries.
An open-source toolbox for fast sampling of diffusion models. Official implementations of our works published in ICML, NeurIPS, CVPR.
A new multi-shot video understanding benchmark, Shot2Story, with comprehensive video summaries and detailed shot-level captions.
This repository includes papers, code, and datasets on domain generalization-based fault diagnosis and prognosis.
🚀 A comprehensive performance comparison benchmark between different .NET collections.
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.