Trending repositories for topic benchmark
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Ohayou (おはよう), an HTTP load generator inspired by rakyll/hey, with TUI animation.
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Cista is a simple, high-performance, zero-copy C++ serialization & reflection library.
YABS - a simple bash script to estimate Linux server performance using fio, iperf3, & Geekbench
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Benchmarking Knowledge Transfer in Lifelong Robot Learning
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
An elegant RTSS Overlay to showcase your benchmark stats in style.
A straightforward JavaScript benchmarking tool and REPL with support for ES modules and libraries.
High-precision and consistent benchmarking framework/harness for Rust
BigCodeBench: Benchmarking Code Generation Towards AGI
A large-scale benchmark for machine learning methods in fluid dynamics
An open-source toolbox for fast sampling of diffusion models. Official implementations of our works published in ICML, NeurIPS, CVPR.
Learning how to write "Less Slow" code in C++20, C99, and Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking, and user-space I/O
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
Benchmark dataset and deep learning method (Hierarchical Interaction Network, HINT) for clinical trial approval probability prediction, published in Cell Patterns 2022.
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to explore the technical boundaries of generative AI.
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
An MNIST-like fashion product database and benchmark.
An agent benchmark with tasks in a simulated software company.
RAID is the largest and most challenging benchmark for machine-generated text detectors. (ACL 2024)
VPS test scripts | VPS performance tests (basic VPS info, I/O performance, global speed tests, ping, return-route tests), a BBR acceleration script (a TCP congestion-control technique for faster connections), three-ISP speed-test scripts (speed tests across China's three major ISPs, streaming-media checks), and route tests (a one-click return-route test script for Linux VPS)
A benchmark dataset collection for bird sound classification
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
Benchmark for quadratic programming solvers available in Python
VPS Fusion Monster Server Test Script (aims to be the most comprehensive all-in-one server-testing script)
The fastest and most memory-efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
A Contamination-free Multi-task Language Understanding Benchmark
TSB-AD: Towards A Reliable Time-Series Anomaly Detection Benchmark
ColdRec: An Open-Source Benchmark Toolbox for Cold-Start Recommendation.
[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test generation
The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
[NeurIPS'24 Spotlight] A comprehensive benchmark & codebase for image manipulation detection/localization.
🔥🔥🔥 Latest Advances on Large Recommendation Models
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
A mini-framework for evaluating LLM performance on the Bulls and Cows number guessing game, supporting multiple LLM providers.
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient LLM GPU selections and cost-effective AI models. LLM provider p...
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion-model-based video editing, a.k.a. video-to-video (V2V) translation, plus video-editing benchmark code.
A self-alignment method for role-play and a role-play benchmark. Resources for "Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment".
Awesome Deep Learning for Time-Series Imputation: a must-read paper list on applying neural networks to impute incomplete time series with missing (NaN) values
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
🚀 A comprehensive performance comparison benchmark between different .NET collections.
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
A repository of papers, code, and datasets on domain generalization-based fault diagnosis and prognosis.
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.