Trending repositories for topic benchmark
Ohayou (おはよう, Japanese for "good morning"), an HTTP load generator inspired by rakyll/hey, with TUI animation.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
YABS - a simple bash script to estimate Linux server performance using fio, iperf3, & Geekbench
XcodeBenchmark measures the compilation time of a large codebase on iMac, MacBook, and Mac Pro
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
[NeurIPS 2024] Touchstone - Benchmarking AI on 5,172 out-of-distribution (o.o.d.) CT volumes and 9 anatomical structures
Benchmarking Knowledge Transfer in Lifelong Robot Learning
This repo contains the code and data for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks"
Human Benchmark is a Flutter app for Android with many tests that measure your abilities.
SustainDC is a set of Python environments for Data Center simulation and control using Heterogeneous Multi-Agent Reinforcement Learning. Includes customizable environments for workload scheduling, coo...
The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
[IJRR2024] The official repository for the WildScenes: A Benchmark for 2D and 3D Semantic Segmentation in Natural Environments
This is a repository of papers, code, and datasets on domain generalization-based fault diagnosis and prognosis. (Domain generalization-based fault diagnosis and prognosis; continuously updated.)
MultiCorrupt: A benchmark for robust multi-modal 3D object detection, evaluating LiDAR-Camera fusion models in autonomous driving. Includes diverse corruption types (e.g., misalignment, miscalibration...
Benchmark of Apple MLX operations on all Apple Silicon chips (GPU, CPU) + MPS and CUDA.
[NeurIPS'24 Spotlight] A comprehensive benchmark & codebase for image manipulation detection/localization.
Awesome Deep Learning for Time-Series Imputation, including a must-read paper list on applying neural networks to impute incomplete time series containing NaN missing values.
nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset
Java Virtual Machine (JVM) Performance Benchmarks with a primary focus on top-tier Just-In-Time (JIT) Compilers, such as C2 JIT, Graal JIT, and the Falcon JIT.
[CVPR 2024 Extension] Datasets of 160K volumes (42M slices), new segmentation datasets, pre-trained models from 31M to 1.2B parameters, various pre-training recipes, and 50+ downstream task implementations
VPS Fusion Monster Server Test Script (VPS融合怪服务器测评项目): a script that aims to test servers as comprehensively as possible
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
Official repo for AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI
A Python tool to evaluate the performance of VLMs in the medical domain.
🔥 Aurora Series: A more efficient multimodal large language model series for video.
A simple benchmark testing tool implemented in Go with some small features
An open collaborative repository for reproducible specifications of HPC benchmarks and cross-site benchmarking environments
🤗 AeroPath: An airway segmentation benchmark dataset with challenging pathology
This repository is the official implementation of the paper "Convolutional Neural Operators for robust and accurate learning of PDEs"
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
An MNIST-like fashion product database. Benchmark 👇
[NeurIPS 2024 D&B] Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning
VPS test script | VPS performance tests (basic VPS info, I/O performance, global speed tests, ping, return-route tests), BBR acceleration script (a congestion-control technique for speeding up TCP), three-ISP speed test script (speed tests across China's three major ISPs, streaming media detection), and route testing (one-click Linux VPS return-route test script)
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
This project collects GPU benchmarks from various cloud providers and compares them to fixed per-token costs. Use our tool for efficient LLM GPU selection and cost-effective AI models. LLM provider p...
An open-source toolbox for fast sampling of diffusion models. Official implementations of our works published in ICML, NeurIPS, CVPR.
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
A series of large language models developed by Baichuan Intelligent Technology
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
A straightforward JavaScript benchmarking tool and REPL with support for ES modules and libraries.
🚀 A comprehensive performance comparison benchmark across different .NET collections.
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
[NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM Instruction Tuning
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.
A benchmark for spaced repetition schedulers/algorithms
A self-alignment method for role-play. Benchmark for role-play. Resources for "Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment".