Statistics for topic benchmark
RepositoryStats tracks 584,796 GitHub repositories; of these, 757 are tagged with the benchmark topic. The most common primary language for repositories using this topic is Python (303). Other languages include C++ (52), Jupyter Notebook (51), Go (49), JavaScript (27), C (23), Java (22), TypeScript (22), Shell (20), and Rust (19).
Stargazers over time for topic benchmark
Most starred repositories for topic benchmark
Trending repositories for topic benchmark
Ohayou (おはよう): an HTTP load generator, inspired by rakyll/hey, with TUI animation.
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) on 100+ datasets.
This repo contains the code and data for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks".
[NeurIPS 2024] Touchstone: Benchmarking AI on 5,172 out-of-distribution (o.o.d.) CT volumes and 9 anatomical structures
Human Benchmark is a Flutter app for Android with many tests of your abilities.
SustainDC is a set of Python environments for Data Center simulation and control using Heterogeneous Multi Agent Reinforcement Learning. Includes customizable environments for workload scheduling, coo...
VPS Fusion Monster Server Test Script (VPS融合怪服务器测评项目): a script that aims to test servers as comprehensively as possible
Official repo for AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
[CVPR 2024 Extension] 160K-volume (42M-slice) datasets, new segmentation datasets, pre-trained models ranging from 31M to 1.2B parameters, various pre-training recipes, and implementations of 50+ downstream tasks
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM.
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
📰 Must-read papers and blogs on LLM-based long-context modeling 🔥
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718