Statistics for topic benchmark
RepositoryStats tracks 595,858 GitHub repositories; of these, 777 are tagged with the benchmark topic. The most common primary language for repositories using this topic is Python (311). Other languages include: Jupyter Notebook (52), C++ (52), Go (50), JavaScript (27), C (23), Java (22), TypeScript (22), Rust (21), Shell (21)
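A quick sanity check on the figures above, as a minimal Python sketch using only the counts quoted on this page (the variable names are illustrative, not part of any RepositoryStats API):

```python
# Counts quoted on this page.
total_tracked = 595_858    # all GitHub repositories tracked
benchmark_tagged = 777     # repositories tagged with the benchmark topic
python_repos = 311         # benchmark-tagged repos whose primary language is Python

# Share of the benchmark topic among all tracked repositories.
topic_share = benchmark_tagged / total_tracked * 100

# Python's share within the benchmark topic.
python_share = python_repos / benchmark_tagged * 100

print(f"benchmark topic share of tracked repos: {topic_share:.2f}%")  # ≈ 0.13%
print(f"Python share within the topic: {python_share:.1f}%")          # ≈ 40.0%
```

So roughly two in five benchmark-tagged repositories use Python as their primary language, while the topic itself covers only about a tenth of a percent of all tracked repositories.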
Stargazers over time for topic benchmark
Most starred repositories for topic benchmark
Trending repositories for topic benchmark
Ohayou (おはよう, "good morning"): an HTTP load generator inspired by rakyll/hey, with TUI animation.
An agent benchmark with tasks in a simulated software company.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
Awesome multi-modal large language model papers/projects; collections of popular training strategies, e.g., PEFT, LoRA.
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
🔥🔥🔥 Latest Advances on Large Recommendation Models
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
A mini-framework for evaluating LLM performance on the Bulls and Cows number guessing game, supporting multiple LLM providers.
VPS融合怪服务器测评项目 (VPS Fusion Monster Server Test Script): a script that aims to test servers as comprehensively as possible.
YABS - a simple bash script to estimate Linux server performance using fio, iperf3, & Geekbench
[NeurIPS 2024] A benchmark suite for autoregressive neural emulation of PDEs. (≥46 PDEs in 1D, 2D, 3D; Differentiable Physics; Unrolled Training; Rollout Metrics)
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM.
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
An open-source toolbox for fast sampling of diffusion models. Official implementations of our works published in ICML, NeurIPS, CVPR.