Statistics for topic benchmark
RepositoryStats tracks 559,058 GitHub repositories; of these, 716 are tagged with the benchmark topic. The most common primary language for repositories using this topic is Python (280). Other languages include: C++ (50), Go (48), Jupyter Notebook (45), JavaScript (25), C (23), Java (22), TypeScript (22), Shell (20), Rust (18)
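As a quick sanity check on the counts above, the language breakdown can be tallied like this (an illustrative Python sketch, not code from the site; the numbers are taken directly from the paragraph above):

```python
# Language counts for the 716 repositories tagged "benchmark", as reported above.
counts = {
    "Python": 280, "C++": 50, "Go": 48, "Jupyter Notebook": 45,
    "JavaScript": 25, "C": 23, "Java": 22, "TypeScript": 22,
    "Shell": 20, "Rust": 18,
}
total_tagged = 716  # repositories carrying the benchmark topic

# Share of tagged repositories per primary language, in percent.
shares = {lang: round(100 * n / total_tagged, 1) for lang, n in counts.items()}
print(shares["Python"])  # -> 39.1

# The ten listed languages account for 553 repositories; the remaining 163
# use other primary languages (or have none detected).
listed = sum(counts.values())
print(total_tagged - listed)  # -> 163
```

So Python alone covers roughly 39% of the tagged repositories, more than the next nine languages combined.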
Stargazers over time for topic benchmark
Most starred repositories for topic benchmark
Trending repositories for topic benchmark
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Ohayou (おはよう), an HTTP load generator inspired by rakyll/hey, with TUI animation.
[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?
M3DBench introduces a comprehensive 3D instruction-following dataset with support for interleaved multi-modal prompts. Furthermore, M3DBench provides a new benchmark to assess large models across 3D v...
A benchmark for spaced repetition schedulers/algorithms
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient LLM GPU selections and cost-effective AI models. LLM provider p...
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
RAID is the largest and most challenging benchmark for machine-generated text detectors. (ACL 2024)
Benchmark for LLM Reasoning & Understanding with Challenging Tasks from Real Users.
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
A series of large language models developed by Baichuan Intelligent Technology
VPS Fusion Monster Server Test Script (VPS融合怪服务器测评脚本): aims to be the most comprehensive all-in-one server benchmarking script.
Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph