Statistics for topic benchmark
RepositoryStats tracks 518,991 GitHub repositories; 636 of these are tagged with the benchmark topic. The most common primary language for repositories using this topic is Python (239). Other languages include: C++ (47), Go (46), Jupyter Notebook (31), C (23), JavaScript (23), Java (22), TypeScript (19), Rust (17), and Shell (16).
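The language shares implied by these counts can be checked with a short script. The numbers below are taken directly from the paragraph above; repositories outside the listed languages are grouped as "other", which is an assumption, since the page does not enumerate the remaining languages.

```python
# Language breakdown for the 636 repositories tagged "benchmark",
# using the counts quoted above.
counts = {
    "Python": 239, "C++": 47, "Go": 46, "Jupyter Notebook": 31,
    "C": 23, "JavaScript": 23, "Java": 22, "TypeScript": 19,
    "Rust": 17, "Shell": 16,
}
total = 636

# Languages not listed on the page are lumped together (assumption).
counts["other"] = total - sum(counts.values())

# Print each language with its count and share of the topic.
for lang, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{lang:16s} {n:4d}  {n / total:6.1%}")
```

Python alone accounts for roughly 37.6% of the tagged repositories, more than the next nine listed languages combined.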
Stargazers over time for topic benchmark
Most starred repositories for topic benchmark
Trending repositories for topic benchmark
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Ohayou (おはよう), an HTTP load generator inspired by rakyll/hey, with TUI animation.
This is the official implementation of "LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models", and it is also an efficient LLM compression tool wit...
A small OpenCL benchmark program to measure peak GPU/CPU performance.
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
Awesome Deep Learning Resources for Time-Series Imputation, including a must-read paper list about using deep learning neural networks to impute incomplete time series containing NaN missing values/da...
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
VPS Fusion Monster Server Test Script (VPS融合怪服务器测评脚本): aims to be the most comprehensive all-in-one server benchmarking script.
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
A series of large language models developed by Baichuan Intelligent Technology
A 13B large language model developed by Baichuan Intelligent Technology
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Benchmark LLMs by fighting in Street Fighter 3! A new way to evaluate the quality of an LLM.
[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world GitHub Issues?
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks