Trending repositories for topic benchmark
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Ohayou (おはよう, Japanese for "good morning"): an HTTP load generator inspired by rakyll/hey, with TUI animation.
[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
YABS - a simple bash script to estimate Linux server performance using fio, iperf3, & Geekbench
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Benchmarks of approximate nearest neighbor libraries in Python
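The ann-benchmarks entry above reports recall/queries-per-second trade-offs. A minimal, self-contained sketch of the recall@k metric such benchmarks plot — toy data and a deliberately crude "approximate" search standing in for a real index, not the project's actual harness:

```python
# Sketch of recall@k: compare an approximate result set against exact
# brute-force neighbors. The "approximate" search here (brute force over a
# random subsample) is a toy stand-in for a real ANN index.
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    """Fraction of the true k nearest neighbors found by the approximate search."""
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 32)).astype(np.float32)
queries = rng.normal(size=(10, 32)).astype(np.float32)
k = 10

# Exact neighbors via brute-force Euclidean distance.
dists = np.linalg.norm(queries[:, None, :] - data[None, :, :], axis=-1)
exact = np.argsort(dists, axis=1)[:, :k]

# Toy "approximate" search: brute force over a random 50% subsample.
subset = rng.choice(len(data), size=len(data) // 2, replace=False)
sub_dists = np.linalg.norm(queries[:, None, :] - data[subset][None, :, :], axis=-1)
approx = subset[np.argsort(sub_dists, axis=1)[:, :k]]

print(f"recall@{k}: {recall_at_k(approx, exact, k):.2f}")
```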
This is the official implementation of "LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models", and it is also an efficient LLM compression tool with...
Microbenchmarks comparing the Julia Programming language with other languages
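Cross-language microbenchmark suites like the one above time small, well-defined kernels and report best-of-N runs to reduce timer and OS noise. A minimal sketch of that measurement pattern in Python (not the suite's actual harness):

```python
# Time a tiny kernel (recursive Fibonacci) with repeats; report the best run,
# since the minimum is least contaminated by scheduler/timer noise.
import timeit

def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Best of 5 trials, each calling fib(20) 100 times; convert to per-call time.
best = min(timeit.repeat(lambda: fib(20), repeat=5, number=100)) / 100
print(f"fib(20): {best * 1e6:.1f} us per call")
```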
Awesome Time-Series Imputation Papers, including a must-read list of papers on using deep neural networks to impute incomplete time series with missing (NaN) values.
[CVPR-2024, Highlight, Top 2.8%] Official implementation for "Fast ODE-based Sampling for Diffusion Models in Around 5 Steps".
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
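xFinder's pitch is replacing brittle rule-based answer extraction with a trained extractor model. For context, the kind of regex baseline it targets looks roughly like this sketch (the patterns are illustrative, not xFinder's):

```python
# Naive rule-based answer extraction for multiple-choice output — the brittle
# baseline that trained extractors like xFinder aim to improve on.
import re

def extract_choice(response: str) -> str | None:
    """Pull a multiple-choice answer letter out of free-form model output."""
    patterns = [
        r"answer is\s*\(?([A-D])\)?",   # e.g. "the answer is (B)"
        r"^\s*\(?([A-D])\)?[.:\s]",      # response starting with "B." or "(B):"
    ]
    for pat in patterns:
        m = re.search(pat, response, flags=re.IGNORECASE | re.MULTILINE)
        if m:
            return m.group(1).upper()
    return None  # extraction failed; unusually phrased answers slip through

print(extract_choice("After some thought, the answer is (c)."))  # -> C
```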
This repository is the official implementation of the paper Convolutional Neural Operators for robust and accurate learning of PDEs
Benchmark of Apple MLX operations on all Apple Silicon chips (GPU, CPU) + MPS and CUDA.
The xAST evaluation benchmark makes security tools no longer a "black box".
VPS Fusion Monster server benchmark script (aiming to be the most comprehensive all-in-one server-testing script).
Video Foundation Models & Data for Multimodal Understanding
A straightforward JavaScript benchmarking tool and REPL with support for ES modules and libraries.
This is a benchmark for domain generalization-based fault diagnosis.
MultiCorrupt: A benchmark for robust multi-modal 3D object detection, evaluating LiDAR-Camera fusion models in autonomous driving. Includes diverse corruption types (e.g., misalignment, miscalibration...
A small OpenCL benchmark program to measure peak GPU/CPU performance.
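The OpenCL benchmark above measures peak device throughput. A rough CPU analogue of that measurement in Python — time a large matrix multiply and convert to GFLOP/s (NumPy delegates to BLAS, so this gauges the BLAS library's throughput, not hand-written loops):

```python
import time
import numpy as np

n = 2048
rng = np.random.default_rng(0)
a = rng.normal(size=(n, n)).astype(np.float32)
b = rng.normal(size=(n, n)).astype(np.float32)

a @ b  # warm-up run (thread-pool spin-up, cache effects)
start = time.perf_counter()
a @ b
elapsed = time.perf_counter() - start

flops = 2 * n**3  # n*n output elements, each needing n multiply-adds
print(f"~{flops / elapsed / 1e9:.1f} GFLOP/s (float32 matmul, n={n})")
```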
A new multi-shot video understanding benchmark, Shot2Story, with comprehensive video summaries and detailed shot-level captions.
Foundation model benchmarking tool. Run any model on Amazon SageMaker and benchmark for performance across instance type and serving stack options.
A series of large language models developed by Baichuan Intelligent Technology
A 13B large language model developed by Baichuan Intelligent Technology
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for LLM evaluation, aiming to explore the technical frontier of generative AI.
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
Easily download and evaluate pre-trained Visual Place Recognition methods. Code built for the ICCV 2023 paper "EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition"
This is a repository that includes papers, code, and datasets about domain generalization-based fault diagnosis and prognosis (continuously updated).