Trending repositories for topic benchmark
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Ohayou (おはよう), an HTTP load generator inspired by rakyll/hey, with TUI animation.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
HTTP(S) benchmark tools, testing/debugging, & REST APIs (RESTful)
[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?
YABS - a simple bash script to estimate Linux server performance using fio, iperf3, & Geekbench
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, plus video editing benchmark code.
BigCodeBench: Benchmarking Code Generation Towards AGI
A benchmark for spaced repetition schedulers/algorithms
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust)
This is a repository that includes papers, code, and datasets about domain generalization-based fault diagnosis and prognosis (continuously updated).
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Awesome Deep Learning for Time-Series Imputation, including a must-read paper list about applying neural networks to impute incomplete time series containing NaN missing values/data
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
A small OpenCL benchmark program to measure peak GPU/CPU performance.
ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024.
RAID is the largest and most challenging benchmark for machine-generated text detectors. (ACL 2024)
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
A self-alignment method for role-play, a benchmark for role-play, and resources for "Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment".
This project collects GPU benchmarks from various cloud providers and compares them to fixed per-token costs. Use our tool for efficient LLM GPU selection and cost-effective AI models. LLM provider p...
VPS融合怪 server benchmark script (VPS Fusion Monster Server Test Script), aiming to be the most comprehensive all-in-one server testing script.
An objective comparison of multiple frameworks that allow us to "transform" our web apps into desktop applications.
Benchmarks of approximate nearest neighbor libraries in Python
An elegant RTSS Overlay to showcase your benchmark stats in style.
Benchmark for LLM Reasoning & Understanding with Challenging Tasks from Real Users.
High-precision and consistent benchmarking framework/harness for Rust
Official code repository of CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models
This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
A benchmark that challenges language models to code solutions for scientific problems
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
A series of large language models developed by Baichuan Intelligent Technology
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023
A straightforward JavaScript benchmarking tool and REPL with support for ES modules and libraries.
A large-scale benchmark for machine learning methods in fluid dynamics
🚀 A comprehensive performance comparison benchmark between different .NET collections.
This repository is the official implementation of the paper "Convolutional Neural Operators for robust and accurate learning of PDEs".
Easily download and evaluate pre-trained Visual Place Recognition methods. Code built for the ICCV 2023 paper "EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition"
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM