10 results found
Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter updates
Created
2024-01-16
62 commits to main branch, last one 7 months ago
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Created
2024-10-10
79 commits to master branch, last one 7 days ago
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessmen...
Created
2024-04-02
394 commits to main branch, last one 24 hours ago
A benchmark for prompt injection detection systems.
Created
2024-03-27
60 commits to main branch, last one 2 days ago
A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks
Created
2024-01-18
84 commits to main branch, last one 6 months ago
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Created
2024-06-11
32 commits to main branch, last one 2 months ago
[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in a specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate ...
Created
2024-12-21
18 commits to main branch, last one about a month ago
LLM-KG-Bench is a framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.
Created
2023-05-24
524 commits to main branch, last one 29 days ago
Program synthesis for 3D spatial reasoning
Created
2024-12-10
5 commits to main branch, last one 2 months ago
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...
Created
2024-10-22
12 commits to master branch, last one about a month ago