6 results found

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Created 2024-10-10
78 commits to master branch, last one 3 days ago
Ranking LLMs on agentic tasks
Created 2025-02-10
7 commits to main branch, last one 2 months ago
31
89
mit
8
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Created 2024-08-08
552 commits to main branch, last one 19 hours ago
One click to open multiple AI sites and view their results
Created 2020-05-20
58 commits to master branch, last one 2 months ago
Code scanner to check for issues in prompts and LLM calls
Created 2025-03-14
51 commits to master branch, last one 9 days ago
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...
Created 2024-10-22
12 commits to master branch, last one 26 days ago