6 results found
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Created
2024-10-10
78 commits to master branch, last one 3 days ago
Ranking LLMs on agentic tasks
Created
2025-02-10
7 commits to main branch, last one 2 months ago
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Created
2024-08-08
552 commits to main branch, last one 19 hours ago
One click to open multiple AI sites and view the AI results
Created
2020-05-20
58 commits to master branch, last one 2 months ago
Code scanner to check for issues in prompts and LLM calls
Created
2025-03-14
51 commits to master branch, last one 9 days ago
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...
Created
2024-10-22
12 commits to master branch, last one 26 days ago