Search Results - RepositoryStats

3

120

unknown

7

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

o1 llm rag llama claude gemini o3-mini benchmark gemini-pro deepseek-r1 leaderboard ai-evaluation confabulations hallucinations language-model llm-benchmarking

Created 2024-10-10

78 commits to master branch, last one 3 days ago

agent-leaderboard rungalileo

11

107

mit

4

Ranking LLMs on agentic tasks

ai llms ai-agents evaluation ai-evaluation

Created 2025-02-10

7 commits to main branch, last one 2 months ago

vivaria METR

31

89

mit

8

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

ai evals elicitation ai-evaluation

Created 2024-08-08

552 commits to main branch, last one 19 hours ago

AI-Shortcuts taoAIGC

4

65

unknown

8

Open multiple AI sites with one click and view their results

ai llm poe claude gemini chatgpt perplexity ai-evaluation

Created 2020-05-20

58 commits to master branch, last one 2 months ago

kereva-scanner kereva-dev

3

57

apache-2.0

2

Code scanner to check for issues in prompts and LLM calls

ai cli llm linter security evaluation ai-security red-teaming llm-security ai-evaluation code-scanning hallucination ai-code-review ai-performance ai-red-teaming llm-evaluation llm-performance owasp-llm-top-10 prompt-injection

Created 2025-03-14

51 commits to master branch, last one 9 days ago

deception lechmazur

2

26

unknown

2

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...

llm nlp gpt4o llama claude gemini mistral ai-safety ai-security ai-benchmarks ai-evaluation disinformation language-model llm-benchmarking machine-learning model-evaluation

Created 2024-10-22

12 commits to master branch, last one 26 days ago