2 results found

230
3.4k
MIT
18
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with comman...
Created 2023-04-28
1,361 commits to main branch, last one 20 hours ago
LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios systematically.
Created 2023-10-22
38 commits to main branch, last one 3 months ago