17 results found

An extensible benchmark for evaluating large language models on planning
Created 2022-05-28
34 commits to main branch, last one 4 days ago
🔥 A list of tools, frameworks, and resources for building AI web agents
Created 2025-03-06
13 commits to main branch, last one 12 days ago
A comprehensive set of LLM benchmark scores and provider prices.
Created 2024-09-07
87 commits to main branch, last one 16 days ago
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
Created 2023-05-21
62 commits to main branch, last one 7 months ago
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
Created 2024-08-21
88 commits to main branch, last one 27 days ago
5
119
unknown
3
Official repository of MMGenBench
Created 2024-11-18
8 commits to main branch, last one 12 days ago
6
77
apache-2.0
1
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Created 2023-07-24
1,092 commits to main branch, last one about a month ago
How good are LLMs at chemistry?
Created 2023-05-16
1,124 commits to dev branch, last one 7 days ago
Language Model for Mainframe Modernization
Created 2024-08-02
30 commits to main branch, last one 6 months ago
Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words
Created 2024-10-15
38 commits to master branch, last one a day ago
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item t...
Created 2025-01-14
29 commits to main branch, last one a day ago
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, st...
Created 2024-07-23
4 commits to main branch, last one 7 months ago
Develop reliable AI apps
Created 2024-11-25
51 commits to main branch, last one 8 days ago
Training and Benchmarking LLMs for Code Preference.
Created 2024-10-22
10 commits to main branch, last one 4 months ago
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Created 2023-08-02
6 commits to main branch, last one about a year ago
This repository has no description...
Created 2023-08-04
17 commits to main branch, last one 6 days ago
2
27
unknown
1
Restore safety in fine-tuned language models through task arithmetic
Created 2024-02-17
83 commits to main branch, last one 11 months ago