17 results found
An extensible benchmark for evaluating large language models on planning
Created
2022-05-28
34 commits to main branch, last one 4 days ago
🔥 A list of tools, frameworks, and resources for building AI web agents
Created
2025-03-06
13 commits to main branch, last one 12 days ago
A comprehensive set of LLM benchmark scores and provider prices.
Created
2024-09-07
87 commits to main branch, last one 16 days ago
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
Created
2023-05-21
62 commits to main branch, last one 7 months ago
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
Created
2024-08-21
88 commits to main branch, last one 27 days ago
Official repository of MMGenBench
Created
2024-11-18
8 commits to main branch, last one 12 days ago
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Created
2023-07-24
1,092 commits to main branch, last one about a month ago
How good are LLMs at chemistry?
Created
2023-05-16
1,124 commits to dev branch, last one 7 days ago
Language Model for Mainframe Modernization
Created
2024-08-02
30 commits to main branch, last one 6 months ago
Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words
Created
2024-10-15
38 commits to master branch, last one a day ago
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item t...
Created
2025-01-14
29 commits to main branch, last one a day ago
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, st...
Created
2024-07-23
4 commits to main branch, last one 7 months ago
Develop reliable AI apps
Created
2024-11-25
51 commits to main branch, last one 8 days ago
Training and Benchmarking LLMs for Code Preference.
Created
2024-10-22
10 commits to main branch, last one 4 months ago
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Created
2023-08-02
6 commits to main branch, last one about a year ago
This repository has no description...
Created
2023-08-04
17 commits to main branch, last one 6 days ago
Restore safety in fine-tuned language models through task arithmetic
Created
2024-02-17
83 commits to main branch, last one 11 months ago