Search Results - RepositoryStats

LLMs-Planning karthikv792

35

329

mit

6

An extensible benchmark for evaluating large language models on planning

llms pddl planning llms-planning llms-reasoning benchmark-suite llms-benchmarking

Created 2022-05-28

34 commits to main branch, last one 4 days ago

awesome-web-agents steel-dev

13

179

other

4

🔥 A list of tools, frameworks, and resources for building AI web agents

ai llms ai-agents llms-benchmarking browser-automation

Created 2025-03-06

13 commits to main branch, last one 12 days ago

LLMStats JonathanChavezTamales

13

157

other

6

A comprehensive set of LLM benchmark scores and provider prices.

llm llmops llm-agents llm-evaluation llms-benchmarking

Created 2024-09-07

87 commits to main branch, last one 16 days ago

ChemLLMBench ChemFoundationModels

6

139

unknown

5

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

llm nlp benchmark chemistry ai4science llms-benchmarking large-language-models

Created 2023-05-21

62 commits to main branch, last one 7 months ago

BackdoorLLM bboylyg

8

120

mit

2

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

llms backdoor llms-benchmarking

Created 2024-08-21

88 commits to main branch, last one 27 days ago

MMGenBench lerogo

5

119

unknown

3

Official repository of MMGenBench

mllm mmgenbench llms-benchmarking

Created 2024-11-18

8 commits to main branch, last one 12 days ago

parea-sdk-py parea-ai

6

77

apache-2.0

1

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

llm llmops metrics llm-eval llm-tools generative-ai llm-evaluation good-first-issue llms-benchmarking prompt-engineering llm-evaluation-toolkit llm-evaluation-framework

Created 2023-07-24

1,092 commits to main branch, last one about a month ago

chembench lamalab-org

8

65

mit

2

How good are LLMs at chemistry?

llm llms safety benchmark chemistry machine-learning llms-benchmarking materials-science

Created 2023-05-16

1,124 commits to dev branch, last one 7 days ago

XMainframe FSoft-AI4Code

5

50

apache-2.0

4

Language Model for Mainframe Modernization

cobol codellm mainframe migration llms-benchmarking code-summarization

Created 2024-08-02

30 commits to main branch, last one 6 months ago

nyt-connections lechmazur

3

44

unknown

7

Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words

llm gpt-4o gpt-4-5 puzzles testing benchmark reasoning sonnet3-7 evaluation llms-benchmarking

Created 2024-10-15

38 commits to master branch, last one a day ago

generalization lechmazur

1

41

unknown

3

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item t...

llm llms gpt-4-5 benchmark sonnet3-7 evaluation generalization llms-benchmarking

Created 2025-01-14

29 commits to main branch, last one a day ago

CompBench RaptorMai

2

35

other

1

CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, st...