Search Results - RepositoryStats

22

140

unknown

3

Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter updates.

llm llms sklearn regression llm-inference llm-benchmarking linear-regression regression-models large-language-models

Created 2024-01-16

62 commits to main branch, last one 7 months ago

confabulations lechmazur

4

125

unknown

7

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

o1 llm rag llama claude gemini o3-mini benchmark gemini-pro deepseek-r1 leaderboard ai-evaluation confabulations hallucinations language-model llm-benchmarking

Created 2024-10-10

79 commits to master branch, last one 7 days ago

LLMEvaluation alopatenko

9

115

unknown

7

A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessmen...

llm evaluation llm-evaluation llm-benchmarking generative-ai-benchmarking

Created 2024-04-02

394 commits to main branch, last one 24 hours ago

pint-benchmark lakeraai

11

102

mit

4

A benchmark for prompt injection detection systems.

llm benchmark llm-security llm-benchmarking prompt-injection

Created 2024-03-27

60 commits to main branch, last one 2 days ago

LLM-Research asimsinan

8

53

unknown

1

A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks.

llm llms llm-tools llm-theses arxiv-papers llm-datasets llm-research llm-frameworks llm-benchmarking buyuk-dil-modelleri large-language-models

Created 2024-01-18

84 commits to main branch, last one 6 months ago

MJ-Bench MJ-Bench

5

43

mit

1

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

reward-models llm-as-a-judge llm-benchmarking multimodal-judge multimodal-foundation-model

Created 2024-06-11

32 commits to main branch, last one 2 months ago

ORQA nl4opt

0

36

unknown

1

[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in a specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate ...

llm ai4or llm4or llm4opt aaai2025 llm4math multi-choice optimization llm-reasoning llm-benchmarking linear-programming operations-research mathematical-modelling mixed-integer-programming

Created 2024-12-21

18 commits to main branch, last one about a month ago

LLM-KG-Bench AKSW

5

34

mpl-2.0

26

LLM-KG-Bench is a framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.

llm rdf sparql knowledge-graph llm-benchmarking large-language-models

Created 2023-05-24

524 commits to main branch, last one 29 days ago

VADAR damianomarsili

1

29

other

3

Program synthesis for 3D spatial reasoning

3d llms llm-benchmarking program-synthesis spatial-reasoning

Created 2024-12-10

5 commits to main branch, last one 2 months ago

deception lechmazur

2

26

unknown

2

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...

llm nlp gpt4o llama claude gemini mistral ai-safety ai-security ai-benchmarks ai-evaluation disinformation language-model llm-benchmarking machine-learning model-evaluation

Created 2024-10-22

12 commits to master branch, last one about a month ago