Search Results - RepositoryStats

openai-api-rs dongri

69

376

mit

9

OpenAI API client library for Rust (unofficial)

o1 api rust gpt-4 gpt-4o openai gpt-4-5 deepseek realtime openrouter gpt-4o-mini gpt-3-5-turbo

Created 2022-12-12

313 commits to main branch, last one 7 days ago

writing lechmazur

4

137

unknown

5

This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) in a short creative story.

o1 llm llama claude gemini gpt-4-5 deepseek deepseek-r1 claude-3-7-sonnet

Created 2025-01-05

44 commits to main branch, last one 2 days ago

elimination_game lechmazur

3

76

unknown

3

A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.

llm eval game gpt-4-5 o3-mini benchmark deepseek-r1 multi-agent strategy-game claude-3-7-sonnet

Created 2025-02-22

27 commits to main branch, last one 2 days ago

nyt-connections lechmazur

3

44

unknown

7

Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words.

llm gpt-4o gpt-4-5 puzzles testing benchmark reasoning sonnet3-7 evaluation llms-benchmarking

Created 2024-10-15

38 commits to master branch, last one 2 days ago

step_game lechmazur

3

42

unknown

2

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a mo...

o1 llm eval game gpt-4o gpt-4-5 o3-mini deepseek benchmark sonnet3-7 evaluation deepseek-r1 multi-agent

Created 2025-01-21

36 commits to main branch, last one 2 days ago

generalization lechmazur

1

41

unknown

3

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item t...

llm llms gpt-4-5 benchmark sonnet3-7 evaluation generalization llms-benchmarking

Created 2025-01-14

29 commits to main branch, last one 2 days ago