2 results found

General-purpose activation steering library
Created 2024-08-23
35 commits to main branch, last one 7 days ago
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Created 2024-06-13
4 commits to main branch, last one 5 months ago