2 results found
General-purpose activation steering library
Created: 2024-08-23
35 commits to main branch, last one 7 days ago
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
Created: 2024-06-13
4 commits to main branch, last one 5 months ago