4 results found

An easy-to-use Python framework to generate adversarial jailbreak prompts.
Created 2024-01-31
94 commits to master branch, last one 4 days ago

Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Created 2024-04-06
24 commits to master branch, last one 6 months ago

Restore safety in fine-tuned language models through task arithmetic
Created 2024-02-17
83 commits to main branch, last one about a year ago

[ICLR 2025] Official implementation for "SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations"
Created 2024-12-08
11 commits to main branch, last one about a month ago