9 results found

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Sa...
Created 2022-10-25
2 commits to main branch, last one about a year ago

How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Created 2023-02-27
11 commits to main branch, last one about a year ago

Code accompanying the paper "Pretraining Language Models with Human Preferences"
Created 2023-02-20
5 commits to master branch, last one 4 months ago

📚 A curated list of papers & technical articles on AI Quality & Safety
Created 2023-04-19
28 commits to main branch, last one about a year ago

A curated list of awesome resources for getting started with and staying in touch with Artificial Intelligence Alignment research.
Created 2018-11-16
19 commits to master branch, last one 11 months ago

Full code for the sparse probing paper.
Created 2023-05-02
5 commits to main branch, last one 6 months ago

Directional Preference Alignment
Created 2024-02-27
10 commits to main branch, last one about a month ago