11 results found

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Sa...
Created 2022-10-25
2 commits to main branch, last one 2 years ago
Code accompanying the paper Pretraining Language Models with Human Preferences
Created 2023-02-20
5 commits to master branch, last one 10 months ago
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Created 2023-02-27
11 commits to main branch, last one about a year ago
📚 A curated list of papers & technical articles on AI Quality & Safety
Created 2023-04-19
28 commits to main branch, last one about a year ago
[AAAI'25 Oral] "MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector".
Created 2023-12-20
1 commit to main branch, last one 7 days ago
A curated list of awesome resources for Artificial Intelligence Alignment research
Created 2018-11-16
19 commits to master branch, last one about a year ago
A curated list of awesome academic research, books, code of ethics, data sets, institutes, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustwor...
Created 2021-09-05
296 commits to main branch, last one 2 days ago
Sparse probing paper full code.
Created 2023-05-02
5 commits to main branch, last one about a year ago
Directional Preference Alignment
Created 2024-02-27
11 commits to main branch, last one 2 months ago