35 results found

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
1.4k stars · apache-2.0 · Created 2023-05-15 · 111 commits to main branch, last one 6 months ago
Secrets of RLHF in Large Language Models Part I: PPO
1.3k stars · apache-2.0 · Created 2023-07-05 · 47 commits to main branch, last one 9 months ago
Open-source LLM toolkit to build trustworthy LLM applications: TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
390 stars · apache-2.0 · Created 2023-10-23 · 137 commits to main branch, last one about a year ago
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop
Created 2022-10-25 · 2 commits to main branch, last one 2 years ago
Aligning AI With Shared Human Values (ICLR 2021)
Created 2020-08-06 · 25 commits to master branch, last one about a year ago
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Created 2023-01-13 · 16 commits to main branch, last one 5 months ago
RuLES: a benchmark for evaluating rule-following in language models
214 stars · apache-2.0 · Created 2023-11-03 · 32 commits to main branch, last one 27 days ago
Code accompanying the paper Pretraining Language Models with Human Preferences
Created 2023-02-20 · 5 commits to master branch, last one 10 months ago
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Created 2023-02-27 · 11 commits to main branch, last one about a year ago
📚 A curated list of papers & technical articles on AI Quality & Safety
Created 2023-04-19 · 28 commits to main branch, last one about a year ago
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
165 stars · apache-2.0 · Created 2023-04-29 · 16 commits to main branch, last one 2 months ago
Toolkits to create a human-in-the-loop approval layer to monitor and guide AI agent workflows in real time.
Created 2024-10-13 · 146 commits to main branch, last one 25 days ago
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
123 stars · apache-2.0 · Created 2023-09-26 · 13 commits to main branch, last one 9 months ago
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
Created 2023-06-14 · 3 commits to main branch, last one about a year ago
An attack that induces hallucinations in LLMs
Created 2023-09-29 · 22 commits to master branch, last one about a year ago
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
108 stars · apache-2.0 · Created 2023-12-04 · 17 commits to main branch, last one 2 months ago
Safety Score for Pre-Trained Language Models
Created 2022-07-02 · 31 commits to main branch, last one about a year ago
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Created 2024-09-20 · 183 commits to main branch, last one 20 hours ago
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
85 stars · bsd-2-clause · Created 2023-10-08 · 31 commits to main branch, last one 7 months ago
[SafeAI'21] Feature Space Singularity for Out-of-Distribution Detection.
Created 2020-08-05 · 15 commits to master branch, last one 3 years ago
A project to add scalable state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! Perform efficient inferences (i.e., do not increase inference time).
Created 2019-08-16 · 50 commits to master branch, last one 2 years ago
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
67 stars · apache-2.0 · Created 2020-10-06 · 27 commits to master branch, last one 2 years ago
A curated list of awesome resources for Artificial Intelligence Alignment research
Created 2018-11-16 · 19 commits to master branch, last one about a year ago
A curated list of awesome academic research, books, codes of ethics, data sets, institutes, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustworthy AI
Created 2021-09-05 · 296 commits to main branch, last one a day ago
Full code for the sparse probing paper.
Created 2023-05-02 · 5 commits to main branch, last one about a year ago
[AAAI 2025] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
50 stars · apache-2.0 · Created 2024-11-25 · 12 commits to main branch, last one 4 days ago