25 results found Sort:

215
3.4k
apache-2.0
27
🐢 Open-Source Evaluation & Testing for LLMs and ML models
Created 2022-03-06
9,675 commits to main branch, last one a day ago
104
1.2k
apache-2.0
17
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Created 2023-05-15
110 commits to main branch, last one about a month ago
26
385
apache-2.0
11
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
Created 2023-10-23
137 commits to main branch, last one 6 months ago
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Sa...
Created 2022-10-25
2 commits to main branch, last one about a year ago
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Created 2023-01-13
15 commits to main branch, last one 3 months ago
Aligning AI With Shared Human Values (ICLR 2021)
Created 2020-08-06
25 commits to master branch, last one about a year ago
14
197
apache-2.0
3
RuLES: a benchmark for evaluating rule-following in language models
Created 2023-11-03
22 commits to main branch, last one about a month ago
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Created 2023-02-27
11 commits to main branch, last one about a year ago
Code accompanying the paper Pretraining Language Models with Human Preferences
Created 2023-02-20
5 commits to master branch, last one 3 months ago
📚 A curated list of papers & technical articles on AI Quality & Safety
Created 2023-04-19
28 commits to main branch, last one about a year ago
10
111
apache-2.0
3
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
Created 2023-04-29
11 commits to main branch, last one 3 months ago
7
90
apache-2.0
4
A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
Created 2023-09-26
13 commits to main branch, last one 2 months ago
Safety Score for Pre-Trained Language Models
Created 2022-07-02
31 commits to main branch, last one 7 months ago
Attack to induce LLMs within hallucinations
Created 2023-09-29
22 commits to master branch, last one 7 months ago
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
Created 2023-06-14
3 commits to main branch, last one 10 months ago
Feature Space Singularity for Out-of-Distribution Detection. (SafeAI 2021)
Created 2020-08-05
15 commits to master branch, last one 3 years ago
A project to add scalable state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! Perform efficient inferences (i.e., do not increase inference tim...
Created 2019-08-16
50 commits to master branch, last one about a year ago
4
68
bsd-2-clause
1
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
Created 2023-10-08
31 commits to main branch, last one 9 days ago
10
66
apache-2.0
5
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
Created 2020-10-06
27 commits to master branch, last one about a year ago
A curated list of awesome resources for getting-started-with and staying-in-touch-with Artificial Intelligence Alignment research.
Created 2018-11-16
19 commits to master branch, last one 10 months ago
A project to improve out-of-distribution detection (open set recognition) and uncertainty estimation by changing a few lines of code in your project! Perform efficient inferences (i.e., do not increas...
Created 2022-05-10
40 commits to master branch, last one about a year ago
Sparse probing paper full code.
Created 2023-05-02
5 commits to main branch, last one 5 months ago
AI Safety Q&A web frontend
Created 2022-02-17
1,059 commits to master branch, last one 6 days ago