3 results found

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
apache-2.0 license
Created 2023-05-15
111 commits to main branch, last one 4 months ago
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
Created 2023-06-14
3 commits to main branch, last one about a year ago