3 results found
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Created 2023-05-15
111 commits to main branch, last one 4 months ago
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
Created 2023-06-14
3 commits to main branch, last one about a year ago
Reading list on the adversarial perspective and robustness in deep reinforcement learning.
Topics: ai-safety, safe-rlhf, ai-alignment, responsible-ai, adversarial-policies, machine-learning-safety, robust-machine-learning, deep-reinforcement-learning, meta-reinforcement-learning, safe-reinforcement-learning, adversarial-machine-learning, explainable-machine-learning, reinforcement-learning-safety, robust-reinforcement-learning, reinforcement-learning-alignment, artificial-intelligence-alignment, multiagent-reinforcement-learning, adversarial-reinforcement-learning, robust-deep-reinforcement-learning
Created 2023-09-08
15 commits to main branch, last one 4 months ago