9 results found

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Created 2023-02-06
540 commits to main branch, last one 17 days ago
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform ...
Created 2024-01-13
5 commits to main branch, last one 4 months ago
Interpreting how transformers simulate agents performing RL tasks
Created 2022-12-17
725 commits to main branch, last one 8 months ago
🧠 Starter templates for doing interpretability research
Created 2022-10-31
17 commits to main branch, last one 11 months ago
Sparse and discrete interpretability tool for neural networks
Created 2022-12-13
20 commits to main branch, last one 4 months ago
Full code for the sparse probing paper.
Created 2023-05-02
5 commits to main branch, last one 6 months ago
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Created 2024-02-16
24 commits to main branch, last one 3 months ago
Steering vectors for transformer language models in PyTorch / Hugging Face
Created 2024-01-18
53 commits to main branch, last one 2 months ago