17 results found

82 · 730 · apache-2.0 · 8
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Created 2023-02-06
687 commits to main branch, last one 2 days ago
This repository collects all relevant resources about interpretability in LLMs
Created 2024-06-30
56 commits to main branch, last one 5 months ago
Decomposing and Editing Predictions by Modeling Model Computation
Created 2024-04-17
12 commits to main branch, last one 10 months ago
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Created 2024-03-19
577 commits to main branch, last one about a month ago
Steering vectors for transformer language models in PyTorch / Hugging Face
Created 2024-01-18
68 commits to main branch, last one about a month ago
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform ...
Created 2024-01-13
5 commits to main branch, last one about a year ago
Interpreting how transformers simulate agents performing RL tasks
Created 2022-12-17
725 commits to main branch, last one about a year ago
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Created 2024-02-16
24 commits to main branch, last one about a year ago
🧠 Starter templates for doing interpretability research
Created 2022-10-31
17 commits to main branch, last one about a year ago
5 · 63 · apache-2.0 · 3
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Created 2024-08-07
399 commits to main branch, last one 13 days ago
Sparse and discrete interpretability tool for neural networks
Created 2022-12-13
20 commits to main branch, last one about a year ago
Full code for the sparse probing paper.
Created 2023-05-02
5 commits to main branch, last one about a year ago
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Created 2023-10-10
309 commits to main branch, last one 4 months ago
Universal Neurons in GPT2 Language Models
Created 2023-12-26
5 commits to main branch, last one 10 months ago
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
Created 2024-07-10
11 commits to main branch, last one 8 months ago
4 · 25 · unknown · 1
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Created 2023-08-16
272 commits to main branch, last one about a year ago