17 results found
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Created
2023-02-06
687 commits to main branch, last one 2 days ago
This repository collects all relevant resources about interpretability in LLMs
Created
2024-06-30
56 commits to main branch, last one 5 months ago
Decomposing and Editing Predictions by Modeling Model Computation
Created
2024-04-17
12 commits to main branch, last one 10 months ago
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Created
2024-03-19
577 commits to main branch, last one about a month ago
Steering vectors for transformer language models in PyTorch / Hugging Face
Created
2024-01-18
68 commits to main branch, last one about a month ago
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform ...
Created
2024-01-13
5 commits to main branch, last one about a year ago
Interpreting how transformers simulate agents performing RL tasks
Created
2022-12-17
725 commits to main branch, last one about a year ago
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Created
2024-02-16
24 commits to main branch, last one about a year ago
🧠 Starter templates for doing interpretability research
Created
2022-10-31
17 commits to main branch, last one about a year ago
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Created
2024-08-07
399 commits to main branch, last one 13 days ago
Sparse and discrete interpretability tool for neural networks
Created
2022-12-13
20 commits to main branch, last one about a year ago
Full code for the sparse probing paper.
Created
2023-05-02
5 commits to main branch, last one about a year ago
Generating and validating natural-language explanations for the brain.
Topics: gpt, xai, fmri, gpt4, explanation, huggingface, data-science, neuroscience, ai-for-science, language-model, interpretability, machine-learning, fmri-data-analysis, large-language-models, artificial-intelligence, interpretable-embeddings, automated-interpretability, natural-language-processing, mechanistic-interpretability
Created
2023-01-30
418 commits to main branch, last one 14 days ago
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Created
2023-10-10
309 commits to main branch, last one 4 months ago
Universal Neurons in GPT2 Language Models
Created
2023-12-26
5 commits to main branch, last one 10 months ago
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
Created
2024-07-10
11 commits to main branch, last one 8 months ago
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Created
2023-08-16
272 commits to main branch, last one about a year ago