Search Results - RepositoryStats

pyvene stanfordnlp

82

730

apache-2.0

8

Stanford NLP Python library for understanding and improving PyTorch models via interventions

intervention interpretability activation-patching activation-intervention mechanistic-interpretability

Created 2023-02-06

687 commits to main branch, last one 2 days ago

Awesome-Interpretability-in-Large-Language-Models ruizheliUOA

21

335

cc0-1.0

6

This repository collects all relevant resources about interpretability in LLMs

sparse-autoencoder dictionary-learning mechanistic-interpretability interpretability-and-explainability

Created 2024-06-30

56 commits to main branch, last one 5 months ago

modelcomponents MadryLab

8

138

mit

3

Decomposing and Editing Predictions by Modeling Model Computation

pytorch attribution model-editing interpretability mechanistic-interpretability

Created 2024-04-17

12 commits to main branch, last one 10 months ago

Language-Model-SAEs OpenMOSS

12

106

unknown

5

For the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.

interpretability sparse-dictionary sparse-autoencoders mechanistic-interpretability

Created 2024-03-19

577 commits to main branch, last one about a month ago

steering-vectors steering-vectors

8

94

mit

1

Steering vectors for transformer language models in PyTorch / Hugging Face

ai gpt nlp pytorch huggingface representation-engineering mechanistic-interpretability

Created 2024-01-18

68 commits to main branch, last one about a month ago

deepdistilling pauljblazek

7

85

other

3

Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform ...

distilling interpretable neurosymbolic explainable-ai domain-adaptation program-synthesis model-distillation knowledge-distillation inductive-logic-programming mechanistic-interpretability out-of-distribution-generalization

Created 2024-01-13

5 commits to main branch, last one about a year ago

DecisionTransformerInterpretability jbloomAus

20

79

mit

3

Interpreting how transformers simulate agents performing RL tasks

reinforcement-learning mechanistic-interpretability

Created 2022-12-17

725 commits to main branch, last one about a year ago

llm-latent-language epfl-dlab

16

73

unknown

3

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".

llm llama2 multilingual-nlp mechanistic-interpretability

Created 2024-02-16

24 commits to main branch, last one about a year ago

interpretability-starter apartresearch

2

69

unknown

0

🧠 Starter templates for doing interpretability research

alignment-jam interpretability interpretability-jam mechanistic-interpretability

Created 2022-10-31

17 commits to main branch, last one about a year ago

axbench stanfordnlp

5

63

apache-2.0

3

Stanford NLP Python library for benchmarking the utility of LLM interpretability methods

intervention llm-steering interpretability large-language-models mechanistic-interpretability

Created 2024-08-07

399 commits to main branch, last one 13 days ago

codebook-features taufeeque9

3

62

mit

4

Sparse and discrete interpretability tool for neural networks

codebook features transformers language-model interpretability mechanistic-interpretability

Created 2022-12-13

20 commits to main branch, last one about a year ago

sparse-probing-paper wesg52

11

55

mit

2

Full code for the sparse probing paper.

ai-safety ai-alignment interpretability mechanistic-interpretability

Created 2023-05-02

5 commits to main branch, last one about a year ago

automated-brain-explanations microsoft

6

51

mit

6

Generating and validating natural-language explanations for the brain.

gpt xai fmri gpt4 explanation huggingface data-science neuroscience ai-for-science language-model interpretability machine-learning fmri-data-analysis large-language-models artificial-intelligence interpretable-embeddings automated-interpretability natural-language-processing mechanistic-interpretability

Created 2023-01-30

418 commits to main branch, last one 14 days ago

causalgym aryamanarora

6

41

unknown

1

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

benchmark causality syntaxgym interpretability mechanistic-interpretability

Created 2023-10-10

309 commits to main branch, last one 4 months ago

universal-neurons wesg52

6

27

mit

3

Universal Neurons in GPT2 Language Models

llm ai-safety interpretability mechanistic-interpretability

Created 2023-12-26

5 commits to main branch, last one 10 months ago

arrakis yash-srivastava19

1

26

unknown

1

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

garcon anthropic transformer explainable-ai transformerlens mechanistic-interpretability

Created 2024-07-10

11 commits to main branch, last one 8 months ago

finetuning Nix07

4

25

unknown

1

This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".

finetuning entity-tracking science-of-deep-learning mechanistic-interpretability

Created 2023-08-16

272 commits to main branch, last one about a year ago