Computational lesion analysis using multilingual LLMs to probe brain alignment. Method for dissecting shared vs language-specific neural representations in LLMs via targeted ablation. Activation: computational lesions, multilingual LLM brain alignment, language-specific neural representations, lesion analysis LLM, shared representations, LLM neuroscience, brain alignment, ablation LLM
This methodology uses computational lesion analysis to probe how multilingual Large Language Models (LLMs) represent language in their internal activations, and how these representations align with human brain activity measured via fMRI.
By selectively ablating (zeroing out) specific layers, attention heads, or neurons in multilingual LLMs, researchers can determine whether the model's internal representations are shared across languages or specific to individual languages.
This approach bridges the gap between artificial language processing and human neurobiology, revealing fundamental principles about how multilingual processing works.
The approach is adapted from neuropsychology, where brain lesions reveal the mapping between cognitive functions and brain locations.
Key insight: By systematically ablating different parts of an LLM, you can map which components are responsible for which functions, just as brain lesion studies map cognitive functions to brain regions.
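The lesion-mapping logic can be illustrated on a toy model before touching any LLM: fit a regression, zero out one group of input features at a time, and see which ablation hurts performance. This is a minimal synthetic sketch; the feature groups labeled "region A" and "region B" are hypothetical stand-ins for model components.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# The target depends only on features 0-2 ("region A"); features 3-9 are inert
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

model = LinearRegression().fit(X, y)
baseline = model.score(X, y)

# Lesion each feature group in turn and measure the performance drop
drops = {}
for name, idx in [("region A", slice(0, 3)), ("region B", slice(3, 10))]:
    X_lesioned = X.copy()
    X_lesioned[:, idx] = 0.0            # zero-ablate the "region"
    drops[name] = baseline - model.score(X_lesioned, y)
    print(f"{name}: performance drop = {drops[name]:.3f}")
```

Only the ablation of "region A" causes a large drop, correctly localizing the function to the features that carry it.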
Multilingual LLMs may process different languages in fundamentally different ways:
| Representation Type | Description | Evidence |
|---|---|---|
| Shared | Common internal code across languages | Same neurons activate for equivalent concepts in EN/FR/ZH |
| Language-Specific | Unique processing per language | Different neural pathways for different languages |
Key Finding: Some LLMs show separable alignment patterns, with some components aligning with brain activity consistently across languages and others aligning for only particular languages.
The methodology uses fMRI data to validate LLM internal representations:
Workflow:
1. Present same stimuli in multiple languages to human subjects (fMRI)
2. Feed same stimuli through multilingual LLM
3. Extract activations from each layer
4. Train encoding model: LLM activations → brain activity
5. Compare alignment scores across languages
6. Perform computational lesions to test necessity
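The six workflow steps above can be sketched end to end with synthetic stand-ins (random "activations" and simulated voxel responses replace the real model and fMRI data; all sizes and names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_stimuli, n_features, n_voxels = 120, 32, 8

# Step 3 stand-in: layer activations for the same stimuli in two languages
acts = {lang: rng.normal(size=(n_stimuli, n_features)) for lang in ("en", "fr")}

# Steps 1-2 stand-in: simulated brain responses driven by a linear code
W = rng.normal(size=(n_features, n_voxels))
brain = {lang: acts[lang] @ W + 0.5 * rng.normal(size=(n_stimuli, n_voxels))
         for lang in ("en", "fr")}

# Steps 4-5: encoding model (activations -> voxels), cross-validated R²
scores = {lang: cross_val_score(Ridge(alpha=1.0), acts[lang], brain[lang], cv=5).mean()
          for lang in ("en", "fr")}
for lang, s in scores.items():
    print(f"{lang}: alignment R² = {s:.2f}")

# Step 6: lesion the activations and confirm alignment collapses
lesioned_score = cross_val_score(Ridge(alpha=1.0),
                                 np.zeros_like(acts["en"]), brain["en"], cv=5).mean()
print(f"lesioned: alignment R² = {lesioned_score:.2f}")
```

With a real model, `acts` would come from the layer-extraction function below and `brain` from preprocessed fMRI responses to the same stimuli.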
```python
import torch
from transformers import AutoModel, AutoTokenizer  # AutoModel covers encoder models; use AutoModelForCausalLM for decoder-only LLMs

def extract_layer_activations(model, tokenizer, texts, layers=None):
    """
    Extract hidden states from specific layers of a multilingual LLM.

    Args:
        model: Pretrained multilingual LLM (e.g., XLM-R, XGLM)
        tokenizer: Matching tokenizer for the model
        texts: List of texts in different languages
        layers: List of layer indices to extract (None = all)

    Returns:
        Dictionary of layer index -> activations tensor
    """
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple: (embedding output, layer 1 output, ..., final layer output)
    all_hidden = outputs.hidden_states
    if layers is None:
        layers = range(len(all_hidden))
    return {i: all_hidden[i] for i in layers}
```
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def compute_brain_alignment(llm_activations, brain_data, alpha=1.0):
    """
    Measure how well LLM activations predict brain activity.

    Args:
        llm_activations: [n_samples, n_features] array from an LLM layer
        brain_data: [n_samples, n_voxels] fMRI data
        alpha: Ridge regularization strength

    Returns:
        alignment_score: Mean cross-validated R² score
    """
    model = Ridge(alpha=alpha)
    # Default scoring for a regressor is R²; 5-fold CV guards against overfitting
    scores = cross_val_score(model, llm_activations, brain_data, cv=5)
    return scores.mean()
```
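As a sanity check on this measure, alignment should be high when stimuli and responses are correctly paired and should collapse when the pairing is permuted. A minimal synthetic example (random activations standing in for a layer):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                                       # stand-in activations
Y = X @ rng.normal(size=(20, 5)) + 0.3 * rng.normal(size=(100, 5))   # simulated voxels

aligned = cross_val_score(Ridge(alpha=1.0), X, Y, cv=5).mean()
# Permuting rows of X breaks the stimulus-response correspondence
broken = cross_val_score(Ridge(alpha=1.0), X[rng.permutation(100)], Y, cv=5).mean()
print(f"true pairing R² = {aligned:.2f}, permuted pairing R² = {broken:.2f}")
```

The permuted score hovering around zero is the chance baseline against which lesion effects are judged.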
```python
def perform_computational_lesion(model, tokenizer, texts,
                                 lesion_layer, lesion_type="zero"):
    """
    Ablate a specific layer's activations and return the degraded set.

    Note: this ablates the extracted representations post hoc, testing whether
    that layer's code is necessary for brain alignment; it does not propagate
    the lesion through later layers. A full forward-pass lesion would require
    forward hooks on the model's layer modules.

    Args:
        lesion_layer: Index of layer to ablate
        lesion_type: "zero" (set to 0), "noise" (add Gaussian noise),
                     "shuffle" (permute activations across samples)

    Returns:
        degraded_activations: List of hidden states after the lesion
    """
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = list(outputs.hidden_states)
    if lesion_type == "zero":
        hidden[lesion_layer] = torch.zeros_like(hidden[lesion_layer])
    elif lesion_type == "noise":
        hidden[lesion_layer] = hidden[lesion_layer] + torch.randn_like(hidden[lesion_layer])
    elif lesion_type == "shuffle":
        # Permute along the batch dimension to break stimulus-activation pairing
        perm = torch.randperm(hidden[lesion_layer].shape[0])
        hidden[lesion_layer] = hidden[lesion_layer][perm]
    return hidden
```
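The three lesion types degrade downstream alignment in characteristically different ways: zeroing and shuffling destroy the stimulus-specific code entirely, while additive noise only attenuates it. A synthetic comparison (numpy arrays standing in for layer activations and voxel data):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
acts = rng.normal(size=(150, 24))                                      # intact activations
brain = acts @ rng.normal(size=(24, 6)) + 0.3 * rng.normal(size=(150, 6))

def alignment(X):
    """Cross-validated R² of an encoding model from X to the simulated voxels."""
    return cross_val_score(Ridge(alpha=1.0), X, brain, cv=5).mean()

results = {
    "intact":  alignment(acts),
    "zero":    alignment(np.zeros_like(acts)),                # code removed
    "noise":   alignment(acts + rng.normal(size=acts.shape)), # code attenuated
    "shuffle": alignment(acts[rng.permutation(len(acts))]),   # pairing destroyed
}
for name, score in results.items():
    print(f"{name:8s} R² = {score:.2f}")
```

Graded lesions like "noise" are useful when a component should be weakened rather than silenced outright.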
```python
def analyze_representation_sharing(alignment_scores, languages):
    """
    Determine whether representations are shared or language-specific.

    Args:
        alignment_scores: Dict of {language: {layer: score}}
        languages: List of language codes

    Returns:
        shared_layers: Layers with consistent alignment across languages
        specific_layers: Layers with language-varying alignment
    """
    all_layers = list(alignment_scores[languages[0]].keys())
    shared_layers = []
    specific_layers = []
    for layer in all_layers:
        scores = [alignment_scores[lang][layer] for lang in languages]
        variance = np.var(scores)
        if variance < 0.01:  # Low cross-language variance -> shared
            shared_layers.append(layer)
        else:  # High variance -> language-specific
            specific_layers.append(layer)
    return shared_layers, specific_layers
```
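A toy run of this variance criterion, using made-up alignment scores for three languages (the values, layer indices, and the 0.01 threshold are illustrative only):

```python
import numpy as np

# Hypothetical cross-validated alignment scores per language and layer
alignment_scores = {
    "en": {0: 0.42, 6: 0.55, 11: 0.31},
    "fr": {0: 0.41, 6: 0.54, 11: 0.12},
    "zh": {0: 0.43, 6: 0.56, 11: 0.05},
}
languages = list(alignment_scores)

labels = {}
for layer in alignment_scores["en"]:
    scores = [alignment_scores[lang][layer] for lang in languages]
    # Low cross-language variance -> shared; high variance -> language-specific
    labels[layer] = "shared" if np.var(scores) < 0.01 else "language-specific"
    print(f"layer {layer:2d}: var = {np.var(scores):.4f} -> {labels[layer]}")
```

Here layers 0 and 6 align similarly for all three languages and are classified as shared, while layer 11's alignment varies widely and is flagged as language-specific. In practice the threshold should be calibrated, for example against a permutation baseline, rather than fixed at 0.01.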