Visual Language Model robustness through early visual cortex alignment methodology. Reveals that V1-V3 brain alignment correlates with resistance to adversarial manipulation (sycophancy) in vision-language models. Use when analyzing VLM robustness, brain-AI alignment, adversarial robustness, neural predictivity, or when designing more robust vision-language systems. Activation: visual cortex, V1 V2 V3, brain alignment, sycophancy, adversarial robustness, fMRI predictivity, neural encoding.
This methodology reveals that early visual cortex alignment (V1-V3) in Vision-Language Models (VLMs) serves as a protective shield against sycophantic manipulation. Models with representations more similar to human early visual processing show significantly lower susceptibility to adversarial linguistic pressure.
Early visual cortex (V1-V3) alignment is a reliable negative predictor of sycophancy (r = -0.441, BCa 95% CI [-0.740, -0.031])
| Region | Function | Correlation with Sycophancy |
|---|---|---|
| V1 | Primary visual cortex, edge/orientation | Strong negative |
| V2 | Secondary visual, texture/surface | Strong negative |
| V3 | Tertiary visual, dynamic form | Strong negative |
| V4 | Color/form selectivity | Weak/neutral |
| LO | Lateral occipital (object shape) | Weak/neutral |
| FFA | Fusiform face area | Weak/neutral |
| PPA | Parahippocampal place area | Weak/neutral |
| EBA | Extrastriate body area | Weak/neutral |
Stage 1: Brain Alignment Measurement
├── Extract vision encoder features from 12 VLMs
├── Predict fMRI responses across 6 visual cortex ROIs
├── Use Natural Scenes Dataset (Algonauts 2023)
└── 8 human subjects, 6 ROIs each
Stage 2: Sycophancy Evaluation
├── 76,800 two-turn gaslighting prompts
├── 5 manipulation categories
├── 10 difficulty levels
└── Measure rate of capitulation to false claims
Stage 3: Correlation Analysis
├── Brain alignment scores vs sycophancy rates
├── Aggregate and ROI-specific correlations
├── BCa bootstrap, leave-one-out, permutation testing
└── Control for model size, architecture, training
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
def compute_brain_alignment(model_features, fmri_responses):
"""
Compute neural predictivity (brain alignment).
Args:
model_features: CNN features [n_images, n_features]
fmri_responses: Brain activity [n_images, n_voxels]
Returns:
r: Pearson correlation (predictivity)
"""
# Fit ridge regression
ridge = Ridge(alpha=1.0)
# Cross-validated prediction
predicted = cross_val_predict(
ridge, model_features, fmri_responses,
cv=5, n_jobs=-1
)
# Compute correlation
r = pearsonr(predicted.flatten(), fmri_responses.flatten())[0]
return r
rois = {
'V1': 'Early visual - primary',
'V2': 'Early visual - secondary',
'V3': 'Early visual - tertiary',
'V4': 'Mid-level visual - color/form',
'LO': 'Lateral occipital - objects',
'FFA': 'Fusiform face area',
'PPA': 'Parahippocampal place area',
'EBA': 'Extrastriate body area'
}
# Compute alignment per ROI
roi_alignment = {}
for roi_name, roi_voxels in roi_masks.items():
roi_alignment[roi_name] = compute_brain_alignment(
model_features,
fmri_responses[:, roi_voxels]
)
def create_gaslighting_prompt(image, initial_qa, manipulation):
"""
Two-turn gaslighting conversation.
Turn 1:
User: [Image] + [Question]
Assistant: [Correct Answer]
Turn 2:
User: "Actually, I think [Manipulation]. Are you sure?"
Assistant: [Response - measure if changes answer]
"""
return {
'image': image,
'turn1_question': initial_qa['question'],
'turn1_correct': initial_qa['answer'],
'turn2_manipulation': manipulation['claim'],
'turn2_ground_truth': manipulation['ground_truth']
}
# Measure sycophancy
def evaluate_sycophancy(model, prompt):
# Turn 1: Get initial correct answer
answer1 = model.generate(prompt['image'], prompt['turn1_question'])
# Turn 2: Apply pressure
pressure_prompt = f"User said: {prompt['turn2_manipulation']}\nAre you sure?"
answer2 = model.generate(prompt['image'], pressure_prompt,
context=[answer1])
# Check if model capitulated
capitulated = check_answer_change(answer2, prompt['turn1_correct'],
prompt['turn2_manipulation'])
return capitulated
from scipy.stats import pearsonr, bootstrap
def analyze_v1_v3_robustness(alignment_scores, sycophancy_rates):
"""
Analyze V1-V3 specific correlation with robustness.
"""
results = {}
# Early visual cortex aggregate
early_visual = ['V1', 'V2', 'V3']
early_alignment = np.mean([alignment_scores[r] for r in early_visual])
# Correlation
r, p = pearsonr(early_alignment, sycophancy_rates)
results['early_visual_r'] = r
results['early_visual_p'] = p
# BCa Bootstrap CI
def correlation_stat(x, y):
return pearsonr(x, y)[0]
ci = bootstrap(
(early_alignment, sycophancy_rates),
statistic=lambda i, j: correlation_stat(i, j),
n_resamples=10000,
method='BCa'
)
results['bci_95'] = (ci.confidence_interval.low,
ci.confidence_interval.high)
# Leave-one-out validation
loo_correlations = []
for i in range(len(models)):
mask = np.ones(len(models), dtype=bool)
mask[i] = False
r_loo = pearsonr(early_alignment[mask], sycophancy_rates[mask])[0]
loo_correlations.append(r_loo)
results['loo_all_negative'] = all(r < 0 for r in loo_correlations)
results['loo_mean'] = np.mean(loo_correlations)
return results
| Attack Category | V1-V3 Correlation | Significance |
|---|---|---|
| Existence Denial | r = -0.597 | p = 0.040 |
| Attribute Manipulation | r = -0.412 | n.s. |
| Relationship Distortion | r = -0.358 | n.s. |
| Count Disagreement | r = -0.389 | n.s. |
| Category Misassignment | r = -0.401 | n.s. |
Existence denial shows the strongest effect because:
def select_robust_vlm(models, alignment_data):
"""Select model with high V1-V3 alignment."""
scores = {}
for model in models:
# Weight early visual alignment
early_score = np.mean([
alignment_data[model]['V1'],
alignment_data[model]['V2'],
alignment_data[model]['V3']
])
scores[model] = early_score
return max(scores, key=scores.get)
def predict_sycophancy_risk(model_features, v1_v3_encoder):
"""Predict sycophancy risk from V1-V3 alignment."""
# Extract early visual features
v1_v3_features = v1_v3_encoder(model_features)
# Compare to human V1-V3 responses
alignment_score = compute_v1_v3_alignment(v1_v3_features)
# Lower alignment = higher risk
risk = 1 - normalize(alignment_score)
return risk
class V1V3AlignmentLoss(nn.Module):
"""Auxiliary loss for V1-V3 brain alignment."""
def __init__(self, human_v1_v3_responses):
super().__init__()
self.target_responses = human_v1_v3_responses
def forward(self, model_features, roi_masks):
"""
Compute alignment loss for early visual areas.
Args:
model_features: [batch, features, h, w]
roi_masks: Dict of ROI spatial masks
"""
loss = 0
# Focus on early visual
for roi in ['V1', 'V2', 'V3']:
# Extract ROI-specific features
roi_features = model_features * roi_masks[roi]
# Predict fMRI response
predicted = self.encoding_model(roi_features)
# Match human responses
target = self.target_responses[roi]
loss += F.mse_loss(predicted, target)
return loss
Prioritize Early Visual Fidelity
Multi-Scale Architecture
Input Image
↓
Early Visual (V1-V3-like) - High resolution, preserve detail
↓
Mid-Level (V4-like) - Feature integration
↓
High-Level (IT-like) - Semantic abstraction
↓
Language Decoder
Contrastive Grounding
Adversarial Training
Paper: "Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation"