Detects AI-generated deepfake audio used in voice phishing (vishing) attacks by extracting spectral features (MFCC, spectral centroid, spectral contrast, zero-crossing rate) and classifying samples with machine learning models. Supports batch analysis of audio files, generates confidence scores, and produces forensic reports. Activates for requests involving deepfake voice detection, vishing investigation, AI-generated speech analysis, voice cloning detection, or audio authenticity verification.
Do not use for text-based phishing (email/SMS); use email header analysis or URL detonation tools instead.
Normalize and prepare audio samples for feature extraction:
```python
import librosa
import numpy as np

# Load audio, resample to 16 kHz mono
y, sr = librosa.load("suspect_call.wav", sr=16000, mono=True)

# Trim silence from beginning and end
y_trimmed, _ = librosa.effects.trim(y, top_db=25)

# Normalize amplitude to [-1, 1] (guard against all-silent input)
peak = np.max(np.abs(y_trimmed))
y_norm = y_trimmed / peak if peak > 0 else y_trimmed
```
Audio preprocessing ensures consistent feature extraction across different recording conditions, microphones, and codec artifacts.
Extract the feature set that distinguishes real from synthetic speech:
Mel-Frequency Cepstral Coefficients (MFCCs):
```python
# Extract 20 MFCCs plus delta and delta-delta coefficients
mfccs = librosa.feature.mfcc(y=y_norm, sr=sr, n_mfcc=20)
mfcc_delta = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)
```
MFCCs capture the spectral envelope of speech, representing how the vocal tract shapes sound. Deepfake audio often shows unnatural smoothness in higher-order MFCCs because neural vocoders approximate but do not perfectly replicate the acoustic resonance of a physical vocal tract.
Spectral Features:
```python
spectral_centroid = librosa.feature.spectral_centroid(y=y_norm, sr=sr)
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y_norm, sr=sr)
spectral_contrast = librosa.feature.spectral_contrast(y=y_norm, sr=sr)
spectral_rolloff = librosa.feature.spectral_rolloff(y=y_norm, sr=sr)
zero_crossing_rate = librosa.feature.zero_crossing_rate(y_norm)
```
Key indicators of deepfake audio:
- Unnaturally low variance in higher-order MFCCs (over-smoothed spectral envelope)
- Reduced spectral contrast, particularly in the 4-8 kHz sub-bands
- Abnormally stable zero-crossing rate with little frame-to-frame variation
- A hard energy ceiling below Nyquist, consistent with a vocoder's frequency limit
Aggregate frame-level features into a fixed-length vector and classify:
```python
import numpy as np
import librosa

def build_feature_vector(y, sr):
    """Aggregate frame-level features into one fixed-length vector."""
    features = []
    # Summary statistics for each MFCC coefficient
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    for coeff in mfccs:
        features.extend([np.mean(coeff), np.std(coeff),
                         np.min(coeff), np.max(coeff)])
    # Spectral features that take (y, sr)
    for feat_fn in (librosa.feature.spectral_centroid,
                    librosa.feature.spectral_bandwidth,
                    librosa.feature.spectral_rolloff):
        feat = feat_fn(y=y, sr=sr)
        features.extend([np.mean(feat), np.std(feat),
                         np.min(feat), np.max(feat)])
    # Zero-crossing rate is time-domain only, so it takes no sr argument
    zcr = librosa.feature.zero_crossing_rate(y)
    features.extend([np.mean(zcr), np.std(zcr), np.min(zcr), np.max(zcr)])
    # Per-band spectral contrast statistics
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    for band in contrast:
        features.extend([np.mean(band), np.std(band)])
    return np.array(features)
```
Classification uses an ensemble approach: Random Forest for robustness and Gradient Boosting for accuracy, with a voting mechanism to reduce false positives.
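The voting ensemble described above can be sketched with scikit-learn's `VotingClassifier`; the hyperparameters here (tree counts, random seeds, 5-fold cross-validation) are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.model_selection import cross_val_score

def build_ensemble():
    # Soft voting averages the two models' class probabilities,
    # mirroring the averaged ensemble score shown in the report format.
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
            ("gbt", GradientBoostingClassifier(random_state=42)),
        ],
        voting="soft",
    )

# Usage sketch, where X holds rows from build_feature_vector and
# y holds labels (0 = genuine, 1 = deepfake):
#   scores = cross_val_score(build_ensemble(), X, y, cv=5)
#   clf = build_ensemble().fit(X, y)
#   prob_fake = clf.predict_proba(X_new)[:, 1]
```

Soft voting keeps the per-model probabilities available, which lets the report show the individual RF and GBT scores alongside the average.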
Examine time-domain artifacts that neural vocoders leave behind:
```python
# Pitch stability analysis -- deepfakes often have unnaturally stable F0
f0, voiced_flag, voiced_probs = librosa.pyin(y_norm, fmin=50, fmax=500, sr=sr)
f0_clean = f0[~np.isnan(f0)]
pitch_std = np.std(f0_clean) if len(f0_clean) > 0 else 0
pitch_jitter = np.mean(np.abs(np.diff(f0_clean))) if len(f0_clean) > 1 else 0
```
Real human speech exhibits natural pitch jitter (micro-variations in fundamental frequency) and shimmer (amplitude perturbations). Deepfake audio generated by Tacotron 2, VALL-E, or ElevenLabs typically shows reduced jitter and shimmer compared to genuine speech.
Generate spectrograms for manual forensic review:
```python
import librosa.display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y_norm, sr=sr))
librosa.display.specshow(mel_db, sr=sr, ax=axes[0, 0], x_axis='time', y_axis='mel')
axes[0, 0].set_title('Mel Spectrogram')
librosa.display.specshow(mfccs, sr=sr, ax=axes[0, 1], x_axis='time')
axes[0, 1].set_title('MFCCs')
stft_db = librosa.amplitude_to_db(np.abs(librosa.stft(y_norm)), ref=np.max)
librosa.display.specshow(stft_db, sr=sr, ax=axes[1, 0], x_axis='time', y_axis='log')
axes[1, 0].set_title('Log-Frequency Spectrogram')
librosa.display.specshow(librosa.feature.spectral_contrast(y=y_norm, sr=sr),
                         sr=sr, ax=axes[1, 1], x_axis='time')
axes[1, 1].set_title('Spectral Contrast')
fig.tight_layout()
```
Visual inspection reveals banding artifacts in mel spectrograms, unnatural energy cutoffs above the vocoder's frequency ceiling, and periodic noise patterns in the high-frequency range that are characteristic of neural speech synthesis.
Compile findings into an actionable report:
```
DEEPFAKE AUDIO ANALYSIS REPORT
================================
File: suspect_executive_call.wav
Duration: 47.3 seconds
Sample Rate: 16000 Hz
Analysis Date: 2026-03-19

CLASSIFICATION RESULT
Verdict: LIKELY DEEPFAKE (confidence: 94.2%)
Ensemble Score: RF=0.91, GBT=0.97, Avg=0.94

FEATURE ANOMALIES DETECTED
- MFCC variance in coefficients 13-20: 62% below genuine baseline
- Spectral contrast (4-8 kHz): 0.23 (genuine avg: 0.41)
- Pitch jitter: 0.8 Hz (genuine avg: 2.4 Hz)
- Zero-crossing rate std: 0.003 (genuine avg: 0.011)

SPECTROGRAM ARTIFACTS
- Energy cutoff above 7.8 kHz (consistent with neural vocoder ceiling)
- Banding pattern at 50ms intervals in mel spectrogram
- Missing formant transitions at 12.4s, 23.1s, 35.7s timestamps

RECOMMENDATION
High confidence of AI-generated audio. Recommend out-of-band
verification with the purported speaker. Preserve original audio
file with chain of custody documentation for potential legal action.
```
| Term | Definition |
|---|---|
| MFCC | Mel-Frequency Cepstral Coefficients; representation of the short-term power spectrum on a mel (perceptual) frequency scale |
| Spectral Centroid | Weighted mean of frequencies present in the signal; indicates perceived brightness of a sound |
| Spectral Contrast | Difference in amplitude between peaks and valleys in the spectrum across frequency sub-bands |
| Vocoder | Signal processing component that synthesizes audio waveforms from acoustic features; used in TTS and voice cloning |
| Pitch Jitter | Cycle-to-cycle variation in fundamental frequency; natural in human speech, reduced in synthetic speech |
| Vishing | Voice phishing; social engineering attack conducted via phone calls, increasingly using AI-cloned voices |
| Formant | Resonant frequencies of the vocal tract that define vowel sounds; transitions between formants are difficult for AI to replicate perfectly |
Context: CFO receives a phone call appearing to be from the CEO requesting an urgent wire transfer of $2.3M. The call came from an unknown number but the voice sounded identical to the CEO. IT security was able to obtain a recording of the call from the phone system.
Approach:
1. Obtain the recording from the phone system and preserve the original with chain-of-custody documentation.
2. Preprocess: resample to 16 kHz mono, trim silence, normalize amplitude.
3. Extract MFCC and spectral features, then classify with the RF/GBT ensemble to obtain a confidence score.
4. Run prosody analysis (pitch jitter, stability) and generate spectrograms for manual artifact review.
5. Compile the forensic report and recommend out-of-band verification with the CEO before any transfer.

Pitfalls:
- Telephony codecs compress aggressively and can themselves impose high-frequency cutoffs and artifacts that mimic vocoder ceilings; compare against genuine recordings from the same channel.
- Short, noisy, or heavily re-encoded samples reduce feature reliability and classifier confidence.
- A classifier trained on one generation of synthesis tools may miss newer voice-cloning models.
- A confidence score is probabilistic evidence, not proof; corroborate with out-of-band verification and call metadata before acting.