When the user wants to analyze procurement spend, identify savings opportunities, or classify expenditures. Also use when the user mentions "spend analytics," "spend visibility," "spend cube," "category analysis," "Pareto analysis," "tail spend," "maverick spend," "spend classification," or "savings opportunity." For executing sourcing strategies, see strategic-sourcing. For supplier evaluation, see supplier-selection.
You are an expert in procurement spend analysis and analytics. Your goal is to help organizations understand their spending patterns, identify savings opportunities, improve compliance, and enable data-driven procurement decisions through comprehensive spend visibility and analysis.
Before conducting spend analysis, understand:

- Analysis Objectives
- Data Availability
- Current State
- Scope & Resources
Three Dimensions:
Analysis Types:
Primary Sources:
Secondary Sources:
```python
import pandas as pd
import numpy as np

def extract_spend_data(data_sources):
    """
    Extract and consolidate spend data from multiple sources.
    data_sources: dict with {source_name: file_path or dataframe}
    """
    all_data = []
    for source, data in data_sources.items():
        if isinstance(data, str):
            # Load from file
            if data.endswith('.csv'):
                df = pd.read_csv(data)
            elif data.endswith('.xlsx'):
                df = pd.read_excel(data)
            else:
                raise ValueError(f"Unsupported file type: {data}")
        else:
            df = data.copy()
        # Add source column
        df['data_source'] = source
        # Standardize column names
        column_mapping = {
            'vendor': 'supplier_name',
            'vendor_name': 'supplier_name',
            'supplier': 'supplier_name',
            'amount': 'spend_amount',
            'total': 'spend_amount',
            'date': 'transaction_date',
            'invoice_date': 'transaction_date',
            'payment_date': 'transaction_date',
        }
        df = df.rename(columns={
            k: v for k, v in column_mapping.items() if k in df.columns
        })
        all_data.append(df)
    # Concatenate all sources
    consolidated = pd.concat(all_data, ignore_index=True, sort=False)
    return consolidated
```
```python
def clean_spend_data(df):
    """
    Clean and standardize spend data.
    Returns: cleaned DataFrame
    """
    df = df.copy()
    # Remove duplicates
    initial_rows = len(df)
    df = df.drop_duplicates(
        subset=['supplier_name', 'transaction_date', 'spend_amount'],
        keep='first'
    )
    duplicates_removed = initial_rows - len(df)
    # Standardize supplier names
    df['supplier_name'] = df['supplier_name'].str.strip().str.upper()
    df['supplier_name'] = df['supplier_name'].str.replace(r'\s+', ' ', regex=True)
    # Normalize punctuation on common legal suffixes (e.g. "INC." -> "INC")
    df['supplier_name'] = df['supplier_name'].str.replace(
        r'\b(INC|LLC|CORP|LTD)\.$', r'\1', regex=True
    )
    # Ensure numeric spend
    df['spend_amount'] = pd.to_numeric(df['spend_amount'], errors='coerce')
    # Remove negative amounts (credits handled separately)
    df = df[df['spend_amount'] > 0]
    # Convert dates
    df['transaction_date'] = pd.to_datetime(df['transaction_date'], errors='coerce')
    # Extract year, month, quarter
    df['year'] = df['transaction_date'].dt.year
    df['month'] = df['transaction_date'].dt.month
    df['quarter'] = df['transaction_date'].dt.quarter
    # Remove rows with missing critical fields
    df = df.dropna(subset=['supplier_name', 'spend_amount', 'transaction_date'])
    print("Data Cleaning Summary:")
    print(f"  Duplicates removed: {duplicates_removed:,}")
    print(f"  Final records: {len(df):,}")
    print(f"  Total spend: ${df['spend_amount'].sum():,.2f}")
    return df
```
```python
def enrich_spend_data(df, supplier_mapping=None, category_mapping=None):
    """
    Enrich spend data with classifications.
    supplier_mapping: dict or DataFrame with supplier standardization
    category_mapping: dict or DataFrame with category assignments
    """
    df = df.copy()
    # Supplier normalization
    if supplier_mapping is not None:
        if isinstance(supplier_mapping, dict):
            df['supplier_normalized'] = df['supplier_name'].map(supplier_mapping)
        else:
            df = df.merge(
                supplier_mapping[['supplier_name', 'supplier_normalized']],
                on='supplier_name',
                how='left'
            )
        # Use normalized name if available, otherwise original
        df['supplier_normalized'] = df['supplier_normalized'].fillna(df['supplier_name'])
    else:
        df['supplier_normalized'] = df['supplier_name']
    # Category classification
    if category_mapping is not None:
        if isinstance(category_mapping, dict):
            df['category'] = df['supplier_normalized'].map(category_mapping)
        else:
            df = df.merge(
                category_mapping[['supplier_normalized', 'category']],
                on='supplier_normalized',
                how='left'
            )
        # Flag unclassified spend
        df['category'] = df['category'].fillna('Unclassified')
    else:
        df['category'] = 'Unclassified'
    # Additional enrichment: spend tier based on transaction amount
    df['spend_tier'] = pd.cut(
        df['spend_amount'],
        bins=[0, 1000, 10000, 100000, float('inf')],
        labels=['<$1K', '$1K-$10K', '$10K-$100K', '>$100K']
    )
    return df
```
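The supplier-mapping dict passed to `enrich_spend_data` usually has to be built first, and fuzzy string matching (the fuzzywuzzy library listed under tools below is typical) is the common technique. As a minimal, dependency-free sketch, the stdlib `difflib` can stand in; the `suggest_supplier_matches` helper and the example names here are hypothetical:

```python
import difflib

def suggest_supplier_matches(names, canonical, cutoff=0.7):
    """Map raw supplier names to the closest canonical name.

    Uses stdlib difflib as a lightweight stand-in for fuzzywuzzy;
    names below `cutoff` similarity are left unmapped (None) for review.
    """
    mapping = {}
    for name in names:
        matches = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
        mapping[name] = matches[0] if matches else None
    return mapping

raw = ['ACME CORP', 'ACME CORPORATION', 'ACME C0RP.', 'GLOBEX LLC']
canonical = ['ACME CORP', 'GLOBEX LLC']
print(suggest_supplier_matches(raw, canonical))
```

Because the cleaning step above upper-cases names first, case sensitivity in `difflib` is not an issue; anything left as `None` goes to manual review rather than being auto-mapped.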
Common Category Hierarchy:

- Level 1: Major Category
- Level 2: Sub-Category
- Level 3: Commodity
- Level 4: Item/SKU

Examples:

- Direct Materials
  - Raw Materials
    - Steel
      - Cold-rolled steel
      - Hot-rolled steel
  - Components
    - Electronics
    - Mechanical parts
- Indirect Materials
  - MRO (Maintenance, Repair, Operations)
    - Tools
    - Safety equipment
  - Facilities
    - Utilities
    - Cleaning supplies
- Services
  - Professional Services
    - Consulting
    - Legal
  - IT Services
    - Software licenses
    - Managed services
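In tabular form, the hierarchy is usually carried as one column per level, so a roll-up at any depth is just a groupby on a prefix of the level columns. A minimal sketch with made-up transactions:

```python
import pandas as pd

# Hypothetical transactions tagged down to level 4;
# higher levels roll up with a simple groupby.
txns = pd.DataFrame({
    'level_1': ['Direct Materials', 'Direct Materials', 'Indirect Materials'],
    'level_2': ['Raw Materials', 'Raw Materials', 'MRO'],
    'level_3': ['Steel', 'Steel', 'Tools'],
    'level_4': ['Cold-rolled steel', 'Hot-rolled steel', 'Drill bits'],
    'spend_amount': [120000.0, 80000.0, 5000.0],
})

# Level-1 roll-up
by_l1 = txns.groupby('level_1')['spend_amount'].sum()
print(by_l1)

# Deeper roll-ups are just longer groupby keys
by_l3 = txns.groupby(['level_1', 'level_2', 'level_3'])['spend_amount'].sum()
```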
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class SpendClassifier:
    """Machine learning-based spend classification"""

    def __init__(self):
        self.model = Pipeline([
            ('tfidf', TfidfVectorizer(max_features=500, ngram_range=(1, 2))),
            ('classifier', MultinomialNB())
        ])
        self.trained = False

    def train(self, training_data):
        """
        Train classifier on labeled data.
        training_data: DataFrame with 'description' and 'category' columns
        """
        X = training_data['description']
        y = training_data['category']
        self.model.fit(X, y)
        self.trained = True
        # In-sample accuracy (use a held-out set for a realistic estimate)
        train_accuracy = self.model.score(X, y)
        print(f"Training accuracy: {train_accuracy:.2%}")

    def predict(self, descriptions):
        """Predict categories for new descriptions"""
        if not self.trained:
            raise ValueError("Model must be trained first")
        predictions = self.model.predict(descriptions)
        probabilities = self.model.predict_proba(descriptions)
        return pd.DataFrame({
            'description': list(descriptions),
            'predicted_category': predictions,
            'confidence': probabilities.max(axis=1)
        })

    def classify_spend_data(self, df, description_column='description',
                            confidence_threshold=0.7):
        """
        Classify entire spend dataset.
        Returns: DataFrame with predicted categories
        """
        if not self.trained:
            raise ValueError("Model must be trained first")
        predictions = self.predict(df[description_column])
        # Add to original dataframe (positionally, to avoid index misalignment)
        result = df.copy()
        result['predicted_category'] = predictions['predicted_category'].to_numpy()
        result['classification_confidence'] = predictions['confidence'].to_numpy()
        # Flag low-confidence rows for manual review
        result['needs_review'] = result['classification_confidence'] < confidence_threshold
        return result

# Example usage with a handful of labeled descriptions
training_data = pd.DataFrame({
    'description': [
        'office supplies paper pens',
        'laptop computer IT hardware',
        'consulting services strategic',
        'steel raw material manufacturing',
        'janitorial cleaning service',
        'software license subscription',
    ],
    'category': [
        'Office Supplies',
        'IT Hardware',
        'Professional Services',
        'Raw Materials',
        'Facilities',
        'IT Software',
    ]
})

classifier = SpendClassifier()
classifier.train(training_data)

# Classify new spend
new_spend = pd.DataFrame({
    'description': [
        'desktop computers monitors',
        'office furniture desks chairs',
        'legal advisory services'
    ]
})
predictions = classifier.predict(new_spend['description'])
print(predictions)
```
```python
def pareto_analysis(df, group_by='supplier_normalized', spend_column='spend_amount'):
    """
    Perform Pareto analysis to identify top contributors.
    Returns: DataFrame with cumulative spend percentages
    """
    # Aggregate spend
    spend_summary = df.groupby(group_by)[spend_column].sum().reset_index()
    spend_summary = spend_summary.sort_values(spend_column, ascending=False)
    # Calculate percentages
    total_spend = spend_summary[spend_column].sum()
    spend_summary['spend_pct'] = spend_summary[spend_column] / total_spend * 100
    spend_summary['cumulative_pct'] = spend_summary['spend_pct'].cumsum()
    # Classify into A (to 80%), B (80-95%), C (remainder)
    spend_summary['abc_class'] = 'C'
    spend_summary.loc[spend_summary['cumulative_pct'] <= 80, 'abc_class'] = 'A'
    spend_summary.loc[
        (spend_summary['cumulative_pct'] > 80) &
        (spend_summary['cumulative_pct'] <= 95),
        'abc_class'
    ] = 'B'
    # Add rank
    spend_summary['rank'] = range(1, len(spend_summary) + 1)
    return spend_summary
```
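The 80/95 cut-offs can be sanity-checked on a toy dataset. The standalone snippet below re-derives the same classification logic on five made-up supplier totals (total 1,000, so cumulative shares are easy to eyeball):

```python
import pandas as pd

# Toy supplier totals: cumulative % is 50, 80, 92, 97, 100
spend = pd.Series({'S1': 500, 'S2': 300, 'S3': 120, 'S4': 50, 'S5': 30})
summary = spend.sort_values(ascending=False).to_frame('spend')
summary['cumulative_pct'] = summary['spend'].cumsum() / summary['spend'].sum() * 100

# Same A/B/C rule as pareto_analysis above
summary['abc_class'] = 'C'
summary.loc[summary['cumulative_pct'] <= 80, 'abc_class'] = 'A'
summary.loc[(summary['cumulative_pct'] > 80) &
            (summary['cumulative_pct'] <= 95), 'abc_class'] = 'B'
print(summary)
```

Note that S2 lands exactly on the 80% boundary and is classed A, since the rule is inclusive (`<= 80`).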
```python
# Example with visualization
import matplotlib.pyplot as plt

def plot_pareto(df, group_by='supplier_normalized', top_n=20):
    """Create Pareto chart"""
    pareto_df = pareto_analysis(df, group_by=group_by)
    # Take top N
    plot_data = pareto_df.head(top_n)
    fig, ax1 = plt.subplots(figsize=(14, 6))
    # Bar chart of individual spend share
    x = range(len(plot_data))
    ax1.bar(x, plot_data['spend_pct'], color='steelblue', alpha=0.7)
    ax1.set_xlabel(group_by.replace('_', ' ').title())
    ax1.set_ylabel('% of Total Spend', color='steelblue')
    ax1.tick_params(axis='y', labelcolor='steelblue')
    ax1.set_xticks(x)
    ax1.set_xticklabels(plot_data[group_by], rotation=45, ha='right')
    # Cumulative line on a secondary axis
    ax2 = ax1.twinx()
    ax2.plot(x, plot_data['cumulative_pct'], color='red', marker='o', linewidth=2)
    ax2.set_ylabel('Cumulative %', color='red')
    ax2.tick_params(axis='y', labelcolor='red')
    ax2.set_ylim(0, 105)
    ax2.axhline(y=80, color='gray', linestyle='--', alpha=0.5)
    ax2.text(len(plot_data) - 1, 82, '80%', color='gray')
    plt.title(f'Pareto Analysis: Top {top_n} {group_by.replace("_", " ").title()}')
    plt.tight_layout()
    return fig, pareto_df

# Usage
# fig, pareto_summary = plot_pareto(spend_df, group_by='supplier_normalized')
# plt.show()
```
```python
def supplier_consolidation_opportunity(df, category='category'):
    """
    Identify opportunities for supplier consolidation.
    Returns: DataFrame with consolidation opportunities by category
    """
    consolidation_opportunities = []
    for cat in df[category].unique():
        cat_data = df[df[category] == cat]
        num_suppliers = cat_data['supplier_normalized'].nunique()
        total_spend = cat_data['spend_amount'].sum()
        num_transactions = len(cat_data)
        # Per-supplier spend within the category
        supplier_spend = cat_data.groupby('supplier_normalized')['spend_amount'].sum()
        # Top supplier
        top_supplier = supplier_spend.idxmax()
        top_supplier_pct = supplier_spend.max() / total_spend * 100
        # Tail spend (suppliers with <2% of category spend)
        tail_mask = supplier_spend / total_spend < 0.02
        tail_suppliers = tail_mask.sum()
        tail_spend = supplier_spend[tail_mask].sum()
        # Heuristic opportunity score: more suppliers + more tail spend = higher opportunity
        opportunity_score = (
            (num_suppliers / 10) * 40 +        # Supplier count (normalized)
            (tail_spend / total_spend) * 60    # Tail spend share
        )
        consolidation_opportunities.append({
            'category': cat,
            'total_spend': total_spend,
            'num_suppliers': num_suppliers,
            'num_transactions': num_transactions,
            'top_supplier': top_supplier,
            'top_supplier_pct': top_supplier_pct,
            'tail_suppliers': tail_suppliers,
            'tail_spend': tail_spend,
            'tail_spend_pct': tail_spend / total_spend * 100,
            'opportunity_score': min(100, opportunity_score)
        })
    result = pd.DataFrame(consolidation_opportunities)
    result = result.sort_values('opportunity_score', ascending=False)
    return result

# Example
# consolidation_opps = supplier_consolidation_opportunity(spend_df)
# print(consolidation_opps.head(10))
```
Maverick Spend = Off-contract or non-preferred supplier purchases
```python
def detect_maverick_spend(df, preferred_suppliers, contracts):
    """
    Identify maverick (off-contract) spend.
    preferred_suppliers: list of approved suppliers
    contracts: DataFrame with contract details
    """
    df = df.copy()
    # Flag non-preferred suppliers
    df['is_preferred_supplier'] = df['supplier_normalized'].isin(preferred_suppliers)
    # Match to contracts (requires the 'category' column from enrichment)
    if contracts is not None:
        df = df.merge(
            contracts[['supplier_normalized', 'category', 'contract_id']],
            on=['supplier_normalized', 'category'],
            how='left'
        )
        df['has_contract'] = df['contract_id'].notna()
    else:
        df['has_contract'] = False
    # Classify spend
    df['spend_type'] = 'Maverick'
    df.loc[df['is_preferred_supplier'] & df['has_contract'], 'spend_type'] = 'On-Contract'
    df.loc[df['is_preferred_supplier'] & ~df['has_contract'], 'spend_type'] = 'Preferred (No Contract)'
    # Summarize
    maverick_summary = df.groupby('spend_type')['spend_amount'].agg([
        ('total_spend', 'sum'),
        ('num_transactions', 'count')
    ]).reset_index()
    total = df['spend_amount'].sum()
    maverick_summary['pct_of_total'] = maverick_summary['total_spend'] / total * 100
    return df, maverick_summary

# Example
preferred_suppliers = ['SUPPLIER A INC', 'SUPPLIER B LLC', 'SUPPLIER C CORP']
contracts = pd.DataFrame({
    'supplier_normalized': ['SUPPLIER A INC', 'SUPPLIER B LLC'],
    'category': ['Office Supplies', 'IT Hardware'],
    'contract_id': ['CNT-001', 'CNT-002']
})
# spend_with_flags, maverick_summary = detect_maverick_spend(
#     spend_df, preferred_suppliers, contracts
# )
# print("\nMaverick Spend Summary:")
# print(maverick_summary)
```
```python
def price_variance_analysis(df, item_column='item_description',
                            price_column='unit_price'):
    """
    Analyze price variance for the same item across transactions.
    Identifies opportunities for price standardization.
    """
    # Per-item price statistics
    price_stats = df.groupby(item_column)[price_column].agg([
        ('min_price', 'min'),
        ('max_price', 'max'),
        ('avg_price', 'mean'),
        ('std_price', 'std'),
        ('num_purchases', 'count')
    ]).reset_index()
    # Calculate variance
    price_stats['price_variance'] = price_stats['max_price'] - price_stats['min_price']
    price_stats['variance_pct'] = (
        price_stats['price_variance'] / price_stats['avg_price'] * 100
    )
    # Potential savings if all purchases had been made at the minimum price
    item_spend = df.groupby(item_column)['spend_amount'].sum()
    price_stats = price_stats.merge(
        item_spend.rename('total_spend'),
        left_on=item_column,
        right_index=True
    )
    price_stats['potential_savings'] = (
        price_stats['total_spend'] *
        (1 - price_stats['min_price'] / price_stats['avg_price'])
    )
    # Sort by savings opportunity
    price_stats = price_stats.sort_values('potential_savings', ascending=False)
    # Keep only significant, repeated variances
    price_stats = price_stats[
        (price_stats['variance_pct'] > 10) &
        (price_stats['num_purchases'] >= 3)
    ]
    return price_stats
```
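The potential-savings term above assumes every purchase could have been made at the observed minimum price, scaled through the average price. A quick standalone check of the arithmetic on one hypothetical item:

```python
# One item bought three times at different prices, equal quantities
prices = [10.0, 12.0, 8.0]
qtys = [100, 100, 100]

spend = sum(p * q for p, q in zip(prices, qtys))   # total spend on the item
avg_price = sum(prices) / len(prices)
min_price = min(prices)

# Savings if every purchase had been made at the minimum price
potential_savings = spend * (1 - min_price / avg_price)
print(potential_savings)
```

With equal quantities the formula is exact (here it matches `spend - min_price * total_qty`); with skewed quantities it is only an estimate, which is why the output is framed as *potential* savings.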
1. Price Reduction
2. Demand Management
3. Process Improvement
4. Supplier Optimization
```python
class SavingsOpportunityCalculator:
    """Calculate and prioritize savings opportunities"""

    def __init__(self, current_spend):
        self.current_spend = current_spend
        self.opportunities = []

    def add_price_reduction(self, category, baseline_spend,
                            current_price, target_price, confidence=0.8):
        """Add price reduction opportunity"""
        savings_pct = (current_price - target_price) / current_price
        annual_savings = baseline_spend * savings_pct
        self.opportunities.append({
            'category': category,
            'type': 'Price Reduction',
            'baseline_spend': baseline_spend,
            'savings_pct': savings_pct * 100,
            'gross_savings': annual_savings,
            'confidence': confidence * 100,
            'risk_adjusted_savings': annual_savings * confidence,
            'implementation_effort': 'Medium',
            'timeline_months': 3
        })

    def add_consolidation(self, category, baseline_spend,
                          current_suppliers, target_suppliers,
                          expected_discount=0.05, confidence=0.7):
        """Add supplier consolidation opportunity"""
        annual_savings = baseline_spend * expected_discount
        self.opportunities.append({
            'category': category,
            'type': 'Supplier Consolidation',
            'baseline_spend': baseline_spend,
            'savings_pct': expected_discount * 100,
            'gross_savings': annual_savings,
            'confidence': confidence * 100,
            'risk_adjusted_savings': annual_savings * confidence,
            'implementation_effort': 'High',
            'timeline_months': 6,
            'details': f'Reduce from {current_suppliers} to {target_suppliers} suppliers'
        })

    def add_demand_reduction(self, category, baseline_spend,
                             reduction_pct, confidence=0.6):
        """Add demand/usage reduction opportunity"""
        annual_savings = baseline_spend * reduction_pct
        self.opportunities.append({
            'category': category,
            'type': 'Demand Reduction',
            'baseline_spend': baseline_spend,
            'savings_pct': reduction_pct * 100,
            'gross_savings': annual_savings,
            'confidence': confidence * 100,
            'risk_adjusted_savings': annual_savings * confidence,
            'implementation_effort': 'High',
            'timeline_months': 12
        })

    def add_process_improvement(self, category, baseline_spend,
                                process_cost_reduction, confidence=0.9):
        """Add process improvement opportunity"""
        annual_savings = process_cost_reduction
        self.opportunities.append({
            'category': category,
            'type': 'Process Improvement',
            'baseline_spend': baseline_spend,
            'savings_pct': (annual_savings / baseline_spend) * 100,
            'gross_savings': annual_savings,
            'confidence': confidence * 100,
            'risk_adjusted_savings': annual_savings * confidence,
            'implementation_effort': 'Medium',
            'timeline_months': 6
        })

    def get_summary(self):
        """Get prioritized savings opportunities"""
        if not self.opportunities:
            return None
        df = pd.DataFrame(self.opportunities)
        df = df.sort_values('risk_adjusted_savings', ascending=False)
        # Add cumulative savings
        df['cumulative_savings'] = df['risk_adjusted_savings'].cumsum()
        total_gross = df['gross_savings'].sum()
        total_risk_adjusted = df['risk_adjusted_savings'].sum()
        return {
            'total_opportunities': len(df),
            'total_gross_savings': total_gross,
            'total_risk_adjusted_savings': total_risk_adjusted,
            'savings_pct_of_spend': total_risk_adjusted / self.current_spend * 100,
            'opportunities': df
        }

# Example usage
calculator = SavingsOpportunityCalculator(current_spend=10000000)
calculator.add_price_reduction(
    category='Office Supplies',
    baseline_spend=500000,
    current_price=10.0,
    target_price=9.0,
    confidence=0.85
)
calculator.add_consolidation(
    category='IT Hardware',
    baseline_spend=1200000,
    current_suppliers=8,
    target_suppliers=3,
    expected_discount=0.08,
    confidence=0.75
)
calculator.add_demand_reduction(
    category='Travel',
    baseline_spend=800000,
    reduction_pct=0.15,
    confidence=0.60
)
calculator.add_process_improvement(
    category='All Categories',
    baseline_spend=10000000,
    process_cost_reduction=100000,
    confidence=0.90
)

summary = calculator.get_summary()
print(f"Total Risk-Adjusted Savings: ${summary['total_risk_adjusted_savings']:,.0f}")
print(f"Savings % of Spend: {summary['savings_pct_of_spend']:.1f}%")
print("\nTop Opportunities:")
print(summary['opportunities'][['category', 'type', 'risk_adjusted_savings']].head())
```
Data Analysis:

- pandas: Data manipulation and analysis
- numpy: Numerical computations
- scipy: Statistical analysis

Machine Learning:

- scikit-learn: Classification and clustering
- fuzzywuzzy: Fuzzy string matching for supplier names
- spacy, nltk: Natural language processing

Visualization:

- matplotlib, seaborn: Charts and plots
- plotly: Interactive dashboards
- dash: Web-based analytics apps

Spend Analytics Platforms:
Business Intelligence:
Problem:
Solutions:
Problem:
Solutions:
Problem:
Solutions:
Problem:
Solutions:
Problem:
Solutions:
Executive Summary:
Spend Overview:
| Metric | Value | % of Total |
|---|---|---|
| Total Spend | $12.5M | 100% |
| Managed Spend (under contract) | $8.2M | 66% |
| Unmanaged Spend | $4.3M | 34% |
| Number of Suppliers | 1,247 | - |
| Number of Transactions | 45,320 | - |
Spend by Category:
| Category | Spend | % of Total | Suppliers | Avg Transaction | Opportunity |
|---|---|---|---|---|---|
| IT Hardware | $2.1M | 17% | 15 | $12,350 | High |
| Professional Services | $1.8M | 14% | 45 | $8,200 | Medium |
| Office Supplies | $1.2M | 10% | 120 | $450 | High |
| Raw Materials | $3.4M | 27% | 25 | $28,500 | Low |
| MRO | $0.9M | 7% | 200 | $380 | High |
| Other | $3.1M | 25% | 842 | $1,100 | Medium |
Pareto Analysis:
Top 20 Suppliers (1.6% of supplier base):
- Account for 68% of total spend ($8.5M)
- Average spend per supplier: $425K
Next 80 Suppliers (6.4%):
- Account for 22% of spend ($2.8M)
- Average spend: $35K
Tail 1,147 Suppliers (92%):
- Account for 10% of spend ($1.2M)
- Average spend: $1,050
- Consolidation opportunity: Reduce to ~400 suppliers
Supplier Concentration:
| Risk Level | # Suppliers | Spend | % of Total |
|---|---|---|---|
| Critical (>$500K) | 8 | $5.2M | 42% |
| High ($100K-$500K) | 28 | $3.8M | 30% |
| Medium ($10K-$100K) | 186 | $2.3M | 18% |
| Low (<$10K) | 1,025 | $1.2M | 10% |
Savings Opportunities:
| Opportunity | Category | Baseline Spend | Potential Savings | Confidence | Priority |
|---|---|---|---|---|---|
| Consolidate tail suppliers | Office Supplies | $1.2M | $120K (10%) | 80% | High |
| Competitive bid | IT Hardware | $2.1M | $168K (8%) | 85% | High |
| Standardize specs | MRO | $0.9M | $72K (8%) | 70% | Medium |
| Contract compliance | Prof Services | $1.8M | $90K (5%) | 90% | High |
| Price benchmarking | Raw Materials | $3.4M | $102K (3%) | 75% | Medium |
| Total | - | $9.4M | $552K (5.9%) | - | - |
Compliance & Risk:
Maverick Spend: $1.8M (14% of total)
Single-Source Risk: 12 suppliers (Critical dependencies)
Price Variance: $245K potential savings
Recommended Actions:
Immediate (0-3 months)
Near-term (3-6 months)
Long-term (6-12 months)
If you need more context: