Provides FinOps practices for cloud cost optimization, budget management, and resource rightsizing. Use when analyzing cloud spend, optimizing costs, or when user mentions 'finops', 'cloud cost', 'rightsizing', 'budget', 'cost allocation', 'reserved instance', 'spot instance', 'showback', 'chargeback'.
# FinOps Patterns
Best practices for managing cloud financial operations -- visibility, optimization, and governance of cloud spend at scale.
Understanding where cloud money goes is the first step to controlling it.
Total Cloud Spend
|
+-- Compute (60-70%)
|   +-- VMs / Instances
|   +-- Containers (EKS, GKE, ECS)
|   +-- Serverless (Lambda, Cloud Functions)
|   +-- GPU / ML Training
|
+-- Storage (15-20%)
|   +-- Block (EBS, Persistent Disks)
|   +-- Object (S3, GCS, Blob)
|   +-- Database (RDS, Cloud SQL, DynamoDB)
|
+-- Network (10-15%)
|   +-- Data Transfer (egress is expensive)
|   +-- Load Balancers
|   +-- CDN
|   +-- VPN / Direct Connect
|
+-- Other (5-10%)
    +-- Monitoring / Logging
    +-- DNS / Certificates
    +-- Support Plans
    +-- Marketplace
| Cost Driver | Typical Waste | Quick Win |
|---|---|---|
| Idle instances | 20-30% of compute | Schedule dev/test shutdown nights + weekends |
| Over-provisioned instances | 40-60% of instances | Rightsize based on actual CPU/memory usage |
| Unattached storage volumes | 5-10% of storage | Automated cleanup of orphaned EBS/disks |
| Unused Elastic IPs | Small but cumulative | Release unattached IPs |
| Old snapshots | 10-20% of storage | Lifecycle policy with retention limits |
| Cross-region data transfer | 5-15% of network | Co-locate services in same region |
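As a rough illustration of the first quick win in the table, the sketch below estimates how much of an always-on bill is avoided when dev/test instances run only during working hours. The 12-hour weekday schedule, the instance count, and the hourly rate are assumptions chosen for the example, not measured figures.

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours in a week


def scheduled_savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on cost avoided by running only on a schedule."""
    if not 0 <= hours_per_day <= 24 or not 0 <= days_per_week <= 7:
        raise ValueError("schedule out of range")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


def monthly_savings(hourly_rate: float, instance_count: int,
                    fraction: float, hours_per_month: int = 730) -> float:
    """Dollar savings per month for a fleet of identical instances."""
    return hourly_rate * hours_per_month * instance_count * fraction


if __name__ == "__main__":
    # Assumed schedule: 12 hours a day, Monday to Friday.
    frac = scheduled_savings_fraction(hours_per_day=12, days_per_week=5)
    print(f"Savings fraction: {frac:.1%}")
    # Assumed fleet: 10 instances at $0.192/hour.
    print(f"Monthly savings: ${monthly_savings(0.192, 10, frac):,.2f}")
```

Under these assumptions the instances run 60 of 168 hours per week, so roughly two thirds of their on-demand cost goes away. Storage attached to stopped instances is still billed and is not modeled here.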
Organizations typically progress through three phases of FinOps maturity: Crawl, Walk, and Run.
| Dimension | Crawl | Walk | Run |
|---|---|---|---|
| Visibility | Monthly bill review | Tagged cost allocation by team | Real-time cost dashboards per service |
| Allocation | Single account, no tagging | Cost centers with basic tags | Full showback/chargeback by product line |
| Optimization | Ad-hoc rightsizing | Quarterly review with recommendations | Automated rightsizing and scaling policies |
| Forecasting | None | Spreadsheet-based projections | ML-based anomaly detection and forecasts |
| Governance | No budgets | Account-level budgets | Per-team budgets with automated enforcement |
| Commitment | On-demand only | Some Reserved Instances | Actively managed portfolio of Savings Plans, Spot, and RIs |
| Culture | Central IT pays the bill | Engineering aware of costs | Engineers own cost as a feature metric |
| Tooling | AWS Console only | Cost Explorer + basic reports | FinOps platform (Kubecost, CloudHealth, etc.) |
Tags are the foundation of cost visibility. Without consistent tagging, cost allocation is impossible.
# tagging-policy.yaml -- Enforced via AWS Organizations SCP or Terraform
required_tags:
  - key: "team"
    description: "Owning team (must match teams registry)"
    example: "platform-engineering"
    validation: "^[a-z]+-[a-z]+$"
  - key: "service"
    description: "Service or application name"
    example: "payment-api"
    validation: "^[a-z]+-[a-z]+(-[a-z]+)?$"
  - key: "environment"
    description: "Deployment environment"
    allowed_values: ["production", "staging", "development", "sandbox"]
  - key: "cost-center"
    description: "Finance cost center code"
    example: "CC-4200"
    validation: "^CC-\\d{4}$"
  - key: "data-classification"
    description: "Data sensitivity level"
    allowed_values: ["public", "internal", "confidential", "restricted"]
optional_tags:
  - key: "project"
    description: "Project or initiative code"
  - key: "managed-by"
    description: "IaC tool managing this resource"
    allowed_values: ["terraform", "pulumi", "cloudformation", "manual"]
  - key: "expiry"
    description: "Auto-delete date for temporary resources"
    format: "YYYY-MM-DD"
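The policy above can be checked before deployment, for example in a CI step. A minimal sketch follows, assuming the required-tag rules are mirrored in a Python dictionary; `REQUIRED_TAGS` and `validate_tags` are hypothetical names for this example, and a real pipeline would more likely load the rules from the YAML file.

```python
import re

# Hypothetical in-memory copy of the required-tag rules from tagging-policy.yaml.
REQUIRED_TAGS = {
    "team": {"pattern": r"^[a-z]+-[a-z]+$"},
    "service": {"pattern": r"^[a-z]+-[a-z]+(-[a-z]+)?$"},
    "environment": {"allowed": {"production", "staging", "development", "sandbox"}},
    "cost-center": {"pattern": r"^CC-\d{4}$"},
    "data-classification": {"allowed": {"public", "internal", "confidential", "restricted"}},
}


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    for key, rule in REQUIRED_TAGS.items():
        value = tags.get(key)
        if value is None:
            problems.append(f"missing required tag '{key}'")
        elif "pattern" in rule and not re.fullmatch(rule["pattern"], value):
            problems.append(f"tag '{key}' has invalid value '{value}'")
        elif "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"tag '{key}' must be one of {sorted(rule['allowed'])}")
    return problems


if __name__ == "__main__":
    example = {
        "team": "platform-engineering",
        "service": "payment-api",
        "environment": "production",
        "cost-center": "4200",  # missing the CC- prefix
    }
    for problem in validate_tags(example):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a single run report every violation on a resource.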
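{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTeamTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "s3:CreateBucket", "eks:CreateCluster", "lambda:CreateFunction"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:s3:::*", "arn:aws:eks:*:*:cluster/*", "arn:aws:lambda:*:*:function:*"],
      "Condition": {"Null": {"aws:RequestTag/team": "true"}}
    },
    {
      "Sid": "RequireServiceTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "s3:CreateBucket", "eks:CreateCluster", "lambda:CreateFunction"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:s3:::*", "arn:aws:eks:*:*:cluster/*", "arn:aws:lambda:*:*:function:*"],
      "Condition": {"Null": {"aws:RequestTag/service": "true"}}
    },
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "s3:CreateBucket", "eks:CreateCluster", "lambda:CreateFunction"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:s3:::*", "arn:aws:eks:*:*:cluster/*", "arn:aws:lambda:*:*:function:*"],
      "Condition": {"Null": {"aws:RequestTag/environment": "true"}}
    },
    {
      "Sid": "RequireCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "s3:CreateBucket", "eks:CreateCluster", "lambda:CreateFunction"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:s3:::*", "arn:aws:eks:*:*:cluster/*", "arn:aws:lambda:*:*:function:*"],
      "Condition": {"Null": {"aws:RequestTag/cost-center": "true"}}
    }
  ]
}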
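IAM combines multiple keys under a single condition operator with a logical AND, so requiring every tag takes one Deny statement per tag: a lone `Null` block listing several tags only matches requests that are missing all of them. A minimal sketch of generating such a policy from a tag list follows; `build_tag_scp` is a hypothetical helper, and the actions and resources passed in are assumptions for the example.

```python
import json


def build_tag_scp(required_tags: list[str], actions: list[str],
                  resources: list[str]) -> dict:
    """Build an SCP document with one Deny statement per required tag."""
    statements = []
    for tag in required_tags:
        # Sid values must be alphanumeric, so "cost-center" becomes "CostCenter".
        sid = "Require" + "".join(part.capitalize() for part in tag.split("-")) + "Tag"
        statements.append({
            "Sid": sid,
            "Effect": "Deny",
            "Action": actions,
            "Resource": resources,
            # Matches when this one tag is absent from the create request.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    policy = build_tag_scp(
        required_tags=["team", "service", "environment", "cost-center"],
        actions=["ec2:RunInstances"],
        resources=["arn:aws:ec2:*:*:instance/*"],
    )
    print(json.dumps(policy, indent=2))
```

Generating the statements keeps the SCP in step with the tagging policy when a required tag is added or removed.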
#!/usr/bin/env python3
"""rightsizing.py -- Analyze EC2 instances for rightsizing opportunities."""
import boto3
from datetime import datetime, timedelta, timezone
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 40  # percent -- instances below this are candidates
LOOKBACK_DAYS = 14
MIN_DATAPOINTS = 100  # Require sufficient data (14 days of hourly points is at most 336)


@dataclass
class RightsizeRecommendation:
    instance_id: str
    instance_type: str
    avg_cpu: float
    max_cpu: float
    avg_memory: float  # Requires CloudWatch agent; -1.0 means not collected
    recommended_type: str
    monthly_savings: float
    confidence: str  # "high" | "medium" | "low"


# Instance family downsizing map (simplified).
# "large" is the smallest size in the m5, c5, and r5 families, so those have no entry.
DOWNSIZE_MAP = {
    "m5.2xlarge": "m5.xlarge",
    "m5.xlarge": "m5.large",
    "c5.2xlarge": "c5.xlarge",
    "c5.xlarge": "c5.large",
    "r5.2xlarge": "r5.xlarge",
    "r5.xlarge": "r5.large",
    "t3.xlarge": "t3.large",
    "t3.large": "t3.medium",
}

# Approximate on-demand hourly pricing (us-east-1)
HOURLY_PRICING = {
    "m5.2xlarge": 0.384, "m5.xlarge": 0.192, "m5.large": 0.096,
    "c5.2xlarge": 0.340, "c5.xlarge": 0.170, "c5.large": 0.085,
    "r5.2xlarge": 0.504, "r5.xlarge": 0.252, "r5.large": 0.126,
    "t3.xlarge": 0.1664, "t3.large": 0.0832, "t3.medium": 0.0416,
}


def get_cpu_utilization(cw_client, instance_id: str) -> tuple[float, float]:
    """Return (avg_cpu, max_cpu) over the lookback period, or (-1, -1) if data is sparse."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=LOOKBACK_DAYS)
    response = cw_client.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=3600,  # 1-hour intervals
        Statistics=["Average", "Maximum"],
    )
    datapoints = response.get("Datapoints", [])
    if len(datapoints) < MIN_DATAPOINTS:
        return -1.0, -1.0  # Insufficient data
    avg = sum(dp["Average"] for dp in datapoints) / len(datapoints)
    peak = max(dp["Maximum"] for dp in datapoints)
    return avg, peak


def analyze_fleet() -> list[RightsizeRecommendation]:
    """Analyze all running EC2 instances for rightsizing."""
    ec2 = boto3.client("ec2")
    cw = boto3.client("cloudwatch")
    recommendations = []

    # Get all running instances
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_id = instance["InstanceId"]
                instance_type = instance["InstanceType"]

                # Skip instances not in our downsize map
                if instance_type not in DOWNSIZE_MAP:
                    continue

                avg_cpu, max_cpu = get_cpu_utilization(cw, instance_id)
                if avg_cpu < 0:
                    continue  # Insufficient data
                if avg_cpu >= UTILIZATION_THRESHOLD:
                    continue  # Busy enough; leave as is

                recommended = DOWNSIZE_MAP[instance_type]
                current_cost = HOURLY_PRICING.get(instance_type)
                new_cost = HOURLY_PRICING.get(recommended)
                if current_cost is None or new_cost is None:
                    continue  # No price data; skip rather than report inflated savings

                monthly_savings = (current_cost - new_cost) * 730  # ~hours per month
                confidence = "high" if max_cpu < 60 else "medium"
                recommendations.append(RightsizeRecommendation(
                    instance_id=instance_id,
                    instance_type=instance_type,
                    avg_cpu=round(avg_cpu, 1),
                    max_cpu=round(max_cpu, 1),
                    avg_memory=-1.0,  # Memory is not collected by this script
                    recommended_type=recommended,
                    monthly_savings=round(monthly_savings, 2),
                    confidence=confidence,
                ))

    # Sort by savings potential (highest first)
    recommendations.sort(key=lambda r: r.monthly_savings, reverse=True)
    return recommendations


if __name__ == "__main__":
    recs = analyze_fleet()
    total_savings = sum(r.monthly_savings for r in recs)
    print(f"\n{'='*80}")
    print(f"RIGHTSIZING RECOMMENDATIONS ({len(recs)} instances)")
    print(f"Potential monthly savings: ${total_savings:,.2f}")
    print(f"{'='*80}\n")
    for r in recs:
        print(f"  {r.instance_id}: {r.instance_type} -> {r.recommended_type}")
        print(f"    CPU avg={r.avg_cpu}% max={r.max_cpu}%")
        print(f"    Savings: ${r.monthly_savings:,.2f}/mo  Confidence: {r.confidence}")
        print()
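The decision rule in the script can also be exercised without AWS credentials. The sketch below isolates it as a pure function, assuming trimmed copies of the downsize and pricing tables; `decide` is a hypothetical name, and the 40% and 60% cut-offs mirror the script's constants.

```python
from typing import Optional

# Hypothetical, trimmed copies of the tables used by rightsizing.py.
DOWNSIZE_MAP = {"m5.2xlarge": "m5.xlarge", "m5.xlarge": "m5.large"}
HOURLY_PRICING = {"m5.2xlarge": 0.384, "m5.xlarge": 0.192, "m5.large": 0.096}
HOURS_PER_MONTH = 730


def decide(instance_type: str, avg_cpu: float, max_cpu: float,
           threshold: float = 40.0) -> Optional[tuple[str, float, str]]:
    """Return (recommended_type, monthly_savings, confidence), or None to keep as is."""
    target = DOWNSIZE_MAP.get(instance_type)
    if target is None or avg_cpu >= threshold:
        return None
    current = HOURLY_PRICING.get(instance_type)
    new = HOURLY_PRICING.get(target)
    if current is None or new is None:
        return None  # Unknown price: no recommendation rather than a guess
    savings = round((current - new) * HOURS_PER_MONTH, 2)
    confidence = "high" if max_cpu < 60 else "medium"
    return target, savings, confidence


if __name__ == "__main__":
    print(decide("m5.2xlarge", avg_cpu=12.0, max_cpu=35.0))
    print(decide("m5.2xlarge", avg_cpu=12.0, max_cpu=85.0))
    print(decide("m5.2xlarge", avg_cpu=55.0, max_cpu=90.0))
```

Keeping the rule separate from the boto3 calls makes it straightforward to unit-test threshold changes before running them against a live fleet.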
# cloudformation/budget-alerts.yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: FinOps budget alerts with multi-threshold notifications

Parameters:
  TeamName:
    Type: String
    Description: Team name for cost allocation
  MonthlyBudget:
    Type: Number
    Description: Monthly budget in USD
  AlertEmail:
    Type: String
    Description: Email for budget notifications
  SlackWebhookArn:
    Type: String
    Description: ARN of Lambda that posts to Slack

Resources:
  # SNS topic for budget alerts.
  # Not shown: an AWS::SNS::TopicPolicy allowing budgets.amazonaws.com to publish,
  # and an AWS::Lambda::Permission allowing SNS to invoke the Slack function.
  BudgetAlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub "${TeamName}-budget-alerts"
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmail
        - Protocol: lambda
          Endpoint: !Ref SlackWebhookArn

  # Monthly budget with progressive thresholds
  TeamBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: !Sub "${TeamName}-monthly-budget"
        BudgetLimit:
          Amount: !Ref MonthlyBudget
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
        CostFilters:
          TagKeyValue:
            - !Sub "user:team$${TeamName}"
      NotificationsWithSubscribers:
        # 50% threshold -- informational
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 50
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
        # 80% threshold -- warning
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
        # 100% threshold -- critical
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
        # Forecasted to exceed -- early warning
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
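  # CloudWatch alarm on estimated charges.
  # AWS/Billing EstimatedCharges is a cumulative month-to-date total for the
  # whole account (published in us-east-1 only), so a static threshold cannot
  # express "2x the daily average". This alarm fires when the month-to-date
  # total exceeds the monthly budget; per-day spike detection needs the daily
  # deltas of this metric or AWS Cost Anomaly Detection.
  EstimatedChargesAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TeamName}-estimated-charges"
      AlarmDescription: "Month-to-date estimated charges exceed the monthly budget"
      Namespace: AWS/Billing
      MetricName: EstimatedCharges
      Dimensions:
        - Name: Currency
          Value: USD
      Statistic: Maximum
      Period: 86400  # 24 hours
      EvaluationPeriods: 1
      Threshold: !Ref MonthlyBudget
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref BudgetAlertTopic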
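The spike rule mentioned in the template, (monthly budget / 30 days) * 2, applies to per-day spend, while billing metrics such as `EstimatedCharges` report a cumulative month-to-date total. A minimal sketch of the arithmetic follows, assuming one month-to-date reading per day is already available as a list; the function names and sample numbers are hypothetical.

```python
def daily_spike_threshold(monthly_budget: float, days: int = 30,
                          spike_factor: float = 2.0) -> float:
    """Daily spend above this value counts as a spike."""
    return monthly_budget / days * spike_factor


def daily_deltas(month_to_date: list[float]) -> list[float]:
    """Convert cumulative month-to-date totals into per-day spend."""
    deltas = []
    previous = 0.0
    for total in month_to_date:
        # A drop means the month rolled over and the running total reset.
        deltas.append(total if total < previous else total - previous)
        previous = total
    return deltas


def find_spikes(month_to_date: list[float],
                monthly_budget: float) -> list[tuple[int, float]]:
    """Return (day_index, spend) pairs whose daily spend exceeds the threshold."""
    threshold = daily_spike_threshold(monthly_budget)
    return [(i, spend) for i, spend in enumerate(daily_deltas(month_to_date))
            if spend > threshold]


if __name__ == "__main__":
    # Assumed readings: one cumulative total per day, in USD.
    readings = [100.0, 210.0, 300.0, 620.0, 700.0]
    print(find_spikes(readings, monthly_budget=3000.0))
```

With a $3,000 budget the threshold is $200 per day, so only the jump from 300 to 620 is flagged.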
# kubernetes/spot-node-pool.yaml -- EKS managed node group with spot
apiVersion: eksctl.io/v1alpha5