Analyzes technical systems and problems through engineering lens using first principles, systems thinking,
design methodologies, and optimization frameworks.
Provides insights on feasibility, performance, reliability, scalability, and trade-offs.
Use when: System design, technical feasibility, optimization, failure analysis, performance issues.
Evaluates: Requirements, constraints, trade-offs, efficiency, robustness, maintainability.
Analyze technical systems, problems, and designs through the disciplinary lens of engineering, applying established frameworks (systems engineering, design thinking, optimization theory), multiple methodological approaches (first principles analysis, failure mode analysis, design of experiments), and evidence-based practices to understand how systems work, why they fail, and how to design reliable, efficient, and scalable solutions.
When to Use This Skill
System Design: Architect new systems, subsystems, or components with clear requirements
Technical Feasibility: Assess whether proposed solutions are technically viable
Performance Optimization: Improve speed, efficiency, throughput, or resource utilization
Failure Analysis: Diagnose why systems fail and prevent recurrence
Trade-off Analysis: Evaluate competing design options with multiple constraints
Scalability Assessment: Determine whether systems can grow to meet future demands
Requirements Engineering: Clarify, decompose, and validate technical requirements
Reliability Engineering: Design for high availability, fault tolerance, and resilience
Core Philosophy: Engineering Thinking
Engineering analysis rests on several fundamental principles:
First Principles Reasoning: Break complex problems down to fundamental truths and reason up from there. Don't rely on analogy or convention when fundamentals matter.
Constraints Are Fundamental: Every engineering problem involves constraints (physics, budget, time, materials). Design happens within constraints, not despite them.
Trade-offs Are Inevitable: No design optimizes everything. Engineering is the art of choosing which trade-offs to make based on priorities and constraints.
Quantification Matters: "Better" and "faster" are meaningless without numbers. Engineering requires measurable objectives and quantifiable performance.
Systems Thinking: Components interact in complex ways. Local optimization can harm global performance. Always consider the whole system.
Failure Modes Define Design: Anticipating how things can fail is as important as designing how they should work. Robust systems account for failure modes explicitly.
Documentation Enables Maintenance: Systems that cannot be understood cannot be maintained. Clear documentation is an engineering deliverable, not an afterthought.
Theoretical Foundations (Expandable)
Foundation 1: First Principles Analysis
Core Principles:
Break problems down to fundamental physical laws, constraints, and truths
Reason up from foundations rather than by analogy or precedent
Question assumptions and conventional wisdom
Rebuild understanding from ground up
Identify true constraints vs. artificial limitations
Key Insights:
Analogies can mislead when contexts differ fundamentally
Conventional approaches may be path-dependent, not optimal
True constraints (physics, mathematics) vs. historical constraints (how things have been done)
First principles enable breakthrough innovations by questioning inherited assumptions
Computational limits, thermodynamic limits, information-theoretic limits are real boundaries
Famous Practitioner: Elon Musk
Approach: "Boil things down to their fundamental truths and reason up from there"
Example: Rocket cost analysis - question inherited aerospace pricing assumptions, rebuild from material costs
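The cost-analysis pattern can be sketched as a comparison between a product's market price and the summed cost of its raw materials. All numbers below are hypothetical placeholders, not real aerospace figures:

```python
# First-principles cost sketch: how much of a product's market price is
# explained by raw material cost? A low fraction suggests the rest is
# inherited process cost that first-principles redesign might attack.
# All figures are hypothetical, for illustration only.

def material_cost_fraction(market_price: float, materials: dict) -> float:
    """Fraction of the market price accounted for by raw materials."""
    raw_cost = sum(materials.values())
    return raw_cost / market_price

# Hypothetical bill of materials ($) for some manufactured product:
materials = {"aluminum": 20_000, "titanium": 15_000,
             "electronics": 40_000, "fuel": 5_000}
fraction = material_cost_fraction(market_price=2_000_000, materials=materials)
print(f"{fraction:.1%}")  # → 4.0% -- materials are a small slice of price
```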
Monolith Characteristics:
Pros: Simple deployment, easier debugging, no network latency between modules, single database transactions
Cons: All-or-nothing deploys, scaling requires scaling entire app, merge conflicts increase with team size, technology lock-in
Microservices Characteristics:
Pros: Independent deployment and scaling, technology flexibility, team autonomy, fault isolation
Cons: Distributed system complexity (eventual consistency, partial failures), operational overhead (more services to monitor), network latency, more difficult debugging
Trade-off Analysis:

| Criterion | Monolith | Microservices | Weight | Score M | Score MS |
| --- | --- | --- | --- | --- | --- |
| Dev Velocity (small team) | High | Low | 0.3 | 9 | 4 |
| Dev Velocity (large team) | Low | High | 0.25 | 4 | 8 |
| Scalability | Poor | Excellent | 0.2 | 3 | 9 |
| Operational Complexity | Low | High | 0.15 | 8 | 3 |
| Reliability | Medium | Medium | 0.1 | 6 | 6 |
| Weighted Score (today) | | | | 6.75 | 5.5 |
| Weighted Score (2 yrs) | | | | 5.35 | 6.85 |
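The weighted scores come from a standard weighted-sum decision matrix. A minimal sketch of the mechanism, using hypothetical criteria, weights, and scores (not the exact normalization of the table):

```python
# Weighted multi-criteria scoring: each alternative gets a 1-10 score per
# criterion; criteria weights must sum to 1. Data below is illustrative.

def weighted_score(scores: dict, weights: dict) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[c] * scores[c] for c in weights)

weights = {"velocity": 0.4, "scalability": 0.3, "ops_simplicity": 0.3}
monolith = {"velocity": 9, "scalability": 3, "ops_simplicity": 8}
microservices = {"velocity": 4, "scalability": 9, "ops_simplicity": 3}

print(round(weighted_score(monolith, weights), 2))       # → 6.9
print(round(weighted_score(microservices, weights), 2))  # → 5.2
```

Re-running the same matrix with different weights (e.g., shifting weight from velocity to scalability as the team grows) is how the "today" vs. "2 yrs" scores above flip.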
First Principles Analysis:
Conway's Law: System structure mirrors communication structure
Network calls are orders of magnitude slower than in-process calls
Distributed transactions are hard; eventual consistency is complex but scales
Coordination overhead grows with team size
Recommendation:
Stay monolith short-term (next 6-12 months)
Prepare for transition:
Enforce module boundaries within monolith
Design for async communication patterns
Build monitoring and observability infrastructure
Document domain boundaries
Extract strategically (12-24 months):
Start with independently scalable components (e.g., image processing)
Keep core business logic together initially
Avoid premature decomposition
Criteria for extraction: Extract when (a) clear domain boundary, (b) different scaling needs, (c) team wants autonomy, (d) release independence valuable
Key Insight: Microservices are optimization for organizational scaling, not just technical scaling. Premature microservices slow small teams; delayed microservices bottleneck large teams.
Larger (20 MB) but essential for search functionality
Deferred:
Multi-attribute filter queries (10% traffic) - acceptable to be slower
Can add later if specific combinations prove common
Optimization Strategy:
Add indexes 1 and 2 immediately (biggest impact)
Monitor query performance for 1 week
Add full-text index if search traffic grows
Use query explain plans to verify index usage
Expected Results:
Category + price: 2.3s → 0.05s (46x faster)
Brand + availability: 1.8s → 0.04s (45x faster)
Write throughput: -10% (acceptable trade-off)
Storage overhead: +8 MB (+0.8%)
Validation:
Load test with production traffic distribution
Monitor p95/p99 latencies, not just averages
Set up alerting for slow queries
Key Insight: Index design requires understanding query patterns from actual usage, not guessing. Composite indexes are powerful but order matters. Write amplification means you can't index everything.
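The "verify index usage with explain plans" step can be demonstrated end-to-end with SQLite; table, column, and index names below are illustrative:

```python
# Sketch: verify a composite index is actually used via EXPLAIN QUERY PLAN.
# Column order in the index matters: (category, price) serves category-only
# and category+price lookups, but NOT price-only lookups.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price REAL)"
)
conn.execute("CREATE INDEX idx_cat_price ON products (category, price)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM products WHERE category = ? AND price < ?",
    ("books", 20.0),
).fetchall()
print(plan[0])  # the detail column should mention idx_cat_price
```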
Why didn't tests catch it?
→ Test database had new schema; production had old schema
Why did schema differ?
→ Migration ran immediately on deploy; gradual rollout not possible
Why couldn't we roll back?
→ Migration was irreversible (dropped column); no rollback procedure tested
Root Causes Identified:
Tight coupling: Code deploy coupled to database migration
Test environment drift: Test database not representative of production
Irreversible migration: No rollback plan
Slow detection: 30 minutes to page engineer
Insufficient monitoring: Error rates not broken down by service
Failure Mode Analysis:
Contributing Factors:
Process: No staged rollout (deployed to 100% immediately)
Technology: No feature flags to disable problematic code path
People: Deployment at 2am with minimal staffing
Monitoring: Alerts tuned too high (12% errors before alerting)
Single Points of Failure:
Single payment processing service (no fallback)
Database schema migration in critical path
One on-call engineer (no backup)
Recommended Mitigations:
Immediate (1 week):
1. Decouple migrations: Separate schema changes from code deploys
   - Deploy backward-compatible schema first
   - Deploy code using new schema
   - Remove old schema in later migration (if needed)
2. Canary deployments: Deploy to 5% of traffic, monitor 30 min, proceed gradually
   - Automated rollback if error rate threshold exceeded
3. Feature flags: Wrap new code paths in flags for instant disable
4. Alert tuning: Page at 5% error rate increase, not 12%
Medium-term (1 month):
5. Chaos engineering: Regularly test failure scenarios in staging
   - Rollback procedures tested weekly
   - Database restoration drills
6. Improved monitoring:
   - Service-level dashboards
   - Distributed tracing for request flows
   - Synthetic monitoring of critical paths
7. Runbooks: Document response procedures for common incidents
Long-term (3 months):
8. Circuit breakers: Graceful degradation when downstream services fail
9. Multi-region redundancy: Failover capability for major outages
10. Blameless post-mortems: Culture of learning from failures
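The canary guardrail described in the mitigations can be sketched as a simple decision rule; the threshold and traffic numbers are illustrative:

```python
# Canary guardrail sketch: compare the canary's error rate against the
# baseline and roll back automatically if it exceeds the allowed increase.
# Threshold (5 percentage points) and request counts are illustrative.

def canary_decision(canary_errors: int, canary_requests: int,
                    baseline_error_rate: float,
                    max_increase: float = 0.05) -> str:
    """Return 'proceed' or 'rollback' for the current canary stage."""
    if canary_requests == 0:
        return "proceed"  # no traffic yet; nothing to judge
    canary_rate = canary_errors / canary_requests
    if canary_rate - baseline_error_rate > max_increase:
        return "rollback"
    return "proceed"

print(canary_decision(3, 1000, baseline_error_rate=0.001))   # → proceed
print(canary_decision(80, 1000, baseline_error_rate=0.001))  # → rollback
```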
FMEA Re-assessment:

| Failure Mode | Severity | Occurrence (Before) | Detection (Before) | RPN (Before) | Occurrence (After) | Detection (After) | RPN (After) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Incompatible code/schema | 9 | 6 | 5 | 270 | 2 | 2 | 36 |
| Failed rollback | 10 | 7 | 8 | 560 | 3 | 2 | 60 |
Key Insight: Most outages result from combinations of small failures, not single catastrophic errors. Defense in depth (staged rollout, feature flags, decoupled migrations, fast detection) prevents cascading failures. Practicing failure scenarios is as important as preventing them.
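The RPN values in the table above follow the standard FMEA formula, Risk Priority Number = Severity × Occurrence × Detection (each rated 1-10):

```python
# RPN (Risk Priority Number) = Severity x Occurrence x Detection.
# Values are taken directly from the FMEA re-assessment table above.

def rpn(severity: int, occurrence: int, detection: int) -> int:
    return severity * occurrence * detection

print(rpn(9, 6, 5))   # → 270  incompatible code/schema, before mitigations
print(rpn(9, 2, 2))   # → 36   after mitigations (severity unchanged)
print(rpn(10, 7, 8))  # → 560  failed rollback, before mitigations
print(rpn(10, 3, 2))  # → 60   after mitigations
```

Note that mitigations reduce occurrence and detection scores, not severity: the failure would still be severe, but it happens less often and is caught faster.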
When using the engineer-analyst skill, follow this systematic 9-step process:
Step 1: Clarify Requirements and Constraints
What is the technical objective? (Performance? Reliability? Cost? Scale?)
What are hard constraints? (Physics, budget, timeline, compatibility)
What are the priorities when trade-offs are inevitable?
Step 2: Gather System Context
How does the current system work? (Architecture, technologies, interfaces)
What are usage patterns? (Load profiles, user behaviors, edge cases)
What are existing performance characteristics and bottlenecks?
Step 3: First Principles Analysis
Break problem down to fundamental truths
Question assumptions and conventional approaches
Identify true constraints vs. inherited limitations
Calculate theoretical limits where applicable
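One classic theoretical-limit calculation is Amdahl's law, which bounds the speedup achievable by parallelizing a workload where only a fraction p of the work parallelizes:

```python
# Amdahl's law: with parallelizable fraction p and n processors, the
# serial fraction (1 - p) caps the overall speedup at 1 / (1 - p).

def amdahl_speedup(p: float, n: float) -> float:
    """Maximum speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# If 90% of the work parallelizes, even unlimited processors cap out at 10x:
print(round(amdahl_speedup(0.9, 16), 2))    # → 6.4 on 16 processors
print(round(amdahl_speedup(0.9, 1e12), 2))  # effectively infinite n -> ~10.0
```

This is the sense in which "computational limits are real boundaries": no amount of hardware beats the serial fraction.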
Step 4: Enumerate Alternatives
What design options exist?
Include status quo as baseline for comparison
Consider both incremental improvements and radical redesigns
Note which alternatives violate hard constraints (discard those)
Step 5: Model and Estimate
Quantify expected performance of alternatives
Use back-of-envelope calculations, queueing theory, prototypes
Identify uncertainties and sensitivity to assumptions
Build simplified models before complex simulations
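A back-of-envelope queueing estimate of the kind mentioned above can be done with the M/M/1 model (Poisson arrivals at rate lam, one server with exponential service at rate mu); the traffic numbers are illustrative:

```python
# M/M/1 back-of-envelope: utilization rho = lam / mu, and mean time in
# system W = 1 / (mu - lam). Valid only for a stable queue (lam < mu).

def mm1_stats(lam: float, mu: float):
    """Return (utilization, mean time in system) for a stable M/M/1 queue."""
    assert lam < mu, "queue is unstable when arrival rate >= service rate"
    rho = lam / mu        # fraction of time the server is busy
    w = 1.0 / (mu - lam)  # mean latency, including time spent queueing
    return rho, w

# 80 req/s arriving at a server that can handle 100 req/s:
rho, w = mm1_stats(lam=80, mu=100)
print(f"utilization={rho:.0%}, mean latency={w * 1000:.0f} ms")  # → 80%, 50 ms
```

Note the nonlinearity: at 80% utilization latency is already 5x the bare service time, and it diverges as lam approaches mu. This is why a quick model often beats intuition.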
Step 6: Trade-off Analysis
Score alternatives against multiple objectives
Identify Pareto-optimal designs
Assess sensitivity to priorities (what if weights change?)
Consider robustness vs. optimality trade-off
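Identifying Pareto-optimal designs, as the steps above call for, can be sketched as a dominance filter; the design names and objective scores below are hypothetical:

```python
# Pareto front sketch: an option is dominated if some other option is at
# least as good on every objective and strictly better on at least one.
# Objectives here are "higher is better"; data is illustrative.

def pareto_front(options: dict) -> set:
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    return {
        name for name, scores in options.items()
        if not any(dominates(other, scores)
                   for o_name, other in options.items() if o_name != name)
    }

# (throughput_score, reliability_score) for three hypothetical designs:
options = {"A": (9, 4), "B": (6, 8), "C": (5, 3)}  # C is dominated by B
print(sorted(pareto_front(options)))  # → ['A', 'B']
```

Only options on the front need detailed weighting; dominated options can be discarded regardless of how priorities shift.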
Step 7: Failure Mode Analysis
How can each alternative fail?
What are consequences of failures?
Can failures be detected quickly?
What mitigation strategies exist?
Step 8: Prototype and Validate
Build minimal prototypes to test key assumptions
Measure actual performance (don't rely solely on estimates)
Validate with realistic data and usage patterns
Iterate based on learnings
Step 9: Document and Communicate
State recommendation with clear justification
Present trade-offs transparently
Document assumptions and sensitivities
Provide fallback options if recommendation proves infeasible
Quality Standards
A thorough engineering analysis includes:
✓ Clear requirements: Objectives, constraints, and priorities specified quantitatively
✓ Baseline measurements: Current system performance documented with numbers
✓ Multiple alternatives: At least 3 options considered, including status quo
✓ Quantified estimates: Performance, cost, and reliability estimated numerically
✓ Trade-off analysis: Multi-objective scoring with explicit priorities
✓ Failure analysis: FMEA or similar systematic failure mode identification
✓ Validation plan: How will we verify design meets requirements?
✓ Assumptions documented: Sensitivities to key assumptions noted
✓ Scalability considered: Will design work at 10x scale?
✓ Maintainability assessed: Can others understand and modify this design?
Common Pitfalls to Avoid
Premature optimization: Optimizing before measuring creates complexity without benefit. Measure first, optimize bottlenecks.
Over-engineering: Designing for scale you'll never reach wastes resources. Start simple, scale when needed.
Under-engineering: Ignoring known future requirements creates costly rewrites. Balance current simplicity with anticipated needs.
Analysis paralysis: Endless analysis without building delays learning. Prototype early to validate assumptions.
Not invented here: Rejecting existing solutions in favor of custom builds. Prefer boring proven technology.
Resume-driven development: Choosing technologies for career benefit rather than project fit. Choose right tool for job.
Ignoring operational costs: Focusing on development cost while ignoring ongoing infrastructure, maintenance, and support costs.
Cargo culting: Copying approaches without understanding context. What works for Google may not work for your startup.
Assuming zero failure rate: All systems fail. Design for graceful degradation, not perfection.
Ignoring human factors: Systems will be operated by humans. Design for usability and operability, not just technical elegance.