AI Engineering principles and decision-making for ML, DL, RL, and DRL. Framework selection, model architecture, training patterns, evaluation strategies, and deployment. Suitable from beginner to expert level. Use when working with machine learning, deep learning, reinforcement learning, model training, AI deployment, or MLOps tasks.
Systematic approach to building AI systems from research to production.
Choose the right framework based on your needs:
| Need | Framework | Why |
|---|---|---|
| Research & prototyping | PyTorch | Dynamic graphs, pythonic, easy debugging |
| Production at scale | TensorFlow | Mature ecosystem, TF Serving, TFLite |
| High performance | JAX | JIT compilation, functional programming |
| Traditional ML | scikit-learn | Simple API, comprehensive algorithms |
| Quick start | Keras | High-level, beginner-friendly |
| Mobile/Edge | TensorFlow Lite | Optimized for resource-constrained devices |
For detailed framework comparison, see frameworks.md
```
Dataset size < 10k rows?
├─ Yes → Traditional ML (scikit-learn)
└─ No → Consider deep learning

Tabular data?
├─ Yes → XGBoost, LightGBM, CatBoost
└─ No (images, text, audio) → Deep learning

Need interpretability?
├─ Yes → Decision trees, linear models
└─ No → Deep learning acceptable

Computational resources limited?
├─ Yes → Traditional ML or small neural networks
└─ No → Large deep learning models
```
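The decision trees above can be folded into one helper. This is a minimal sketch of one reasonable way to combine them (the function name, argument names, and the ordering of checks are illustrative choices, not a standard API):

```python
def choose_model_family(n_rows, tabular, need_interpretability, limited_compute):
    """Suggest a model family by walking the decision trees above.

    Interpretability is checked first because it constrains the model class
    regardless of data size; small data and limited compute both point to
    traditional ML; large tabular data points to gradient boosting.
    """
    if need_interpretability:
        return "decision trees / linear models"
    if n_rows < 10_000 or limited_compute:
        return "traditional ML (scikit-learn)"
    if tabular:
        return "gradient boosting (XGBoost / LightGBM / CatBoost)"
    return "deep learning"
```

For example, a 5,000-row tabular dataset lands on traditional ML, while the same data at 100,000 rows lands on gradient boosting.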
| Data Type | Task | Recommended Architecture |
|---|---|---|
| Images | Classification | ResNet, EfficientNet, ViT |
| Images | Object Detection | YOLOv8 (speed), Faster R-CNN (accuracy) |
| Images | Segmentation | U-Net, Mask R-CNN |
| Text | Classification | BERT, RoBERTa |
| Text | Generation | GPT, T5, BART |
| Text | Translation | T5, MarianMT |
| Sequences | Time Series | LSTM, Temporal CNN, Transformer |
| Sequences | Speech | Wav2Vec 2.0, Whisper |
| Tabular | Classification/Regression | XGBoost, LightGBM, Neural Networks |
For architecture details and variants, see architectures.md
1. Data Preparation
2. Model Selection
3. Training
4. Evaluation
5. Optimization (if needed)
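The workflow above can be sketched as a skeleton where each stage is an injected callable. This is an illustrative shape, not a library API; `run_experiment` and the `target` threshold are made-up names for the sketch:

```python
def run_experiment(prepare, select, train, evaluate, optimize=None, target=0.9):
    """Run the workflow above: prepare -> select -> train -> evaluate,
    with an optional optimization pass if the score misses the target."""
    data = prepare()
    model = select(data)
    model = train(model, data)
    score = evaluate(model, data)
    if optimize is not None and score < target:
        model = optimize(model, data)
        score = evaluate(model, data)
    return model, score
```

Structuring the loop this way keeps each stage swappable, so you can change the model or the tuning strategy without touching the rest of the pipeline.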
| Method | When to Use |
|---|---|
| Grid Search | Small search space (< 10 combinations) |
| Random Search | Medium space (< 100 combinations) |
| Bayesian Optimization | Expensive training, continuous parameters |
| Population-based | Very large models, parallel resources |
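Random search, the table's pick for medium-sized spaces, fits in a few lines of stdlib Python. A minimal sketch (the toy objective below is invented purely to have something to score; in practice `train_and_score` would train and validate a model):

```python
import random

def random_search(train_and_score, space, n_trials=20, seed=0):
    """Sample random hyperparameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {"lr": [0.001, 0.01, 0.1], "depth": [2, 4, 6, 8]}
# Toy objective: score peaks at lr=0.1, depth=6.
best, score = random_search(lambda p: -abs(p["lr"] - 0.1) - abs(p["depth"] - 6), space)
```

Grid search would instead enumerate all 12 combinations exhaustively, which is why the table caps it at small spaces.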
Use RL when the problem is sequential decision-making with a reward signal and an environment the agent can interact with. Then pick an algorithm by action space and how much sample efficiency matters:
| Action Space | Sample Efficiency | Algorithm |
|---|---|---|
| Discrete | Low priority | DQN, Rainbow |
| Discrete | High priority | SAC (discrete version) |
| Continuous | Low priority | PPO |
| Continuous | High priority | SAC, TD3 |
| Need stability | - | PPO (most stable) |
For RL/DRL implementation details, see rl-drl.md
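All the value-based algorithms in the table (DQN and its variants) build on the tabular Q-learning update. A minimal sketch of that single step, with the Q-table as a plain dict of dicts (the helper name and data layout are illustrative):

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Unknown next states are treated as terminal (zero future value)."""
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]
```

DQN replaces the table with a neural network and adds replay buffers and target networks for stability, but the temporal-difference target is the same.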
| Task | Primary Metrics | When to Use Others |
|---|---|---|
| Binary Classification | F1, AUC-ROC | Precision (false positives matter), Recall (false negatives matter) |
| Multi-class | Macro F1, Accuracy | Per-class F1 (imbalanced), Confusion matrix (error analysis) |
| Regression | MSE, MAE | R² (goodness of fit), MAPE (percentage error) |
| Object Detection | mAP | IoU thresholds, per-class AP |
| RL | Cumulative reward | Episode length, success rate |
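The binary-classification metrics in the table are simple counts over true/predicted labels. A self-contained sketch (the function name is illustrative; in practice you would use `sklearn.metrics`):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class.

    Precision penalizes false positives, recall penalizes false negatives,
    and F1 is their harmonic mean -- matching the table's guidance on
    when each metric matters."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Macro F1 for the multi-class row is just this computation averaged over classes, which is why it treats rare and common classes equally.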
```
Data size < 1000 samples?
├─ Yes → K-fold cross-validation (k=5 or 10)
└─ No → Single train/val/test split

Time series data?
├─ Yes → Time-based splits (no shuffle!)
└─ No → Random or stratified split

Imbalanced classes?
├─ Yes → Stratified split
└─ No → Random split
```
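A stratified split shuffles within each class so class proportions survive the split. A stdlib sketch of the idea (in practice `sklearn.model_selection.train_test_split` with `stratify=` does this; the helper below is illustrative):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) preserving each class's proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)          # shuffle within the class only
        cut = int(round(len(idxs) * test_frac))
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)
```

Note this is exactly what the time-series branch forbids: for temporal data the test set must come strictly after the training set, with no shuffling at all.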
```
Where will model run?
├─ Cloud → API serving (TF Serving, TorchServe)
├─ Edge/Mobile → Model compression + TFLite/ONNX
├─ Browser → TensorFlow.js
└─ Batch → Scheduled jobs

Latency requirements?
├─ Real-time (< 100ms) → Optimize model, use caching
├─ Interactive (< 1s) → Standard serving
└─ Batch (minutes/hours) → Batch processing

Scale?
├─ High traffic → Kubernetes + auto-scaling
├─ Medium → Cloud Run, Lambda
└─ Low → Simple API server
```
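The real-time branch suggests caching: when the same inputs recur, memoizing predictions skips the forward pass entirely. A minimal sketch using the stdlib (`time.sleep` and `sum` stand in for a real model; features must be hashable, e.g. a tuple):

```python
from functools import lru_cache
import time

@lru_cache(maxsize=4096)
def cached_predict(features):
    """Serve repeated requests from an in-process cache."""
    time.sleep(0.01)      # stand-in for a model forward pass
    return sum(features)  # stand-in for the model's prediction

cached_predict((1.0, 2.0))  # cache miss: runs the "model"
cached_predict((1.0, 2.0))  # cache hit: returns instantly
```

In production this same idea usually lives in an external cache such as Redis, keyed on a hash of the features, so that all replicas of the serving API share it.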
Before deployment, confirm that model versioning, input validation, monitoring, and a rollback plan are in place.
For MLOps and production deployment details, see mlops.md
Use transfer learning when: labeled data is limited, the target domain resembles the pretraining domain, or the compute budget is tight.
Train from scratch when: data is abundant, the domain or input format is novel, or the architecture must be custom.
Common pretrained starting points:
- Vision: ImageNet-pretrained ResNet, EfficientNet, ViT
- NLP: BERT, RoBERTa, or GPT-family checkpoints
- Time Series: pretrained checkpoints are scarce; transfer from related series or train from scratch
| Problem | Solution |
|---|---|
| Overfitting | Regularization (dropout, L2), more data, simpler model |
| Underfitting | Larger model, more features, less regularization |
| Slow training | Larger batch size, better optimizer (Adam), learning rate tuning |
| Unstable training | Lower learning rate, gradient clipping, batch normalization |
| Poor generalization | Data augmentation, cross-validation, domain adaptation |
| Class imbalance | Class weights, resampling, proper metrics (F1, not accuracy) |
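The "unstable training" row lists gradient clipping. The common global-norm variant rescales all gradients together so their combined norm never exceeds a threshold, which prevents a single exploding step. A stdlib sketch over a flat list of gradient values (frameworks provide this built in, e.g. `torch.nn.utils.clip_grad_norm_`):

```python
import math

def clip_gradients(grads, max_norm=1.0):
    """Global-norm clipping: if ||g|| > max_norm, scale every gradient
    by max_norm / ||g|| so the direction is preserved but the step shrinks."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```

Because every component is scaled by the same factor, the update direction is unchanged; only its magnitude is capped.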
Philosophy: Start simple, measure everything, iterate based on data. The best model is the one that solves the problem with minimum complexity.