Use when porting CUDA-first or Linux-first code to Apple Silicon MLX/Metal. Covers Triton/CUDA kernel porting to MLX Metal (sparse ops, atomics, custom VJP, mx.fast.metal_kernel), PyTorch nn.Module to mlx.nn.Module conversion, 3D Gaussian Splatting rasterization (tile-based rendering, SH evaluation, EWA splatting, GLM convention handling, hybrid autodiff backward), vision foundation models (SAM + CLIP multi-backend with auto-detect factory pattern, mlx_clip, mlx_sam3, PyTorch MPS fallback), IsaacLab simulator backends, Stereolabs ZED stereo cameras, Intel RealSense depth processing (CUDA→MLX filters, point cloud, alignment), NVIDIA Triton Inference Server to MLX (server architecture, mlx-lm integration, OpenAI-compatible API, production hardening), ragged tensor patterns, packaging splits, benchmark harnesses, and macOS install/CI paths. Apply whenever replacing CUDA, Triton, Warp, or Linux assumptions with MLX, Metal, CPU fallback, or mac-native adapters.
Use this skill when extending the IsaacLab MLX/mac-sim port, the zed-sdk-mlx stereo camera port, the realsense-mlx depth processing port, the triton-mlx inference server port, or porting similar CUDA-first / Linux-first systems to Apple Silicon.
Keep every port installable with uv and runnable without Isaac Sim or Linux-only deps.

EVERY TIME you start a port, you MUST check the latest MLX release on GitHub before writing any code:
# Check latest release — DO THIS FIRST
curl -s https://api.github.com/repos/ml-explore/mlx/releases/latest | jq -r '.tag_name, .published_at'
# Or: visit https://github.com/ml-explore/mlx/releases
Why: MLX ships new features and breaking changes frequently (monthly releases). Assuming an old API wastes hours. The nn module surface has expanded dramatically — v0.31+ has modules that v0.22 didn't.
| Package | Version | Release Date | PyPI |
|---|---|---|---|
| mlx | 0.31.1 | 2026-02-28 | pip install "mlx>=0.31.0" |
| mlx-metal | 0.31.1 | 2026-02-28 | Auto-installed with mlx |
| mlx-lm | 0.31.1 | 2026-03-11 | pip install "mlx-lm>=0.31.0" |
These are ALL available as of v0.31. Do NOT reimplement from scratch:
| Category | Available Modules |
|---|---|
| Linear | nn.Linear |
| Convolution | nn.Conv1d, nn.Conv2d, nn.Conv3d |
| Normalization | nn.LayerNorm, nn.BatchNorm, nn.RMSNorm, nn.GroupNorm |
| Recurrent | nn.LSTM, nn.GRU, nn.RNN |
| Attention | nn.MultiHeadAttention, nn.Transformer |
| Pooling | nn.AvgPool2d, nn.MaxPool2d |
| Embedding | nn.Embedding |
| Dropout | nn.Dropout |
| Activations | nn.ReLU, nn.GELU, nn.SiLU, nn.Softmax, nn.Sigmoid, nn.Tanh |
| Positional | nn.RoPE, nn.SinusoidalPositionalEncoding |
| Container | nn.Sequential |
| Optimizers | optim.SGD, optim.Adam, optim.AdamW, optim.RMSprop, optim.Adagrad, optim.Lion |
| LR Schedule | optim.linear_schedule, optim.cosine_decay, optim.step_decay |
Key convention: MLX Conv2d uses NHWC (channels last), NOT NCHW like PyTorch. All image tensors must be transposed.
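A minimal numpy sketch of the layout conversion (MLX itself not required to illustrate the transpose):

```python
import numpy as np

# PyTorch-style NCHW batch: (batch, channels, height, width)
x_nchw = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

# MLX Conv2d expects NHWC: move channels to the last axis
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))
print(x_nhwc.shape)  # (2, 4, 5, 3)

# And back to NCHW for comparing against a PyTorch reference
assert np.array_equal(np.transpose(x_nhwc, (0, 3, 1, 2)), x_nchw)
```

The same `(0, 2, 3, 1)` permutation applies to `mx.transpose` on MLX arrays.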
Classify every blocker into one bucket before editing:
torch.cuda, device placement, RNG, checkpointing, training loops.

Apply the matching replacement:
- Set ComputeBackend=mlx and SimBackend=mac-sim.
- Never import isaacsim, omni.*, carb, pxr, or warp unless that module is explicitly upstream-only.
- Prefer ComputeBackend, KernelBackend, SimBackend, SensorBackend, and PlannerBackend adapters over scattered conditionals.
- Expose --compute-backend, --kernel-backend, and --sim-backend.
- Log decisions in references/progress_log.md.
- macOS AVFoundation does not expose exposureDuration, iso, setExposureModeCustom, deviceWhiteBalanceGains, setWhiteBalanceModeLocked(with:), or setExposureTargetBias. Only mode toggling (auto/locked) works. Do not fake parity.
- Gate UVC sideband controls behind --enable-uvc-controls.
- Use pyramid_factor=2 or higher for live mode.
- Persist live tuner settings at ~/zed/settings/mlx_live_stereo.json (JSON format). The depth demo auto-loads the saved config; CLI flags override.

Grep for forbidden imports:

rg -n "torch\.cuda|import warp|from warp|import omni|from omni|import carb|from carb|import pxr|from pxr|import isaacsim|from isaacsim" <paths>
For ZED port:
rg -n "V4L2|v4l2|linux/videodev|ioctl|VIDIOC_" <paths>
Use uv. Record findings in references/progress_log.md.

macOS AVFoundation API reality (doc-grounded, verified 2026-03-14):
| Control | macOS AVFoundation | Alternative |
|---|---|---|
| Auto/locked exposure toggle | YES (10.7+) | — |
| Auto/locked white balance toggle | YES (10.7+) | — |
| Manual exposure duration/ISO | NO (iOS-only) | IOKit UVC sideband |
| Manual white balance gains | NO (iOS-only) | IOKit UVC sideband |
| Exposure compensation/bias | NO (iOS-only) | IOKit UVC sideband |
| Gain control | NO (nowhere) | IOKit UVC sideband |
| ROI metering | API exists, devices return false | — |
IOKit UVC Sideband Architecture (stretch goal):
AVFoundation ──────→ Video streaming (untouched)
IOKit USB sideband ─→ UVC control requests (exposure, gain, WB)
Gate behind the --enable-uvc-controls flag.

Current baseline: ~14-15 FPS at 720p on ZED 2i / M5 Mac.
Optimization priority (ranked by leverage):
| Rank | Action | Complexity | Expected Gain |
|---|---|---|---|
| 1 | Set refinement_radius=0 for live | Trivial | +5-10 FPS |
| 2 | Confirm pyramid_factor=2 default | Trivial | +8-15 FPS |
| 3 | Async double-buffer MLX pipeline | Medium | +3-5 FPS |
| 4 | SDL2 display instead of cv2.imshow | Low-Medium | +2-4 FPS |
| 5 | Zero-copy SHM read (numpy view) | Low | +0.5-1 FPS |
| 6 | Native Metal viewer process | High | +3-5 FPS |
| 7 | Raw Metal compute shader | Very High | +10-20 FPS |
| 8 | Semaphore SHM signaling | Low-Medium | +1-2 FPS latency |
The old zed_open_capture_depth_tune_stereo (OpenCV SGBM) is replaced by:
- scripts/live_mlx_disparity.py — MLX live tuner with interactive trackbars
- zed_mlx/live_config.py — Config dataclass, JSON serialization, validation
- ~/zed/settings/mlx_live_stereo.json — saved tuner config

| Surface | Classification | Notes |
|---|---|---|
| Raw video capture | Native parity | AVFoundation, validated ZED 2i |
| Sensor capture | Native parity | Same hidapi path |
| Video/sensor sync | macOS replacement | Timestamp-fallback (no hw gate) |
| Calibration loading | Native parity | Same ~/zed/settings/ path |
| Rectification | Native parity | Same OpenCV remap |
| Video example | Native parity | AVFoundation + OpenCV display |
| Multi-camera | macOS replacement | Needs validation |
| Camera control | Capability-gated partial | Mode toggle only |
| Depth example | macOS replacement | MLX disparity (routed) |
| Tune-stereo | macOS replacement | MLX live tuner |
| Point-cloud export | Native parity | MLX + PLY |
| Point-cloud viewer | Deferred | Export only |
| ROS2 wrapper | Deferred | After base stability |
- sample_heights(...)
- surface_normals(...)
- out_of_bounds(...)
- env_ids threading for flattened contact or raycast queries
- full benchmark run
- test_mac_semantic_drift.py after refreshing the baseline
- isaacsim_edge_aware_refine (biggest perf bottleneck)
- cv2.imshow display path (replace with SDL2/Metal)
- bytes() copy in shm_frame.py (zero-copy numpy view)

PYTHONPATH=.:source/isaaclab:source/isaaclab_rl .venv/bin/pytest \
scripts/tools/test/test_bootstrap_isaac_sources.py \
source/isaaclab/test/backends/test_runtime.py \
source/isaaclab/test/backends/test_task_registry.py \
source/isaaclab/test/backends/test_kernel_inventory.py \
source/isaaclab/test/backends/test_portability_utils.py \
source/isaaclab/test/backends/test_planner_compat.py \
source/isaaclab/test/backends/test_ros2_bridge.py \
source/isaaclab/test/backends/test_mac_benchmark_suite.py \
source/isaaclab/test/backends/test_mac_semantic_drift.py \
source/isaaclab/test/backends/test_mac_cartpole.py \
source/isaaclab/test/backends/test_mac_cartpole_showcase.py \
source/isaaclab/test/backends/test_mac_cart_double_pendulum.py \
source/isaaclab/test/backends/test_mac_quadcopter.py \
source/isaaclab/test/backends/test_mac_anymal_c.py \
source/isaaclab/test/backends/test_mac_anymal_c_rough.py \
source/isaaclab/test/backends/test_mac_franka_reach.py \
source/isaaclab/test/backends/test_mac_franka_lift.py \
source/isaaclab/test/backends/test_mac_h1.py \
source/isaaclab_rl/test/test_import_safety.py \
source/isaaclab_rl/test/test_mlx_wrapper.py -q
# Full test suite
cd zed-sdk-mlx && .venv/bin/pytest tests/ -q
# Live smoke (requires connected ZED camera)
make smoke-zed-terminal
make live-mlx-disparity-terminal
# Benchmark
make mlx-stereo-benchmark
make mlx-stereo-eval
- The mac-sim capability snapshot is useful, but task benchmarks should also report the concrete simulator adapter contract for the implemented slice.
- Raise UnsupportedBackendError or UnsupportedRuntimeFeatureError at use sites.
- pyramid_factor=2 with refinement_radius=0 roughly doubles FPS with minimal quality loss.

This section captures battle-tested patterns from porting PointCNN++ (5 Triton kernels, 4 CUDA kernels, full ResNet backbone) to MLX. Use these patterns for any project with custom CUDA/Triton sparse ops.
Classify each kernel before porting:
| Type | Triton/CUDA Pattern | MLX Replacement | Complexity |
|---|---|---|---|
| Elementwise indexed | out[i] = f(a[idx[i]], b[idx[i]]) | Vectorized: a[idx] - b[idx] then reduce | EASY |
| Scatter-reduce | atomicAdd(out[idx], val) | mx.array.at[idx].add(val) or numpy np.add.at | MEDIUM |
| Blocked scatter | Tiled iteration + atomic flush | Python block loop + scatter per block | MEDIUM |
| Segment reduce | Per-segment sum/mean/max/min | repeat_interleave_indices → segment_ids → scatter | MEDIUM |
| Sparse matmul | Gather-multiply-scatter with atomics | Blocked: gather block, contract, scatter-add | HARD |
| Warp-level ops | __shfl_sync, __ballot_sync | Not available in mx.fast.metal_kernel; use shared memory or SIMD groups in C++ extension | HARD |
| Need | MLX API | Notes |
|---|---|---|
| Custom Metal kernel | mx.fast.metal_kernel() | JIT compiled, source as string |
| C++ extension | Subclass mlx::core::Primitive + CMake | For complex kernels needing atomics |
| Atomics in Metal | mlx_atomic<T>, atomic_fetch_add_explicit | Via MLX's atomic.h header |
| Scatter-add | arr.at[indices].add(values) | Returns new array (functional) |
| Custom autograd | @mx.custom_function + .vjp | Replaces torch.autograd.Function |
| nn.Module | mlx.nn.Module with __call__ | Parameters auto-discovered |
| Version check | importlib.metadata.version("mlx") | NOT mlx.__version__ (doesn't exist) |
| GPU sync | mx.eval(tensor) | Required for timing/materialization |
| No int64 scatter | Cast to int32 for GPU scatter ops | MLX Metal backend limitation |
This is the single most important pattern. MVMR, VVOR, and segment reduces all use it.
Problem: Triton/CUDA use atomicAdd to accumulate results from multiple threads into shared output slots. MLX doesn't have atomic scatter in pure Python.
Solution: Blocked iteration with scatter-add
BLOCK = min(T, 4096)
output = mx.zeros((n_o, ...))
for t_start in range(0, T, BLOCK):
t_end = min(t_start + BLOCK, T)
# 1. Gather inputs for this block
a_block = a[a_idx[t_start:t_end]]
b_block = b[b_idx[t_start:t_end]]
o_block = o_idx[t_start:t_end]
# 2. Compute (elementwise/contract/outer product)
products = compute(a_block, b_block)
# 3. Scatter-add to output
output = output.at[o_block].add(products)
return output
Key insight: mx.array.at[indices].add(values) handles duplicate indices correctly — it accumulates. This replaces CUDA atomicAdd without explicit atomics.
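The accumulation semantics are the same as numpy's unbuffered np.add.at, which the table above names as the CPU fallback — a quick demonstration:

```python
import numpy as np

out = np.zeros(4, dtype=np.float32)
idx = np.array([0, 1, 1, 3])
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

# Duplicate index 1 accumulates (2.0 + 3.0), matching atomicAdd semantics
np.add.at(out, idx, vals)
print(out.tolist())  # [1.0, 5.0, 0.0, 4.0]
```

Plain `out[idx] += vals` would NOT accumulate duplicates (last write wins), which is the classic bug when porting atomic scatters.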
Block size tuning:
- Size BLOCK so the intermediate G * M * C product fits comfortably in memory.

Every differentiable sparse op needs a @mx.custom_function wrapper:
@mx.custom_function
def my_sparse_op(x, indices, lengths):
return _forward(x, indices, lengths)
@my_sparse_op.vjp
def my_sparse_op_vjp(primals, cotangents, output):
x, indices, lengths = primals
grad_output = cotangents
# Compute grad_x based on op type:
# SUM: scatter grad_output back to x positions
# MEAN: scatter grad_output / length
# MAX: scatter grad_output * (x == output) mask
grad_x = _backward(grad_output, x, indices, lengths, output)
return (grad_x, None, None) # None for non-differentiable args
Circular dependency handling (MVMR ↔ VVOR): MVMR backward needs VVOR, VVOR backward needs MVMR. Use late imports inside VJP functions:
@my_mvmr.vjp
def mvmr_vjp(primals, cotangents, output):
from pointelligence_mlx.sparse_engines.vvor import sparse_vector_vector_outer_product_reduction
# ... use VVOR for grad_a
MLX doesn't have torch.repeat_interleave. This is the universal replacement:
def repeat_interleave_indices(repeats):
    """[2, 0, 3, 1] → [0, 0, 2, 2, 2, 3]"""
    # Exclusive cumsum gives each segment's start offset in the output
    offsets = mx.cumsum(repeats) - repeats
    output_size = int(mx.sum(repeats).item())
    marker = mx.zeros((output_size,), dtype=mx.int32)
    # +1 at every in-bounds segment start. Empty segments share the next
    # segment's offset, so their IDs are skipped by the cumsum below;
    # trailing empty segments land at offset == output_size and are masked.
    in_bounds = (offsets < output_size).astype(mx.int32)
    safe_offsets = mx.minimum(offsets, output_size - 1)
    marker = marker.at[safe_offsets].add(in_bounds)
    return mx.cumsum(marker) - 1  # 0-indexed segment IDs
This single function powers: segment IDs for reduce, neighbor list expansion, batch index computation, and triplet building.
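A pure-numpy reference of the same marker + cumsum trick is handy as the test oracle (np.repeat gives the ground truth):

```python
import numpy as np

def repeat_interleave_ids(repeats):
    """Numpy reference for the marker + cumsum trick above."""
    repeats = np.asarray(repeats)
    offsets = np.cumsum(repeats) - repeats      # exclusive cumsum
    n = int(repeats.sum())
    marker = np.zeros(n, dtype=np.int64)
    in_bounds = offsets < n                     # drop trailing empty segments
    np.add.at(marker, offsets[in_bounds], 1)
    return np.cumsum(marker) - 1

ids = repeat_interleave_ids([2, 0, 3, 1])
print(ids.tolist())  # [0, 0, 2, 2, 2, 3]
assert np.array_equal(ids, np.repeat(np.arange(4), [2, 0, 3, 1]))
```

Note that np.add.at accumulates at duplicate offsets, which is what makes empty mid-array segments skip an ID rather than break the numbering.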
| PyTorch | MLX | Notes |
|---|---|---|
| nn.Module | mlx.nn.Module | |
| forward() | __call__() | MLX convention |
| nn.Parameter | Direct attribute | Auto-discovered |
| register_buffer | self.name = mx.array(...) | No registration needed |
| module.train()/eval() | No equivalent | Handle BN manually |
| nn.Linear(in, out) | mlx.nn.Linear(in, out) | Direct match |
| nn.BatchNorm1d | Custom RaggedBatchNorm | For variable-length batches |
| F.relu(x) | mlx.nn.relu(x) | |
| nn.Sequential | Custom MultiSequential | For (x, metadata) tuples |
Always use uv for MLX projects:
uv venv .venv --python 3.12
uv pip install -e ".[dev]"
- No --break-system-packages issues

[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta" # NOT _legacy:_Backend
[project]
dependencies = ["mlx>=0.31.0", "numpy>=1.24.0", "scipy>=1.10.0"]
[project.optional-dependencies]
dev = ["pytest>=7.0", "pytest-benchmark>=4.0", "torch>=2.6.0"]
import numpy as np

def check_all_close(mlx_result, reference, atol=1e-5, rtol=1e-5):
"""Compare with relative tolerance scaling (matches PyTorch test patterns)."""
actual = np.array(mlx_result)
expected = np.array(reference) # works with torch.Tensor, np.ndarray, mx.array
scale = max(1.0, np.abs(expected).max())
actual_atol = atol * scale
assert np.allclose(actual, expected, atol=actual_atol, rtol=rtol)
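A quick demonstration of why the tolerance scaling matters (restating the helper so the snippet is self-contained):

```python
import numpy as np

def check_all_close(mlx_result, reference, atol=1e-5, rtol=1e-5):
    actual = np.array(mlx_result)
    expected = np.array(reference)
    scale = max(1.0, np.abs(expected).max())
    assert np.allclose(actual, expected, atol=atol * scale, rtol=rtol)

# A 0.5 absolute error passes at 1e6 scale (scaled atol = 10)...
check_all_close([1e6 + 0.5], [1e6])

# ...but the same absolute error correctly fails at unit scale
try:
    check_all_close([1.5], [1.0])
    raised = False
except AssertionError:
    raised = True
assert raised
```

Without the scaling, large-magnitude reductions (e.g. sparse matmul outputs) fail on benign float32 rounding.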
Default tolerances per op type:
MLX has no built-in gradcheck. Use finite differences:
def check_gradient(fn, args, argnums, eps=1e-4, atol=1e-2):
    grad_fn = mx.grad(fn, argnums=argnums)  # fn must return a scalar
analytical = grad_fn(*args)
# Central finite differences per element (O(N) evals)
# For large tensors: random projection check (O(K) evals)
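The finite-difference half is framework-agnostic; a pure-numpy version shows the structure and can be validated against a known analytical gradient:

```python
import numpy as np

def finite_diff_grad(fn, x, eps=1e-4):
    """Central finite differences for a scalar fn of a flat array (O(N) evals)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (fn(xp) - fn(xm)) / (2 * eps)
    return g

# Validate against the analytical gradient of f(x) = sum(x**2): grad = 2x
x = np.array([1.0, -2.0, 0.5])
numerical = finite_diff_grad(lambda v: float((v ** 2).sum()), x)
assert np.allclose(numerical, 2 * x, atol=1e-2)
```

Swap the lambda for your MLX op (converting mx.array ↔ numpy at the boundary) to check a custom VJP.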
- mlx.__version__ doesn't exist — use importlib.metadata.version("mlx")
- Boolean mask indexing is limited — use argsort with sentinel values instead of x[mask]
- mx.eval() is mandatory — without it, the lazy graph grows unbounded; parameters don't materialize between training steps
- No torch.repeat_interleave — use the marker+cumsum approach above
- No torch.searchsorted — use numpy fallback: np.searchsorted(np.array(sorted), np.array(query))
- No torch.cdist — compute manually: ((a[:, None] - b[None, :]) ** 2).sum(-1).sqrt()
- No torch.unique_consecutive — mx.unique is always sorted (which is what you want)
- No torch.segment_reduce — implement via scatter ops (this IS the port)
- No mx.fast.metal_kernel() template params — use constant buffers, not template args
- arr.at[idx].add(val) returns a NEW array; no in-place mutation

For large ports (5+ kernels), use this parallelization pattern:
Phase 1 (sequential): Foundation → Testing Harness → Core Utilities
Phase 2 (parallel): All kernel PRDs simultaneously (they share only Phase 1 deps)
Phase 3 (sequential): Integration layers → Model → Training
Each kernel agent gets: function signature, numpy reference impl, test parameters, tolerance spec. They work independently and are validated against the full suite after completion.
For any kernel-heavy port, create individual PRDs with:
This section covers porting Intel RealSense SDK (librealsense) CUDA processing kernels and depth filters to MLX for Apple Silicon.
Wrap, don't replace. pyrealsense2 handles camera capture via libuvc on macOS (already works). We only port the compute/processing layer to MLX:
| Bucket | Source | MLX Replacement | Difficulty |
|---|---|---|---|
| Format conversion | 10 CUDA kernels (cuda-conversion.cu) | Vectorized bitwise/arithmetic | EASY |
| Point cloud | CUDA deproject (cuda-pointcloud.cu) | Precomputed coord grids + broadcast multiply | MEDIUM |
| Alignment | 5 CUDA kernels (cuda-align.cu) | Vectorized matmul + sort-based scatter-min | HARD |
| Spatial filter | CPU recursive (spatial-filter.cpp) | Row-parallel scan | MEDIUM-HARD |
| Temporal filter | CPU stateful (temporal-filter.cpp) | Vectorized alpha blend + state mgmt | MEDIUM |
| Decimation | CPU median/mean (decimation-filter.cpp) | Reshape + mx.median / mx.mean | EASY |
| Hole filling | CPU neighbor (hole-filling-filter.cpp) | Neighbor gather + min/max | EASY |
| Disparity | CPU division (disparity-transform.cpp) | Elementwise mx.where + division | TRIVIAL |
| Colorizer | CPU histogram + LUT (colorizer.cpp) | CPU histogram + MLX LUT gather | EASY-MEDIUM |
The SSE pointcloud optimization pattern applies perfectly to MLX:
# Precompute ONCE per intrinsics change (cache)
x_grid = (mx.arange(W) - ppx) / fx # (W,)
y_grid = (mx.arange(H) - ppy) / fy # (H,)
# Per frame: broadcast multiply (fast)
z = depth.astype(mx.float32) * depth_scale
X = x_grid[None, :] * z # (H, W)
Y = y_grid[:, None] * z # (H, W)
points = mx.stack([X, Y, z], axis=-1) # (H, W, 3)
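The same precompute pattern in numpy, cross-checked against the naive per-pixel deprojection (intrinsics here are made-up illustration values):

```python
import numpy as np

H, W = 4, 6
fx, fy, ppx, ppy = 600.0, 600.0, W / 2, H / 2   # toy intrinsics
depth_scale = 0.001
depth = np.random.default_rng(0).integers(100, 5000, (H, W)).astype(np.uint16)

# Precompute once per intrinsics change
x_grid = (np.arange(W) - ppx) / fx
y_grid = (np.arange(H) - ppy) / fy

# Per frame: broadcast multiply
z = depth.astype(np.float32) * depth_scale
pts = np.stack([x_grid[None, :] * z, y_grid[:, None] * z, z], axis=-1)

# Agrees with the naive per-pixel deprojection
u, v = 1, 2
expected = [(u - ppx) / fx * z[v, u], (v - ppy) / fy * z[v, u], z[v, u]]
assert np.allclose(pts[v, u], expected)
print(pts.shape)  # (4, 6, 3)
```

Only the two 1-D grids depend on intrinsics, so the per-frame cost is two broadcast multiplies and a stack.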
CUDA atomicMin has no MLX equivalent. Workaround:
# Numpy fallback: np.minimum.at accumulates the minimum over duplicate indices
tgt = np.array(target_indices)
vals = np.array(values)
np.minimum.at(output, tgt, vals)
# Pure-sort alternative: np.lexsort by (value, target index) so the first
# entry per target segment is the minimum, then keep segment starts
order = np.lexsort((vals, tgt))
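A tiny self-contained demonstration of the scatter-min semantics:

```python
import numpy as np

out = np.full(3, np.inf, dtype=np.float32)
idx = np.array([0, 1, 1, 0])
vals = np.array([5.0, 2.0, 7.0, 1.0], dtype=np.float32)

# np.minimum.at keeps the min over duplicate indices (like CUDA atomicMin);
# untouched slots keep the +inf sentinel
np.minimum.at(out, idx, vals)
print(out.tolist())  # [1.0, 2.0, inf]
```

Initializing with +inf (the atomicMin identity) mirrors the CUDA kernel's behavior for pixels no point maps to.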
Sequential dependency prevents full vectorization. Process all rows in parallel, sequential within each row:
Build librealsense with FORCE_RSUSB_BACKEND (libuvc) on Apple, NOT V4L2.

| Surface | Status | Notes |
|---|---|---|
| Camera capture | ✅ Already works | pyrealsense2 + libuvc on macOS |
| Format converters | PRD-01 | 10 CUDA kernels → MLX |
| Point cloud | PRD-02 | CUDA deproject → MLX |
| Depth filters (×5) | PRD-03 | CPU/CUDA → MLX |
| Alignment | PRD-04 | CUDA align → MLX |
| Colorizer | PRD-05 | CPU → MLX |
| Display viewer | PRD-06 | OpenCV → SDL2/Metal |
| Integration tests | PRD-07 | Full test suite |
| Benchmarks | PRD-08 | Performance validation |
cd realsense-mlx
uv venv .venv --python 3.12
uv pip install -e ".[dev]"
.venv/bin/pytest tests/ -q
This section captures patterns from porting NVIDIA Triton Inference Server (v2.67.0dev) to Apple Silicon. Unlike kernel/compute ports (Pointelligence, ZED), this is a server architecture port — replacing CUDA infrastructure in a production inference serving system.
| Bucket | NVIDIA Original | MLX Replacement |
|---|---|---|
| inference backend | CUDA/TensorRT model execution | mlx_lm.stream_generate() with sampler API |
| memory management | CUDA IPC, pinned memory, GPU pools | Unified memory (no copies), mx.get_active_memory() |
| model loading | custom C++ backend loaders | mlx_lm.load() → (model, tokenizer) |
| metrics/profiling | DCGM GPU metrics, NVTX tracing | mx.get_active/peak/cache_memory(), prometheus-client |
| server core | C++ (libevhtp HTTP, gRPC async) | Python-first (FastAPI + uvicorn) |
| protocols | KServe v2 + OpenAI-compatible | OpenAI-compatible REST (drop-in for any OpenAI SDK) |
| deployment | Docker/Linux containers | macOS native, uv pip install -e . |
The upstream Triton is a C++ server with Python frontends. For the MLX port, we chose Python-first:
This is the right call for Apple Silicon because:
- mlx_lm.stream_generate() already handles KV-cache, sampling, tokenization

The mlx-lm API changed significantly between versions. Key findings:
# OLD API (pre-0.30) — BROKEN, do not use
from mlx_lm.utils import generate_step # ❌ Moved
generate_step(prompt, model, temp=0.7) # ❌ temp/top_p kwargs removed
# NEW API (v0.31+) — correct
from mlx_lm import stream_generate, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors
# Sampling is now via callable sampler
sampler = make_sampler(temp=0.7, top_p=0.9)
# Penalties via logits processors
processors = make_logits_processors(
repetition_penalty=1.1,
frequency_penalty=0.5,
presence_penalty=0.3,
)
# stream_generate yields GenerationResponse dataclass
for response in stream_generate(model, tokenizer, prompt, max_tokens=256,
sampler=sampler, logits_processors=processors):
print(response.text, end="")
# response.token, response.logprobs, response.prompt_tokens,
# response.generation_tokens, response.generation_tps,
# response.peak_memory, response.finish_reason
GenerationResponse fields (v0.31+):
- text: str — decoded text segment
- token: int — token ID
- logprobs: mx.array — log probabilities
- prompt_tokens: int — prompt token count
- generation_tokens: int — generated token count
- generation_tps: float — tokens per second
- peak_memory: float — peak memory in GB
- finish_reason: Optional[str] — "length", "stop", or None

# OLD — deprecated, will be removed
mx.metal.get_active_memory()
mx.metal.get_peak_memory()
mx.metal.get_cache_memory()
# NEW — use these
mx.get_active_memory()
mx.get_peak_memory()
mx.get_cache_memory()
For porting any CUDA inference server to MLX:
- Replace model execution with mlx_lm.load() and mlx_lm.stream_generate().
- Treat any directory containing a config.json as a model.

OpenAI-compatible SSE streaming skeleton:

async def stream_chat(engine, request):
completion_id = f"chatcmpl-{uuid4().hex[:12]}"
created = int(time.time())
# First chunk: role only
yield format_sse({"choices": [{"delta": {"role": "assistant"}}]})
# Content chunks
async for chunk in engine.generate_stream(...):
if chunk.text:
yield format_sse({"choices": [{"delta": {"content": chunk.text}}]})
# Final chunk: finish_reason
yield format_sse({"choices": [{"delta": {}, "finish_reason": "stop"}]})
yield "data: [DONE]\n\n"
def format_sse(obj):
return f"data: {json.dumps(obj)}\n\n"
def discover_models(repo_path: Path) -> Iterator[ModelSpec]:  # yields lazily
for item in repo_path.iterdir():
if not item.is_dir() or item.name.startswith("."):
continue
config_path = item / "config.json"
if config_path.exists():
config = json.loads(config_path.read_text())
model_type = config.get("model_type", "unknown")
# Detect quantization from config
quant = config.get("quantization_config", {}).get("bits")
yield ModelSpec(name=item.name, path=str(item), ...)
def estimate_memory(config: dict) -> int:
"""Rough memory estimate in bytes."""
H = config.get("hidden_size", 4096)
L = config.get("num_hidden_layers", 32)
I = config.get("intermediate_size", H * 4)
V = config.get("vocab_size", 32000)
params_per_layer = 4*H*H + 3*H*I # attention + MLP
total_params = L * params_per_layer + 2 * V * H
bits = config.get("quantization_config", {}).get("bits", 16)
return int(total_params * (bits / 8) * 1.1) # 10% overhead
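Applying the estimator above to a Llama-2-7B-style config (assumed values) shows the fp16-vs-4-bit spread; the function is restated so the snippet runs standalone:

```python
def estimate_memory(config: dict) -> int:
    """Rough memory estimate in bytes (same formula as above)."""
    H = config.get("hidden_size", 4096)
    L = config.get("num_hidden_layers", 32)
    I = config.get("intermediate_size", H * 4)
    V = config.get("vocab_size", 32000)
    params = L * (4 * H * H + 3 * H * I) + 2 * V * H
    bits = config.get("quantization_config", {}).get("bits", 16)
    return int(params * (bits / 8) * 1.1)  # 10% overhead

# Llama-2-7B-shaped config (assumed values)
cfg = {"hidden_size": 4096, "num_hidden_layers": 32,
       "intermediate_size": 11008, "vocab_size": 32000}
fp16 = estimate_memory(cfg)
q4 = estimate_memory({**cfg, "quantization_config": {"bits": 4}})
print(round(fp16 / 1e9, 1), round(q4 / 1e9, 1))  # 14.8 3.7
```

Comparing the estimate against total unified memory before loading is what lets the server reject a model with a clean error instead of OOM-ing mid-load.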
Apple Silicon has limited GPU memory. Use asyncio.Semaphore to prevent OOM:
class ConcurrencyLimiter:
def __init__(self, max_concurrent: int = 4):
self._semaphore = asyncio.Semaphore(max_concurrent)
async def acquire(self) -> bool:
        return not self._semaphore.locked()  # non-blocking saturation check
# Or: await self._semaphore.acquire() for blocking
Return 429 Too Many Requests in OpenAI error format when saturated.
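A self-contained sketch of the reject-when-saturated path (the RuntimeError stands in for the HTTP 429 response; names here are illustrative):

```python
import asyncio

class ConcurrencyLimiter:
    """Sketch: reject immediately (the 429 path) when all slots are taken."""
    def __init__(self, max_concurrent: int = 2):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, fn):
        if self._sem.locked():                 # saturated: fail fast
            raise RuntimeError("429 Too Many Requests")
        async with self._sem:                  # slot held for the call
            return await fn()

async def demo():
    limiter = ConcurrencyLimiter(max_concurrent=1)

    async def work():
        await asyncio.sleep(0.01)
        return "ok"

    first = asyncio.ensure_future(limiter.run(work))
    await asyncio.sleep(0)                     # let the first request grab the slot
    try:
        await limiter.run(work)
        rejected = False
    except RuntimeError:
        rejected = True
    return await first, rejected

print(asyncio.run(demo()))  # ('ok', True)
```

Failing fast beats queueing here: a queued request on a saturated Metal GPU just times out at the client anyway.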
For server ports, the parallelization is different from kernel ports:
Phase 1 (sequential): Foundation + Testing Harness
- pyproject.toml, project structure, conftest.py
- Mock fixtures for engine, model manager, tokenizer
Phase 2 (parallel): All modules simultaneously
- Model Manager (discovery, loading, unloading)
- Inference Runtime (generate, stream, embed)
- Engine Adapter (OpenAI ↔ MLX bridge)
- API Routers (chat, completions, embeddings, models, health)
- Memory Manager (Metal tracking)
- Model Repository (config parsing, polling)
Phase 3 (parallel): Production hardening
- Metrics (Prometheus)
- Auth (API key middleware)
- Concurrency (Semaphore limiter)
- Logging (structured JSON)
- Error handling (OpenAI error format)
- Model warm-up and caching
Phase 4 (sequential): Integration + E2E tests
- Repo: github.com/RobotFlow-Labs/triton-mlx

Gotchas:
- mlx_lm generate API changed completely — v0.31+ uses a sampler callable, not temp/top_p kwargs. generate_step moved and requires different parameters.
- mx.metal.* deprecated — use mx.get_active_memory() etc. (no .metal. prefix)
- First mlx_lm.load() is 5-10x slower (Metal shader compilation). Always do a dummy forward pass on load.
- OpenAI(base_url="http://testserver/v1", http_client=test_client) enables SDK-level testing without a running server.
- SSE format is data: {json}\n\n with double newline. First chunk must have role only. Last data line is data: [DONE]\n\n.
- stream_generate handles EOS internally — it breaks on EOS tokens automatically; you just need to detect finish_reason in the response.
- Not all tokenizers have apply_chat_template. Always have a simple concatenation fallback.

This section covers porting the INRIA GRAPHDECO differentiable Gaussian splatting rasterizer from CUDA to MLX. The original has 5 CUDA kernels, ~4300 LOC, using CUB, GLM, and cooperative_groups.
| CUDA Kernel | Type | MLX Strategy | Complexity |
|---|---|---|---|
preprocessCUDA (per-Gaussian) | Elementwise | Pure MLX vectorized ops | MEDIUM |
computeCov3D (quat→cov) | Elementwise | Pure MLX: M = s[:,None] * R; Sigma = M^T @ M | EASY |
computeCov2D (EWA projection) | Elementwise | Pure MLX: Jacobian + matrix product | EASY |
computeColorFromSH (SH→RGB) | Elementwise | Pure MLX: degree-by-degree eval | EASY |
duplicateWithKeys + CUB sort | Sort/scatter | np.lexsort on (tile_id, depth) pairs | MEDIUM |
identifyTileRanges | Segment boundaries | np.diff + boundary detection | EASY |
renderCUDA forward (tile alpha blend) | Tile-based with shared mem | NumPy tile loops (Phase 1) → Metal kernel (Phase 2) | HARD |
renderCUDA backward (reverse alpha) | Tile-based with atomics | NumPy tile loops (Phase 1) → Metal kernel (Phase 2) | HARD |
CRITICAL LESSON: GLM uses column-major storage with column-vector convention. Manually porting GLM matrix derivatives (especially dL_dM = 2.0f * M * dL_dSigma) is extremely error-prone. The * operator in GLM performs standard matrix multiply, but the column-major storage means indices map differently than row-major C/Python.
Solution: For any function that is already implemented as pure MLX ops in the forward pass, use mx.grad() for the backward pass instead of manually porting the CUDA backward:
def compute_cov3D_backward(scales, scale_modifier, rotations, dL_dcov3D):
"""Use MLX autodiff through forward — correct by construction."""
from .projection import compute_cov3D
def loss_scales(s):
cov = compute_cov3D(s, scale_modifier, rotations)
return mx.sum(dL_dcov3D * cov)
def loss_rotations(r):
cov = compute_cov3D(scales, scale_modifier, r)
return mx.sum(dL_dcov3D * cov)
return mx.grad(loss_scales)(scales), mx.grad(loss_rotations)(rotations)
When to manually port CUDA backward vs. use MLX autodiff:
CUDA/GLM stores 4x4 matrices column-major as flat float[16]:
matrix[0]=m00 matrix[4]=m01 matrix[8]=m02 matrix[12]=m03
matrix[1]=m10 matrix[5]=m11 matrix[9]=m12 matrix[13]=m13
matrix[2]=m20 matrix[6]=m21 matrix[10]=m22 matrix[14]=m23
matrix[3]=m30 matrix[7]=m31 matrix[11]=m32 matrix[15]=m33
In MLX, keep the flat (16,) array and index directly:
def transform_point_4x3(points, matrix):
m = matrix # keep flat, do NOT reshape
x, y, z = points[:, 0], points[:, 1], points[:, 2]
tx = m[0] * x + m[4] * y + m[8] * z + m[12]
ty = m[1] * x + m[5] * y + m[9] * z + m[13]
tz = m[2] * x + m[6] * y + m[10] * z + m[14]
return mx.stack([tx, ty, tz], axis=-1)
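A numpy cross-check confirms the flat column-major indexing matches a standard matrix transform (reshape with order="F" recovers the (4, 4) matrix for comparison only — never in the hot path):

```python
import numpy as np

def transform_point_4x3(points, m):
    # m is the flat column-major float[16]; index directly, never reshape
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    tx = m[0] * x + m[4] * y + m[8] * z + m[12]
    ty = m[1] * x + m[5] * y + m[9] * z + m[13]
    tz = m[2] * x + m[6] * y + m[10] * z + m[14]
    return np.stack([tx, ty, tz], axis=-1)

rng = np.random.default_rng(0)
m_flat = rng.normal(size=16)          # GLM-style column-major storage
pts = rng.normal(size=(5, 3))

# order="F" interprets the flat buffer column-major, recovering M
M = m_flat.reshape(4, 4, order="F")
homog = np.concatenate([pts, np.ones((5, 1))], axis=1)
expected = (homog @ M.T)[:, :3]
assert np.allclose(transform_point_4x3(pts, m_flat), expected)
```

This is a useful unit test to pin down before porting any GLM-derived math: it catches the row/column-major transposition bug immediately.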
Do NOT reshape to (4,4) — it causes broadcasting errors and index confusion between row-major and column-major.
CUB radix sort is replaced with numpy lexsort:
# Expand visible Gaussians to tile-Gaussian pairs
gaussian_ids = np.repeat(visible_ids, tiles_per_gaussian)
expanded_depths = np.repeat(vis_depths, tiles_per_gaussian)
# Sort by (tile_id, depth) — lexsort sorts by last key first
sort_order = np.lexsort((expanded_depths, tile_ids))
# Build per-tile ranges from sorted boundaries
changes = np.concatenate([[0], np.where(np.diff(sorted_tile_ids) != 0)[0] + 1, [total]])
for i in range(len(changes) - 1):
ranges[sorted_tile_ids[changes[i]]] = (changes[i], changes[i+1])
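A worked miniature of the sort + range-building step (toy tile IDs and depths, assumed for illustration):

```python
import numpy as np

tile_ids = np.array([2, 0, 2, 1, 0])
depths   = np.array([0.5, 0.9, 0.1, 0.3, 0.2])

# lexsort: LAST key is primary — sort by tile, then front-to-back by depth
order = np.lexsort((depths, tile_ids))
sorted_tiles = tile_ids[order]
print(sorted_tiles.tolist())   # [0, 0, 1, 2, 2]
print(depths[order].tolist())  # [0.2, 0.9, 0.3, 0.1, 0.5]

# Per-tile (start, end) ranges from sorted boundaries
changes = np.concatenate([[0], np.where(np.diff(sorted_tiles) != 0)[0] + 1,
                          [len(sorted_tiles)]])
ranges = {int(sorted_tiles[s]): (int(s), int(e))
          for s, e in zip(changes[:-1], changes[1:])}
print(ranges)  # {0: (0, 2), 1: (2, 3), 2: (3, 5)}
```

The key-order inversion in np.lexsort (primary key last) is the most common mistake when replacing CUB's (tile_id, depth) radix sort.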
The rasterize pipeline has non-differentiable steps (sorting, tile assignment, culling), so mx.grad can't autodiff through the whole thing. Use @mx.custom_function with a custom VJP:
@mx.custom_function
def rasterize_diff(means3D, colors, opacities, ...):
# Forward: preprocess → sort → render
return rendered_image
@rasterize_diff.vjp
def rasterize_vjp(primals, cotangents, output):
# 1. Re-run forward to get intermediate state (cov, sorted lists, etc.)
# 2. Render backward (reverse traversal) → dL/dmean2D, dL/dconic, dL/dcolor, dL/dopacity
# 3. Preprocess backward (use mx.grad for cov3D/SH, manual for projection)
return (dL_dmeans3D, dL_dcolors, dL_dopacity, dL_dscales, dL_drotations, ...)
Pack non-differentiable settings into an array to work with @mx.custom_function:
settings_packed = mx.array([H, W, tanfovx, tanfovy, scale_mod, sh_deg, use_sh, prefiltered])
The backward pass is best split into two stages:
- Stage 1 — render backward: manual reverse alpha-blend traversal producing dL/dmean2D, dL/dconic, dL/dcolor, dL/dopacity.
- Stage 2 — preprocess backward: mx.grad() through the pure MLX forward functions for cov3D, cov2D, and SH. Only the projection gradient (NDC→pixel coordinate chain) needs manual implementation.

| Phase | Description | Status |
|---|---|---|
| Phase 1 | Pure Python/MLX forward pass + numpy renderer | Reference implementation |
| Phase 2 | @mx.custom_function VJP + backward pass | Needed for training |
| Phase 3 | Metal kernels for renderer (16x16 tile blocks with shared memory) | Performance |
| Phase 4 | Cross-validation vs CUDA reference + benchmark suite | Quality |
Gotchas:
- mx.zeros_like(arr, dtype=...) not supported — use mx.zeros(arr.shape, dtype=...) instead.
- mx.array.squeeze(-1) on (1,1) → scalar — use arr[:, 0] to get (1,) shape reliably.
- When the forward pass is pure MLX, get the backward with mx.grad() and it's correct by construction.
- The np.lexsort approach works but is a bottleneck for >100K Gaussians. A Metal kernel for radix sort would be the Phase 3 optimization.
- mx.eval() before numpy conversion — always materialize MLX arrays before passing to numpy for the sorting/rendering stages.

- Repo: github.com/RobotFlow-Labs/diff-gaussian-rasterization-mlx
- Upstream: github.com/graphdeco-inria/diff-gaussian-rasterization

References:
- references/progress_log.md
- zed-sdk-mlx/claude_supervision.md
- zed-sdk-mlx/references/progress_log.md
- github.com/RobotFlow-Labs/pointelligence-mlx
- realsense-mlx/prds/librealsense/ (Intel RealSense SDK v2.57.6)
- github.com/RobotFlow-Labs/triton-mlx
- github.com/triton-inference-server/server (v2.67.0dev)
- github.com/RobotFlow-Labs/diff-gaussian-rasterization-mlx
- github.com/graphdeco-inria/diff-gaussian-rasterization
- github.com/RobotFlow-Labs/project_skuld
- github.com/RobotFlow-Labs/project_tyr
- github.com/ml-explore/mlx/releases (CHECK THIS FIRST)

This section covers porting SAM (Segment-Anything) and CLIP from PyTorch CUDA to MLX for open-vocabulary segmentation on Apple Silicon. Validated in ANIMA Skuld (OVerSeeC) project.
For vision model ports where multiple backends must coexist, use a factory + auto-detect pattern instead of scattered conditionals:
# backend.py — single entry point
def detect_device() -> str:
"""MLX > CUDA > MPS > CPU priority."""
if _is_apple_silicon() and _has_mlx():
return "mlx"
if _has_torch_cuda():
return "cuda:0"
if _is_apple_silicon() and _has_torch_mps():
return "mps"
return "cpu"
def create_clip_encoder(device: str):
device = resolve_device(device) # "auto" → concrete
if device == "mlx":
from .clip_encoder_mlx import CLIPEncoderMLX
return CLIPEncoderMLX()
from .clip_encoder import CLIPEncoder
return CLIPEncoder(device=device)
Key insight: The factory returns objects with identical APIs — callers never know which backend they're using. The similarity() method returns numpy arrays from both backends.
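The _is_apple_silicon/_has_mlx probes aren't shown above; one hedged way to implement them (importlib.util.find_spec plus platform checks, an assumed approach):

```python
import importlib.util
import platform

def _has(pkg: str) -> bool:
    return importlib.util.find_spec(pkg) is not None

def detect_device() -> str:
    """MLX > CUDA > MPS > CPU, degrading gracefully when imports are missing."""
    apple = platform.system() == "Darwin" and platform.machine() == "arm64"
    if apple and _has("mlx"):
        return "mlx"
    if _has("torch"):
        import torch
        if torch.cuda.is_available():
            return "cuda:0"
        if apple and torch.backends.mps.is_available():
            return "mps"
    return "cpu"

print(detect_device() in {"mlx", "cuda:0", "mps", "cpu"})  # True
```

find_spec avoids actually importing heavy packages until a backend is chosen, which keeps startup fast on machines with multiple frameworks installed.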
| Model | MLX Package | Install | API Style |
|---|---|---|---|
| CLIP | mlx_clip (harperreed) | pip install mlx_clip or pip install git+https://github.com/harperreed/mlx_clip.git | clip.text_encoder(texts), clip.image_encoder(path) |
| CLIP | ml-explore/mlx-examples/clip | Clone + local import | model(**inputs) → .text_embeds, .image_embeds |
| SAM 3 | mlx_sam3 (Deekshith-Dade) | pip install git+https://github.com/Deekshith-Dade/mlx_sam3.git | SAM3().segment(image) |
| SAM | PyTorch on MPS | pip install segment-anything | Standard SAM API on "mps" device |
- mlx_clip.image_encoder() expects a file path, not a numpy array — save to a temp file first:
with tempfile.NamedTemporaryFile(suffix=".png") as f:
Image.fromarray(image).save(f.name)
embedding = clip.image_encoder(f.name)
- Return types differ per backend: the PyTorch path returns torch.Tensor, mlx_clip returns mlx.core.array. Always normalize to numpy at the API boundary:
emb_np = np.array(embedding)
norms = np.linalg.norm(emb_np, axis=-1, keepdims=True)
return emb_np / (norms + 1e-8)
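The boundary normalization is easy to unit-test in pure numpy (the epsilon guards the zero-vector case):

```python
import numpy as np

def normalize_embeddings(embedding):
    """Coerce any backend's output to unit-norm numpy rows."""
    emb = np.array(embedding, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=-1, keepdims=True)
    return emb / (norms + 1e-8)

out = normalize_embeddings([[3.0, 4.0], [0.0, 0.0]])
assert np.allclose(np.linalg.norm(out[0]), 1.0)   # 3-4-5 triangle → unit norm
assert np.allclose(out[1], 0.0)                   # zero vectors stay zero, no NaN
```

With unit-norm rows on both sides, cosine similarity reduces to a plain dot product, so similarity() can be one matmul regardless of backend.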
- mlx_clip auto-converts from HF: MLXClip("openai/clip-vit-large-patch14"). First load is slow (weight conversion), subsequent loads are cached.
- mlx_sam3 does not expose the segment_anything.SamAutomaticMaskGenerator API. Write an adapter:
class _MLXSam3Generator:
def generate(self, image):
results = self._model.segment(image)
return [{"segmentation": r["mask"].astype(bool),
"area": int(r["mask"].sum()),
"predicted_iou": r.get("iou", 0.9)} for r in results]
PyTorch segment-anything with model.to("mps") runs well. Use this fallback when mlx_sam3 isn't installed:
def _try_load_torch_mps(self):
model = sam_model_registry[vit_type](checkpoint=ckpt_path)
model.to("mps") # Apple GPU via PyTorch