Use when porting CUDA-first or Linux-first code to Apple Silicon MLX/Metal. Covers Triton/CUDA kernel porting to MLX Metal (sparse ops, atomics, custom VJP, mx.fast.metal_kernel), PyTorch nn.Module to mlx.nn.Module conversion, 3D Gaussian Splatting rasterization (tile-based rendering, SH evaluation, EWA splatting, GLM convention handling, hybrid autodiff backward), vision foundation models (SAM + CLIP multi-backend with auto-detect factory pattern, mlx_clip, mlx_sam3, PyTorch MPS fallback), IsaacLab simulator backends, Stereolabs ZED stereo cameras, Intel RealSense depth processing (CUDA→MLX filters, point cloud, alignment), NVIDIA Triton Inference Server to MLX (server architecture, mlx-lm integration, OpenAI-compatible API, production hardening), ragged tensor patterns, packaging splits, benchmark harnesses, and macOS install/CI paths. Apply whenever replacing CUDA, Triton, Warp, or Linux assumptions with MLX, Metal, CPU fallback, or mac-native adapters.
Use this skill when extending the IsaacLab MLX/mac-sim port, the zed-sdk-mlx stereo camera port, the realsense-mlx depth processing port, the triton-mlx inference server port, or porting similar CUDA-first / Linux-first systems to Apple Silicon.
Keep every port installable with uv and runnable without Isaac Sim or Linux-only deps.

EVERY TIME you start a port, you MUST check the latest MLX release on GitHub before writing any code:
# Check latest release — DO THIS FIRST
curl -s https://api.github.com/repos/ml-explore/mlx/releases/latest | jq -r '.tag_name, .published_at'
# Or: visit https://github.com/ml-explore/mlx/releases
Why: MLX ships new features and breaking changes frequently (monthly releases). Assuming an old API wastes hours. The nn module surface has expanded dramatically — v0.31+ has modules that v0.22 didn't.
| Package | Version | Release Date | PyPI |
|---|---|---|---|
| mlx | 0.31.1 | 2026-02-28 | pip install "mlx>=0.31.0" |
| mlx-metal | 0.31.1 | 2026-02-28 | Auto-installed with mlx |
| mlx-lm | 0.31.1 | 2026-03-11 | pip install "mlx-lm>=0.31.0" |
These are ALL available as of v0.31. Do NOT reimplement from scratch:
| Category | Available Modules |
|---|---|
| Linear | nn.Linear |
| Convolution | nn.Conv1d, nn.Conv2d, nn.Conv3d |
| Normalization | nn.LayerNorm, nn.BatchNorm, nn.RMSNorm, nn.GroupNorm |
| Recurrent | nn.LSTM, nn.GRU, nn.RNN |
| Attention | nn.MultiHeadAttention, nn.Transformer |
| Pooling | nn.AvgPool2d, nn.MaxPool2d |
| Embedding | nn.Embedding |
| Dropout | nn.Dropout |
| Activations | nn.ReLU, nn.GELU, nn.SiLU, nn.Softmax, nn.Sigmoid, nn.Tanh |
| Positional | nn.RoPE, nn.SinusoidalPositionalEncoding |
| Container | nn.Sequential |
| Optimizers | optim.SGD, optim.Adam, optim.AdamW, optim.RMSprop, optim.Adagrad, optim.Lion |
| LR Schedule | optim.linear_schedule, optim.cosine_decay, optim.step_decay |
Key convention: MLX Conv2d uses NHWC (channels last), NOT NCHW like PyTorch. All image tensors must be transposed.
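A minimal numpy sketch of the layout conversion (MLX itself not required to illustrate the transpose):

```python
import numpy as np

# PyTorch-style NCHW batch: (batch, channels, height, width)
x_nchw = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

# MLX Conv2d expects NHWC: move channels to the last axis
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))
print(x_nhwc.shape)  # (2, 4, 5, 3)

# And back to NCHW for comparing against a PyTorch reference
assert np.array_equal(np.transpose(x_nhwc, (0, 3, 1, 2)), x_nchw)
```

The same `(0, 2, 3, 1)` permutation applies to `mx.transpose` on MLX arrays.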
Classify every blocker into one bucket before editing:
torch.cuda, device placement, RNG, checkpointing, training loops.

Apply the matching replacement:
- Set ComputeBackend=mlx and SimBackend=mac-sim.
- Never import isaacsim, omni.*, carb, pxr, or warp unless that module is explicitly upstream-only.
- Prefer ComputeBackend, KernelBackend, SimBackend, SensorBackend, and PlannerBackend adapters over scattered conditionals.
- Expose --compute-backend, --kernel-backend, and --sim-backend.
- Log decisions in references/progress_log.md.
- macOS AVFoundation does not expose exposureDuration, iso, setExposureModeCustom, deviceWhiteBalanceGains, setWhiteBalanceModeLocked(with:), or setExposureTargetBias. Only mode toggling (auto/locked) works. Do not fake parity.
- Gate UVC sideband controls behind --enable-uvc-controls.
- Use pyramid_factor=2 or higher for live mode.
- Persist live tuner settings at ~/zed/settings/mlx_live_stereo.json (JSON format). The depth demo auto-loads the saved config; CLI flags override.

Grep for forbidden imports:

rg -n "torch\.cuda|import warp|from warp|import omni|from omni|import carb|from carb|import pxr|from pxr|import isaacsim|from isaacsim" <paths>
For ZED port:
rg -n "V4L2|v4l2|linux/videodev|ioctl|VIDIOC_" <paths>
Use uv. Record findings in references/progress_log.md.

macOS AVFoundation API reality (doc-grounded, verified 2026-03-14):
| Control | macOS AVFoundation | Alternative |
|---|---|---|
| Auto/locked exposure toggle | YES (10.7+) | — |
| Auto/locked white balance toggle | YES (10.7+) | — |
| Manual exposure duration/ISO | NO (iOS-only) | IOKit UVC sideband |
| Manual white balance gains | NO (iOS-only) | IOKit UVC sideband |
| Exposure compensation/bias | NO (iOS-only) | IOKit UVC sideband |
| Gain control | NO (nowhere) | IOKit UVC sideband |
| ROI metering | API exists, devices return false | — |
IOKit UVC Sideband Architecture (stretch goal):
AVFoundation ──────→ Video streaming (untouched)
IOKit USB sideband ─→ UVC control requests (exposure, gain, WB)
Gate behind the --enable-uvc-controls flag.

Current baseline: ~14-15 FPS at 720p on ZED 2i / M5 Mac.
Optimization priority (ranked by leverage):
| Rank | Action | Complexity | Expected Gain |
|---|---|---|---|
| 1 | Set refinement_radius=0 for live | Trivial | +5-10 FPS |
| 2 | Confirm pyramid_factor=2 default | Trivial | +8-15 FPS |
| 3 | Async double-buffer MLX pipeline | Medium | +3-5 FPS |
| 4 | SDL2 display instead of cv2.imshow | Low-Medium | +2-4 FPS |
| 5 | Zero-copy SHM read (numpy view) | Low | +0.5-1 FPS |
| 6 | Native Metal viewer process | High | +3-5 FPS |
| 7 | Raw Metal compute shader | Very High | +10-20 FPS |
| 8 | Semaphore SHM signaling | Low-Medium | +1-2 FPS latency |
The old zed_open_capture_depth_tune_stereo (OpenCV SGBM) is replaced by:
- scripts/live_mlx_disparity.py — MLX live tuner with interactive trackbars
- zed_mlx/live_config.py — Config dataclass, JSON serialization, validation
- ~/zed/settings/mlx_live_stereo.json — saved tuner config

| Surface | Classification | Notes |
|---|---|---|
| Raw video capture | Native parity | AVFoundation, validated ZED 2i |
| Sensor capture | Native parity | Same hidapi path |
| Video/sensor sync | macOS replacement | Timestamp-fallback (no hw gate) |
| Calibration loading | Native parity | Same ~/zed/settings/ path |
| Rectification | Native parity | Same OpenCV remap |
| Video example | Native parity | AVFoundation + OpenCV display |
| Multi-camera | macOS replacement | Needs validation |
| Camera control | Capability-gated partial | Mode toggle only |
| Depth example | macOS replacement | MLX disparity (routed) |
| Tune-stereo | macOS replacement | MLX live tuner |
| Point-cloud export | Native parity | MLX + PLY |
| Point-cloud viewer | Deferred | Export only |
| ROS2 wrapper | Deferred | After base stability |
- sample_heights(...)
- surface_normals(...)
- out_of_bounds(...)
- env_ids threading for flattened contact or raycast queries
- full benchmark run
- test_mac_semantic_drift.py after refreshing the baseline
- isaacsim_edge_aware_refine (biggest perf bottleneck)
- cv2.imshow display path (replace with SDL2/Metal)
- bytes() copy in shm_frame.py (zero-copy numpy view)

PYTHONPATH=.:source/isaaclab:source/isaaclab_rl .venv/bin/pytest \
scripts/tools/test/test_bootstrap_isaac_sources.py \
source/isaaclab/test/backends/test_runtime.py \
source/isaaclab/test/backends/test_task_registry.py \
source/isaaclab/test/backends/test_kernel_inventory.py \
source/isaaclab/test/backends/test_portability_utils.py \
source/isaaclab/test/backends/test_planner_compat.py \
source/isaaclab/test/backends/test_ros2_bridge.py \
source/isaaclab/test/backends/test_mac_benchmark_suite.py \
source/isaaclab/test/backends/test_mac_semantic_drift.py \
source/isaaclab/test/backends/test_mac_cartpole.py \
source/isaaclab/test/backends/test_mac_cartpole_showcase.py \
source/isaaclab/test/backends/test_mac_cart_double_pendulum.py \
source/isaaclab/test/backends/test_mac_quadcopter.py \
source/isaaclab/test/backends/test_mac_anymal_c.py \
source/isaaclab/test/backends/test_mac_anymal_c_rough.py \
source/isaaclab/test/backends/test_mac_franka_reach.py \
source/isaaclab/test/backends/test_mac_franka_lift.py \
source/isaaclab/test/backends/test_mac_h1.py \
source/isaaclab_rl/test/test_import_safety.py \
source/isaaclab_rl/test/test_mlx_wrapper.py -q
# Full test suite
cd zed-sdk-mlx && .venv/bin/pytest tests/ -q
# Live smoke (requires connected ZED camera)
make smoke-zed-terminal
make live-mlx-disparity-terminal
# Benchmark
make mlx-stereo-benchmark
make mlx-stereo-eval
- The mac-sim capability snapshot is useful, but task benchmarks should also report the concrete simulator adapter contract for the implemented slice.
- Raise UnsupportedBackendError or UnsupportedRuntimeFeatureError at use sites.
- pyramid_factor=2 with refinement_radius=0 roughly doubles FPS with minimal quality loss.

This section captures battle-tested patterns from porting PointCNN++ (5 Triton kernels, 4 CUDA kernels, full ResNet backbone) to MLX. Use these patterns for any project with custom CUDA/Triton sparse ops.
Classify each kernel before porting:
| Type | Triton/CUDA Pattern | MLX Replacement | Complexity |
|---|---|---|---|
| Elementwise indexed | out[i] = f(a[idx[i]], b[idx[i]]) | Vectorized: a[idx] - b[idx] then reduce | EASY |
| Scatter-reduce | atomicAdd(out[idx], val) | mx.array.at[idx].add(val) or numpy np.add.at | MEDIUM |
| Blocked scatter | Tiled iteration + atomic flush | Python block loop + scatter per block | MEDIUM |
| Segment reduce | Per-segment sum/mean/max/min | repeat_interleave_indices → segment_ids → scatter | MEDIUM |
| Sparse matmul | Gather-multiply-scatter with atomics | Blocked: gather block, contract, scatter-add | HARD |
| Warp-level ops | __shfl_sync, __ballot_sync | Not available in mx.fast.metal_kernel; use shared memory or SIMD groups in C++ extension | HARD |
| Need | MLX API | Notes |
|---|---|---|
| Custom Metal kernel | mx.fast.metal_kernel() | JIT compiled, source as string |
| C++ extension | Subclass mlx::core::Primitive + CMake | For complex kernels needing atomics |
| Atomics in Metal | mlx_atomic<T>, atomic_fetch_add_explicit | Via MLX's atomic.h header |
| Scatter-add | arr.at[indices].add(values) | Returns new array (functional) |
| Custom autograd | @mx.custom_function + .vjp | Replaces torch.autograd.Function |
| nn.Module | mlx.nn.Module with __call__ | Parameters auto-discovered |
| Version check | importlib.metadata.version("mlx") | NOT mlx.__version__ (doesn't exist) |
| GPU sync | mx.eval(tensor) | Required for timing/materialization |
| No int64 scatter | Cast to int32 for GPU scatter ops | MLX Metal backend limitation |
This is the single most important pattern. MVMR, VVOR, and segment reduces all use it.
Problem: Triton/CUDA use atomicAdd to accumulate results from multiple threads into shared output slots. MLX doesn't have atomic scatter in pure Python.
Solution: Blocked iteration with scatter-add
BLOCK = min(T, 4096)
output = mx.zeros((n_o, ...))
for t_start in range(0, T, BLOCK):
t_end = min(t_start + BLOCK, T)
# 1. Gather inputs for this block
a_block = a[a_idx[t_start:t_end]]
b_block = b[b_idx[t_start:t_end]]
o_block = o_idx[t_start:t_end]
# 2. Compute (elementwise/contract/outer product)
products = compute(a_block, b_block)
# 3. Scatter-add to output
output = output.at[o_block].add(products)
return output
Key insight: mx.array.at[indices].add(values) handles duplicate indices correctly — it accumulates. This replaces CUDA atomicAdd without explicit atomics.
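The accumulation semantics are the same as numpy's unbuffered np.add.at, which the table above names as the CPU fallback — a quick demonstration:

```python
import numpy as np

out = np.zeros(4, dtype=np.float32)
idx = np.array([0, 1, 1, 3])
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

# Duplicate index 1 accumulates (2.0 + 3.0), matching atomicAdd semantics
np.add.at(out, idx, vals)
print(out.tolist())  # [1.0, 5.0, 0.0, 4.0]
```

Plain `out[idx] += vals` would NOT accumulate duplicates (last write wins), which is the classic bug when porting atomic scatters.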
Block size tuning:
- Size BLOCK so the intermediate G * M * C product fits comfortably in memory.

Every differentiable sparse op needs a @mx.custom_function wrapper:
@mx.custom_function
def my_sparse_op(x, indices, lengths):
return _forward(x, indices, lengths)
@my_sparse_op.vjp
def my_sparse_op_vjp(primals, cotangents, output):
x, indices, lengths = primals
grad_output = cotangents
# Compute grad_x based on op type:
# SUM: scatter grad_output back to x positions
# MEAN: scatter grad_output / length
# MAX: scatter grad_output * (x == output) mask
grad_x = _backward(grad_output, x, indices, lengths, output)
return (grad_x, None, None) # None for non-differentiable args
Circular dependency handling (MVMR ↔ VVOR): MVMR backward needs VVOR, VVOR backward needs MVMR. Use late imports inside VJP functions:
@my_mvmr.vjp
def mvmr_vjp(primals, cotangents, output):
from pointelligence_mlx.sparse_engines.vvor import sparse_vector_vector_outer_product_reduction
# ... use VVOR for grad_a
MLX doesn't have torch.repeat_interleave. This is the universal replacement:
def repeat_interleave_indices(repeats):
    """[2, 0, 3, 1] → [0, 0, 2, 2, 2, 3]"""
    # Exclusive cumsum gives each segment's start offset in the output
    offsets = mx.cumsum(repeats) - repeats
    output_size = int(mx.sum(repeats).item())
    marker = mx.zeros((output_size,), dtype=mx.int32)
    # +1 at every in-bounds segment start. Empty segments share the next
    # segment's offset, so their IDs are skipped by the cumsum below;
    # trailing empty segments land at offset == output_size and are masked.
    in_bounds = (offsets < output_size).astype(mx.int32)
    safe_offsets = mx.minimum(offsets, output_size - 1)
    marker = marker.at[safe_offsets].add(in_bounds)
    return mx.cumsum(marker) - 1  # 0-indexed segment IDs
This single function powers: segment IDs for reduce, neighbor list expansion, batch index computation, and triplet building.
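A pure-numpy reference of the same marker + cumsum trick is handy as the test oracle (np.repeat gives the ground truth):

```python
import numpy as np

def repeat_interleave_ids(repeats):
    """Numpy reference for the marker + cumsum trick above."""
    repeats = np.asarray(repeats)
    offsets = np.cumsum(repeats) - repeats      # exclusive cumsum
    n = int(repeats.sum())
    marker = np.zeros(n, dtype=np.int64)
    in_bounds = offsets < n                     # drop trailing empty segments
    np.add.at(marker, offsets[in_bounds], 1)
    return np.cumsum(marker) - 1

ids = repeat_interleave_ids([2, 0, 3, 1])
print(ids.tolist())  # [0, 0, 2, 2, 2, 3]
assert np.array_equal(ids, np.repeat(np.arange(4), [2, 0, 3, 1]))
```

Note that np.add.at accumulates at duplicate offsets, which is what makes empty mid-array segments skip an ID rather than break the numbering.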
| PyTorch | MLX | Notes |
|---|---|---|
| nn.Module | mlx.nn.Module | |
| forward() | __call__() | MLX convention |
| nn.Parameter | Direct attribute | Auto-discovered |
| register_buffer | self.name = mx.array(...) | No registration needed |
| module.train()/eval() | No equivalent | Handle BN manually |
| nn.Linear(in, out) | mlx.nn.Linear(in, out) | Direct match |
| nn.BatchNorm1d | Custom RaggedBatchNorm | For variable-length batches |
| F.relu(x) | mlx.nn.relu(x) | |
| nn.Sequential | Custom MultiSequential | For (x, metadata) tuples |
Always use uv for MLX projects:
uv venv .venv --python 3.12
uv pip install -e ".[dev]"
- No --break-system-packages issues

[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta" # NOT _legacy:_Backend
[project]
dependencies = ["mlx>=0.31.0", "numpy>=1.24.0", "scipy>=1.10.0"]
[project.optional-dependencies]
dev = ["pytest>=7.0", "pytest-benchmark>=4.0", "torch>=2.6.0"]
import numpy as np

def check_all_close(mlx_result, reference, atol=1e-5, rtol=1e-5):
"""Compare with relative tolerance scaling (matches PyTorch test patterns)."""
actual = np.array(mlx_result)
expected = np.array(reference) # works with torch.Tensor, np.ndarray, mx.array
scale = max(1.0, np.abs(expected).max())
actual_atol = atol * scale
assert np.allclose(actual, expected, atol=actual_atol, rtol=rtol)
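A quick demonstration of why the tolerance scaling matters (restating the helper so the snippet is self-contained):

```python
import numpy as np

def check_all_close(mlx_result, reference, atol=1e-5, rtol=1e-5):
    actual = np.array(mlx_result)
    expected = np.array(reference)
    scale = max(1.0, np.abs(expected).max())
    assert np.allclose(actual, expected, atol=atol * scale, rtol=rtol)

# A 0.5 absolute error passes at 1e6 scale (scaled atol = 10)...
check_all_close([1e6 + 0.5], [1e6])

# ...but the same absolute error correctly fails at unit scale
try:
    check_all_close([1.5], [1.0])
    raised = False
except AssertionError:
    raised = True
assert raised
```

Without the scaling, large-magnitude reductions (e.g. sparse matmul outputs) fail on benign float32 rounding.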
Default tolerances per op type:
MLX has no built-in gradcheck. Use finite differences:
def check_gradient(fn, args, argnums, eps=1e-4, atol=1e-2):
    grad_fn = mx.grad(fn, argnums=argnums)  # fn must return a scalar
analytical = grad_fn(*args)
# Central finite differences per element (O(N) evals)
# For large tensors: random projection check (O(K) evals)
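The finite-difference half is framework-agnostic; a pure-numpy version shows the structure and can be validated against a known analytical gradient:

```python
import numpy as np

def finite_diff_grad(fn, x, eps=1e-4):
    """Central finite differences for a scalar fn of a flat array (O(N) evals)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (fn(xp) - fn(xm)) / (2 * eps)
    return g

# Validate against the analytical gradient of f(x) = sum(x**2): grad = 2x
x = np.array([1.0, -2.0, 0.5])
numerical = finite_diff_grad(lambda v: float((v ** 2).sum()), x)
assert np.allclose(numerical, 2 * x, atol=1e-2)
```

Swap the lambda for your MLX op (converting mx.array ↔ numpy at the boundary) to check a custom VJP.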
- mlx.__version__ doesn't exist — use importlib.metadata.version("mlx")
- Boolean mask indexing is limited — use argsort with sentinel values instead of x[mask]
- mx.eval() is mandatory — without it, the lazy graph grows unbounded; parameters don't materialize between training steps
- No torch.repeat_interleave — use the marker+cumsum approach above
- No torch.searchsorted — use numpy fallback: np.searchsorted(np.array(sorted), np.array(query))
- No torch.cdist — compute manually: ((a[:, None] - b[None, :]) ** 2).sum(-1).sqrt()
- No torch.unique_consecutive — mx.unique is always sorted (which is what you want)
- No torch.segment_reduce — implement via scatter ops (this IS the port)
- No mx.fast.metal_kernel() template params — use constant buffers, not template args
- arr.at[idx].add(val) returns a NEW array; no in-place mutation

For large ports (5+ kernels), use this parallelization pattern:
Phase 1 (sequential): Foundation → Testing Harness → Core Utilities
Phase 2 (parallel): All kernel PRDs simultaneously (they share only Phase 1 deps)
Phase 3 (sequential): Integration layers → Model → Training
Each kernel agent gets: function signature, numpy reference impl, test parameters, tolerance spec. They work independently and are validated against the full suite after completion.
For any kernel-heavy port, create individual PRDs with:
This section covers porting Intel RealSense SDK (librealsense) CUDA processing kernels and depth filters to MLX for Apple Silicon.
Wrap, don't replace. pyrealsense2 handles camera capture via libuvc on macOS (already works). We only port the compute/processing layer to MLX:
| Bucket | Source | MLX Replacement | Difficulty |
|---|---|---|---|
| Format conversion | 10 CUDA kernels (cuda-conversion.cu) | Vectorized bitwise/arithmetic | EASY |
| Point cloud | CUDA deproject (cuda-pointcloud.cu) | Precomputed coord grids + broadcast multiply | MEDIUM |
| Alignment | 5 CUDA kernels (cuda-align.cu) | Vectorized matmul + sort-based scatter-min | HARD |
| Spatial filter | CPU recursive (spatial-filter.cpp) | Row-parallel scan | MEDIUM-HARD |
| Temporal filter | CPU stateful (temporal-filter.cpp) | Vectorized alpha blend + state mgmt | MEDIUM |
| Decimation | CPU median/mean (decimation-filter.cpp) | Reshape + mx.median / mx.mean | EASY |
| Hole filling | CPU neighbor (hole-filling-filter.cpp) | Neighbor gather + min/max | EASY |
| Disparity | CPU division (disparity-transform.cpp) | Elementwise mx.where + division | TRIVIAL |
| Colorizer | CPU histogram + LUT (colorizer.cpp) | CPU histogram + MLX LUT gather | EASY-MEDIUM |
The SSE pointcloud optimization pattern applies perfectly to MLX:
# Precompute ONCE per intrinsics change (cache)
x_grid = (mx.arange(W) - ppx) / fx # (W,)
y_grid = (mx.arange(H) - ppy) / fy # (H,)
# Per frame: broadcast multiply (fast)
z = depth.astype(mx.float32) * depth_scale
X = x_grid[None, :] * z # (H, W)
Y = y_grid[:, None] * z # (H, W)
points = mx.stack([X, Y, z], axis=-1) # (H, W, 3)
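The same precompute pattern in numpy, cross-checked against the naive per-pixel deprojection (intrinsics here are made-up illustration values):

```python
import numpy as np

H, W = 4, 6
fx, fy, ppx, ppy = 600.0, 600.0, W / 2, H / 2   # toy intrinsics
depth_scale = 0.001
depth = np.random.default_rng(0).integers(100, 5000, (H, W)).astype(np.uint16)

# Precompute once per intrinsics change
x_grid = (np.arange(W) - ppx) / fx
y_grid = (np.arange(H) - ppy) / fy

# Per frame: broadcast multiply
z = depth.astype(np.float32) * depth_scale
pts = np.stack([x_grid[None, :] * z, y_grid[:, None] * z, z], axis=-1)

# Agrees with the naive per-pixel deprojection
u, v = 1, 2
expected = [(u - ppx) / fx * z[v, u], (v - ppy) / fy * z[v, u], z[v, u]]
assert np.allclose(pts[v, u], expected)
print(pts.shape)  # (4, 6, 3)
```

Only the two 1-D grids depend on intrinsics, so the per-frame cost is two broadcast multiplies and a stack.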
CUDA atomicMin has no MLX equivalent. Workaround:
# Numpy fallback: np.minimum.at accumulates the minimum over duplicate indices
tgt = np.array(target_indices)
vals = np.array(values)
np.minimum.at(output, tgt, vals)
# Pure-sort alternative: np.lexsort by (value, target index) so the first
# entry per target segment is the minimum, then keep segment starts
order = np.lexsort((vals, tgt))
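A tiny self-contained demonstration of the scatter-min semantics:

```python
import numpy as np

out = np.full(3, np.inf, dtype=np.float32)
idx = np.array([0, 1, 1, 0])
vals = np.array([5.0, 2.0, 7.0, 1.0], dtype=np.float32)

# np.minimum.at keeps the min over duplicate indices (like CUDA atomicMin);
# untouched slots keep the +inf sentinel
np.minimum.at(out, idx, vals)
print(out.tolist())  # [1.0, 2.0, inf]
```

Initializing with +inf (the atomicMin identity) mirrors the CUDA kernel's behavior for pixels no point maps to.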
Sequential dependency prevents full vectorization. Process all rows in parallel, sequential within each row:
Build librealsense with FORCE_RSUSB_BACKEND (libuvc) on Apple, NOT V4L2.

| Surface | Status | Notes |
|---|---|---|
| Camera capture | ✅ Already works | pyrealsense2 + libuvc on macOS |
| Format converters | PRD-01 | 10 CUDA kernels → MLX |
| Point cloud | PRD-02 | CUDA deproject → MLX |
| Depth filters (×5) | PRD-03 | CPU/CUDA → MLX |
| Alignment | PRD-04 | CUDA align → MLX |
| Colorizer | PRD-05 | CPU → MLX |
| Display viewer | PRD-06 | OpenCV → SDL2/Metal |
| Integration tests | PRD-07 | Full test suite |
| Benchmarks | PRD-08 | Performance validation |
cd realsense-mlx
uv venv .venv --python 3.12
uv pip install -e ".[dev]"
.venv/bin/pytest tests/ -q
This section captures patterns from porting NVIDIA Triton Inference Server (v2.67.0dev) to Apple Silicon. Unlike kernel/compute ports (Pointelligence, ZED), this is a server architecture port — replacing CUDA infrastructure in a production inference serving system.
| Bucket | NVIDIA Original | MLX Replacement |
|---|---|---|
| inference backend | CUDA/TensorRT model execution | mlx_lm.stream_generate() with sampler API |
| memory management | CUDA IPC, pinned memory, GPU pools | Unified memory (no copies), mx.get_active_memory() |
| model loading | custom C++ backend loaders | mlx_lm.load() → (model, tokenizer) |
| metrics/profiling | DCGM GPU metrics, NVTX tracing | mx.get_active/peak/cache_memory(), prometheus-client |
| server core | C++ (libevhtp HTTP, gRPC async) | Python-first (FastAPI + uvicorn) |
| protocols | KServe v2 + OpenAI-compatible | OpenAI-compatible REST (drop-in for any OpenAI SDK) |
| deployment | Docker/Linux containers | macOS native, uv pip install -e . |
The upstream Triton is a C++ server with Python frontends. For the MLX port, we chose Python-first:
This is the right call for Apple Silicon because:
- mlx_lm.stream_generate() already handles KV-cache, sampling, tokenization

The mlx-lm API changed significantly between versions. Key findings:
# OLD API (pre-0.30) — BROKEN, do not use
from mlx_lm.utils import generate_step # ❌ Moved
generate_step(prompt, model, temp=0.7) # ❌ temp/top_p kwargs removed
# NEW API (v0.31+) — correct
from mlx_lm import stream_generate, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors
# Sampling is now via callable sampler
sampler = make_sampler(temp=0.7, top_p=0.9)
# Penalties via logits processors
processors = make_logits_processors(
repetition_penalty=1.1,
frequency_penalty=0.5,
presence_penalty=0.3,
)
# stream_generate yields GenerationResponse dataclass
for response in stream_generate(model, tokenizer, prompt, max_tokens=256,
sampler=sampler, logits_processors=processors):
print(response.text, end="")
# response.token, response.logprobs, response.prompt_tokens,
# response.generation_tokens, response.generation_tps,
# response.peak_memory, response.finish_reason
GenerationResponse fields (v0.31+):
- text: str — decoded text segment
- token: int — token ID
- logprobs: mx.array — log probabilities
- prompt_tokens: int — prompt token count
- generation_tokens: int — generated token count
- generation_tps: float — tokens per second
- peak_memory: float — peak memory in GB
- finish_reason: Optional[str] — "length", "stop", or None

# OLD — deprecated, will be removed
mx.metal.get_active_memory()
mx.metal.get_peak_memory()
mx.metal.get_cache_memory()
# NEW — use these
mx.get_active_memory()
mx.get_peak_memory()
mx.get_cache_memory()
For porting any CUDA inference server to MLX:
- Replace model execution with mlx_lm.load() and mlx_lm.stream_generate().
- Treat any directory containing a config.json as a model.

OpenAI-compatible SSE streaming skeleton:

async def stream_chat(engine, request):
completion_id = f"chatcmpl-{uuid4().hex[:12]}"
created = int(time.time())
# First chunk: role only
yield format_sse({"choices": [{"delta": {"role": "assistant"}}]})
# Content chunks
async for chunk in engine.generate_stream(...):
if chunk.text:
yield format_sse({"choices": [{"delta": {"content": chunk.text}}]})
# Final chunk: finish_reason
yield format_sse({"choices": [{"delta": {}, "finish_reason": "stop"}]})
yield "data: [DONE]\n\n"
def format_sse(obj):
return f"data: {json.dumps(obj)}\n\n"
def discover_models(repo_path: Path) -> Iterator[ModelSpec]:  # yields lazily
for item in repo_path.iterdir():
if not item.is_dir() or item.name.startswith("."):
continue
config_path = item / "config.json"
if config_path.exists():
config = json.loads(config_path.read_text())
model_type = config.get("model_type", "unknown")
# Detect quantization from config
quant = config.get("quantization_config", {}).get("bits")
yield ModelSpec(name=item.name, path=str(item), ...)
def estimate_memory(config: dict) -> int:
"""Rough memory estimate in bytes."""
H = config.get("hidden_size", 4096)
L = config.get("num_hidden_layers", 32)
I = config.get("intermediate_size", H * 4)
V = config.get("vocab_size", 32000)
params_per_layer = 4*H*H + 3*H*I # attention + MLP
total_params = L * params_per_layer + 2 * V * H
bits = config.get("quantization_config", {}).get("bits", 16)
return int(total_params * (bits / 8) * 1.1) # 10% overhead
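Applying the estimator above to a Llama-2-7B-style config (assumed values) shows the fp16-vs-4-bit spread; the function is restated so the snippet runs standalone:

```python
def estimate_memory(config: dict) -> int:
    """Rough memory estimate in bytes (same formula as above)."""
    H = config.get("hidden_size", 4096)
    L = config.get("num_hidden_layers", 32)
    I = config.get("intermediate_size", H * 4)
    V = config.get("vocab_size", 32000)
    params = L * (4 * H * H + 3 * H * I) + 2 * V * H
    bits = config.get("quantization_config", {}).get("bits", 16)
    return int(params * (bits / 8) * 1.1)  # 10% overhead

# Llama-2-7B-shaped config (assumed values)
cfg = {"hidden_size": 4096, "num_hidden_layers": 32,
       "intermediate_size": 11008, "vocab_size": 32000}
fp16 = estimate_memory(cfg)
q4 = estimate_memory({**cfg, "quantization_config": {"bits": 4}})
print(round(fp16 / 1e9, 1), round(q4 / 1e9, 1))  # 14.8 3.7
```

Comparing the estimate against total unified memory before loading is what lets the server reject a model with a clean error instead of OOM-ing mid-load.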
Apple Silicon has limited GPU memory. Use asyncio.Semaphore to prevent OOM:
class ConcurrencyLimiter:
def __init__(self, max_concurrent: int = 4):
self._semaphore = asyncio.Semaphore(max_concurrent)
async def acquire(self) -> bool:
        return not self._semaphore.locked()  # non-blocking saturation check
# Or: await self._semaphore.acquire() for blocking
Return 429 Too Many Requests in OpenAI error format when saturated.
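A self-contained sketch of the reject-when-saturated path (the RuntimeError stands in for the HTTP 429 response; names here are illustrative):

```python
import asyncio

class ConcurrencyLimiter:
    """Sketch: reject immediately (the 429 path) when all slots are taken."""
    def __init__(self, max_concurrent: int = 2):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, fn):
        if self._sem.locked():                 # saturated: fail fast
            raise RuntimeError("429 Too Many Requests")
        async with self._sem:                  # slot held for the call
            return await fn()

async def demo():
    limiter = ConcurrencyLimiter(max_concurrent=1)

    async def work():
        await asyncio.sleep(0.01)
        return "ok"

    first = asyncio.ensure_future(limiter.run(work))
    await asyncio.sleep(0)                     # let the first request grab the slot
    try:
        await limiter.run(work)
        rejected = False
    except RuntimeError:
        rejected = True
    return await first, rejected

print(asyncio.run(demo()))  # ('ok', True)
```

Failing fast beats queueing here: a queued request on a saturated Metal GPU just times out at the client anyway.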
For server ports, the parallelization is different from kernel ports:
Phase 1 (sequential): Foundation + Testing Harness
- pyproject.toml, project structure, conftest.py
- Mock fixtures for engine, model manager, tokenizer
Phase 2 (parallel): All modules simultaneously
- Model Manager (discovery, loading, unloading)
- Inference Runtime (generate, stream, embed)
- Engine Adapter (OpenAI ↔ MLX bridge)
- API Routers (chat, completions, embeddings, models, health)
- Memory Manager (Metal tracking)
- Model Repository (config parsing, polling)
Phase 3 (parallel): Production hardening
- Metrics (Prometheus)
- Auth (API key middleware)
- Concurrency (Semaphore limiter)
- Logging (structured JSON)
- Error handling (OpenAI error format)
- Model warm-up and caching
Phase 4 (sequential): Integration + E2E tests
- Repo: github.com/RobotFlow-Labs/triton-mlx

Gotchas:
- mlx_lm generate API changed completely — v0.31+ uses a sampler callable, not temp/top_p kwargs. generate_step moved and requires different parameters.
- mx.metal.* deprecated — use mx.get_active_memory() etc. (no .metal. prefix)
- First mlx_lm.load() is 5-10x slower (Metal shader compilation). Always do a dummy forward pass on load.
- OpenAI(base_url="http://testserver/v1", http_client=test_client) enables SDK-level testing without a running server.
- SSE format is data: {json}\n\n with double newline. First chunk must have role only. Last data line is data: [DONE]\n\n.
- stream_generate handles EOS internally — it breaks on EOS tokens automatically; you just need to detect finish_reason in the response.
- Not all tokenizers have apply_chat_template. Always have a simple concatenation fallback.

This section covers porting the INRIA GRAPHDECO differentiable Gaussian splatting rasterizer from CUDA to MLX. The original has 5 CUDA kernels, ~4300 LOC, using CUB, GLM, and cooperative_groups.
| CUDA Kernel | Type | MLX Strategy | Complexity |
|---|---|---|---|
preprocessCUDA (per-Gaussian) | Elementwise | Pure MLX vectorized ops | MEDIUM |
computeCov3D (quat→cov) | Elementwise | Pure MLX: M = s[:,None] * R; Sigma = M^T @ M | EASY |
computeCov2D (EWA projection) | Elementwise | Pure MLX: Jacobian + matrix product | EASY |
computeColorFromSH (SH→RGB) | Elementwise | Pure MLX: degree-by-degree eval | EASY |
duplicateWithKeys + CUB sort | Sort/scatter | np.lexsort on (tile_id, depth) pairs | MEDIUM |
identifyTileRanges | Segment boundaries | np.diff + boundary detection | EASY |
renderCUDA forward (tile alpha blend) | Tile-based with shared mem | NumPy tile loops (Phase 1) → Metal kernel (Phase 2) | HARD |
renderCUDA backward (reverse alpha) | Tile-based with atomics | NumPy tile loops (Phase 1) → Metal kernel (Phase 2) | HARD |
CRITICAL LESSON: GLM uses column-major storage with column-vector convention. Manually porting GLM matrix derivatives (especially dL_dM = 2.0f * M * dL_dSigma) is extremely error-prone. The * operator in GLM performs standard matrix multiply, but the column-major storage means indices map differently than row-major C/Python.
Solution: For any function that is already implemented as pure MLX ops in the forward pass, use mx.grad() for the backward pass instead of manually porting the CUDA backward:
def compute_cov3D_backward(scales, scale_modifier, rotations, dL_dcov3D):
"""Use MLX autodiff through forward — correct by construction."""
from .projection import compute_cov3D
def loss_scales(s):
cov = compute_cov3D(s, scale_modifier, rotations)
return mx.sum(dL_dcov3D * cov)
def loss_rotations(r):
cov = compute_cov3D(scales, scale_modifier, r)
return mx.sum(dL_dcov3D * cov)
return mx.grad(loss_scales)(scales), mx.grad(loss_rotations)(rotations)
When to manually port CUDA backward vs. use MLX autodiff:
CUDA/GLM stores 4x4 matrices column-major as flat float[16]:
matrix[0]=m00 matrix[4]=m01 matrix[8]=m02 matrix[12]=m03
matrix[1]=m10 matrix[5]=m11 matrix[9]=m12 matrix[13]=m13
matrix[2]=m20 matrix[6]=m21 matrix[10]=m22 matrix[14]=m23
matrix[3]=m30 matrix[7]=m31 matrix[11]=m32 matrix[15]=m33
In MLX, keep the flat (16,) array and index directly:
def transform_point_4x3(points, matrix):
m = matrix # keep flat, do NOT reshape
x, y, z = points[:, 0], points[:, 1], points[:, 2]
tx = m[0] * x + m[4] * y + m[8] * z + m[12]
ty = m[1] * x + m[5] * y + m[9] * z + m[13]
tz = m[2] * x + m[6] * y + m[10] * z + m[14]
return mx.stack([tx, ty, tz], axis=-1)
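A numpy cross-check confirms the flat column-major indexing matches a standard matrix transform (reshape with order="F" recovers the (4, 4) matrix for comparison only — never in the hot path):

```python
import numpy as np

def transform_point_4x3(points, m):
    # m is the flat column-major float[16]; index directly, never reshape
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    tx = m[0] * x + m[4] * y + m[8] * z + m[12]
    ty = m[1] * x + m[5] * y + m[9] * z + m[13]
    tz = m[2] * x + m[6] * y + m[10] * z + m[14]
    return np.stack([tx, ty, tz], axis=-1)

rng = np.random.default_rng(0)
m_flat = rng.normal(size=16)          # GLM-style column-major storage
pts = rng.normal(size=(5, 3))

# order="F" interprets the flat buffer column-major, recovering M
M = m_flat.reshape(4, 4, order="F")
homog = np.concatenate([pts, np.ones((5, 1))], axis=1)
expected = (homog @ M.T)[:, :3]
assert np.allclose(transform_point_4x3(pts, m_flat), expected)
```

This is a useful unit test to pin down before porting any GLM-derived math: it catches the row/column-major transposition bug immediately.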
Do NOT reshape to (4,4) — it causes broadcasting errors and index confusion between row-major and column-major.
CUB radix sort is replaced with numpy lexsort:
# Expand visible Gaussians to tile-Gaussian pairs
gaussian_ids = np.repeat(visible_ids, tiles_per_gaussian)
expanded_depths = np.repeat(vis_depths, tiles_per_gaussian)
# Sort by (tile_id, depth) — lexsort sorts by last key first
sort_order = np.lexsort((expanded_depths, tile_ids))
# Build per-tile ranges from sorted boundaries
changes = np.concatenate([[0], np.where(np.diff(sorted_tile_ids) != 0)[0] + 1, [total]])
for i in range(len(changes) - 1):
ranges[sorted_tile_ids[changes[i]]] = (changes[i], changes[i+1])
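A worked miniature of the sort + range-building step (toy tile IDs and depths, assumed for illustration):

```python
import numpy as np

tile_ids = np.array([2, 0, 2, 1, 0])
depths   = np.array([0.5, 0.9, 0.1, 0.3, 0.2])

# lexsort: LAST key is primary — sort by tile, then front-to-back by depth
order = np.lexsort((depths, tile_ids))
sorted_tiles = tile_ids[order]
print(sorted_tiles.tolist())   # [0, 0, 1, 2, 2]
print(depths[order].tolist())  # [0.2, 0.9, 0.3, 0.1, 0.5]

# Per-tile (start, end) ranges from sorted boundaries
changes = np.concatenate([[0], np.where(np.diff(sorted_tiles) != 0)[0] + 1,
                          [len(sorted_tiles)]])
ranges = {int(sorted_tiles[s]): (int(s), int(e))
          for s, e in zip(changes[:-1], changes[1:])}
print(ranges)  # {0: (0, 2), 1: (2, 3), 2: (3, 5)}
```

The key-order inversion in np.lexsort (primary key last) is the most common mistake when replacing CUB's (tile_id, depth) radix sort.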
The rasterize pipeline has non-differentiable steps (sorting, tile assignment, culling), so mx.grad can't autodiff through the whole thing. Use @mx.custom_function with a custom VJP:
@mx.custom_function
def rasterize_diff(means3D, colors, opacities, ...):
# Forward: preprocess → sort → render
return rendered_image
@rasterize_diff.vjp
def rasterize_vjp(primals, cotangents, output):
# 1. Re-run forward to get intermediate state (cov, sorted lists, etc.)
# 2. Render backward (reverse traversal) → dL/dmean2D, dL/dconic, dL/dcolor, dL/dopacity
# 3. Preprocess backward (use mx.grad for cov3D/SH, manual for projection)
return (dL_dmeans3D, dL_dcolors, dL_dopacity, dL_dscales, dL_drotations, ...)
Pack non-differentiable settings into an array to work with @mx.custom_function:
settings_packed = mx.array([H, W, tanfovx, tanfovy, scale_mod, sh_deg, use_sh, prefiltered])
The backward pass is best split into two stages:
- Stage 1 — render backward: manual reverse alpha-blend traversal producing dL/dmean2D, dL/dconic, dL/dcolor, dL/dopacity.
- Stage 2 — preprocess backward: mx.grad() through the pure MLX forward functions for cov3D, cov2D, and SH. Only the projection gradient (NDC→pixel coordinate chain) needs manual implementation.

| Phase | Description | Status |
|---|---|---|
| Phase 1 | Pure Python/MLX forward pass + numpy renderer | Reference implementation |
| Phase 2 | @mx.custom_function VJP + backward pass | Needed for training |
| Phase 3 | Metal kernels for renderer (16x16 tile blocks with shared memory) | Performance |
| Phase 4 | Cross-validation vs CUDA reference + benchmark suite | Quality |
Gotchas:
- mx.zeros_like(arr, dtype=...) not supported — use mx.zeros(arr.shape, dtype=...) instead.
- mx.array.squeeze(-1) on (1,1) → scalar — use arr[:, 0] to get (1,) shape reliably.
- When the forward pass is pure MLX, get the backward with mx.grad() and it's correct by construction.
- The np.lexsort approach works but is a bottleneck for >100K Gaussians. A Metal kernel for radix sort would be the Phase 3 optimization.
- mx.eval() before numpy conversion — always materialize MLX arrays before passing to numpy for the sorting/rendering stages.

- Repo: github.com/RobotFlow-Labs/diff-gaussian-rasterization-mlx
- Upstream: github.com/graphdeco-inria/diff-gaussian-rasterization

References:
- references/progress_log.md
- zed-sdk-mlx/claude_supervision.md
- zed-sdk-mlx/references/progress_log.md
- github.com/RobotFlow-Labs/pointelligence-mlx
- realsense-mlx/prds/librealsense/ (Intel RealSense SDK v2.57.6)
- github.com/RobotFlow-Labs/triton-mlx
- github.com/triton-inference-server/server (v2.67.0dev)
- github.com/RobotFlow-Labs/diff-gaussian-rasterization-mlx
- github.com/graphdeco-inria/diff-gaussian-rasterization
- github.com/RobotFlow-Labs/project_skuld
- github.com/RobotFlow-Labs/project_tyr
- github.com/ml-explore/mlx/releases (CHECK THIS FIRST)

This section covers porting SAM (Segment-Anything) and CLIP from PyTorch CUDA to MLX for open-vocabulary segmentation on Apple Silicon. Validated in ANIMA Skuld (OVerSeeC) project.
For vision model ports where multiple backends must coexist, use a factory + auto-detect pattern instead of scattered conditionals:
# backend.py — single entry point
def detect_device() -> str:
"""MLX > CUDA > MPS > CPU priority."""
if _is_apple_silicon() and _has_mlx():
return "mlx"
if _has_torch_cuda():
return "cuda:0"
if _is_apple_silicon() and _has_torch_mps():
return "mps"
return "cpu"
def create_clip_encoder(device: str):
device = resolve_device(device) # "auto" → concrete
if device == "mlx":
from .clip_encoder_mlx import CLIPEncoderMLX
return CLIPEncoderMLX()
from .clip_encoder import CLIPEncoder
return CLIPEncoder(device=device)
Key insight: The factory returns objects with identical APIs — callers never know which backend they're using. The similarity() method returns numpy arrays from both backends.
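The _is_apple_silicon/_has_mlx probes aren't shown above; one hedged way to implement them (importlib.util.find_spec plus platform checks, an assumed approach):

```python
import importlib.util
import platform

def _has(pkg: str) -> bool:
    return importlib.util.find_spec(pkg) is not None

def detect_device() -> str:
    """MLX > CUDA > MPS > CPU, degrading gracefully when imports are missing."""
    apple = platform.system() == "Darwin" and platform.machine() == "arm64"
    if apple and _has("mlx"):
        return "mlx"
    if _has("torch"):
        import torch
        if torch.cuda.is_available():
            return "cuda:0"
        if apple and torch.backends.mps.is_available():
            return "mps"
    return "cpu"

print(detect_device() in {"mlx", "cuda:0", "mps", "cpu"})  # True
```

find_spec avoids actually importing heavy packages until a backend is chosen, which keeps startup fast on machines with multiple frameworks installed.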
| Model | MLX Package | Install | API Style |
|---|---|---|---|
| CLIP | mlx_clip (harperreed) | pip install mlx_clip or pip install git+https://github.com/harperreed/mlx_clip.git | clip.text_encoder(texts), clip.image_encoder(path) |
| CLIP | ml-explore/mlx-examples/clip | Clone + local import | model(**inputs) → .text_embeds, .image_embeds |
| SAM 3 | mlx_sam3 (Deekshith-Dade) | pip install git+https://github.com/Deekshith-Dade/mlx_sam3.git | SAM3().segment(image) |
| SAM | PyTorch on MPS | pip install segment-anything | Standard SAM API on "mps" device |
- mlx_clip.image_encoder() expects a file path, not a numpy array — save to a temp file first:
with tempfile.NamedTemporaryFile(suffix=".png") as f:
Image.fromarray(image).save(f.name)
embedding = clip.image_encoder(f.name)
- Return types differ per backend: the PyTorch path returns torch.Tensor, mlx_clip returns mlx.core.array. Always normalize to numpy at the API boundary:
emb_np = np.array(embedding)
norms = np.linalg.norm(emb_np, axis=-1, keepdims=True)
return emb_np / (norms + 1e-8)
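The boundary normalization is easy to unit-test in pure numpy (the epsilon guards the zero-vector case):

```python
import numpy as np

def normalize_embeddings(embedding):
    """Coerce any backend's output to unit-norm numpy rows."""
    emb = np.array(embedding, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=-1, keepdims=True)
    return emb / (norms + 1e-8)

out = normalize_embeddings([[3.0, 4.0], [0.0, 0.0]])
assert np.allclose(np.linalg.norm(out[0]), 1.0)   # 3-4-5 triangle → unit norm
assert np.allclose(out[1], 0.0)                   # zero vectors stay zero, no NaN
```

With unit-norm rows on both sides, cosine similarity reduces to a plain dot product, so similarity() can be one matmul regardless of backend.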
- mlx_clip auto-converts from HF: MLXClip("openai/clip-vit-large-patch14"). First load is slow (weight conversion), subsequent loads are cached.
- mlx_sam3 does not expose the segment_anything.SamAutomaticMaskGenerator API. Write an adapter:
class _MLXSam3Generator:
def generate(self, image):
results = self._model.segment(image)
return [{"segmentation": r["mask"].astype(bool),
"area": int(r["mask"].sum()),
"predicted_iou": r.get("iou", 0.9)} for r in results]
PyTorch segment-anything with model.to("mps") runs well. Use this fallback when mlx_sam3 isn't installed:
def _try_load_torch_mps(self):
model = sam_model_registry[vit_type](checkpoint=ckpt_path)
model.to("mps") # Apple GPU via PyTorch