Guide for adding a new model to the Archon engine. Use when user wants to add support for a new HuggingFace model architecture in ArchonEngine.
Add support for a new HuggingFace model architecture in the Archon training engine.
This skill is triggered when:
- The user wants to add support for a new HuggingFace model architecture in Archon
- The user asks to create a ModelSpec or model type for Archon

Before starting, ensure:
- The target model is available on HuggingFace (a config.json with a model_type field)
- You know the HF model ID (e.g., meta-llama/Llama-3-8B)

Read the HuggingFace model's source code to extract key architecture information.
Action: Fetch and analyze the model's HuggingFace configuration and modeling files.
Read the model's config.json (via AutoConfig.from_pretrained) to identify:
- The model_type string (this is the key used for registry lookup)
- Optional architecture fields (e.g., qk_norm, attention_bias, MoE fields)

Read the HuggingFace modeling_*.py source to identify:
- The attention, FFN, MoE, RoPE, and normalization variants
- Does tie_word_embeddings appear in the config?

Summarize findings in a checklist like:
Target model: <name>
HF model_type: "<model_type>" (and variants like "<model_type>_moe" if applicable)
Attention: [standard GQA / with QK norm / with bias / sliding window / ...]
FFN: [SwiGLU / GeGLU / standard MLP / ...]
MoE: [no / yes - num_experts, top_k, shared_experts]
RoPE: [standard / YaRN / NTK-aware / ...]
Norm: [RMSNorm / LayerNorm] with [pre-norm / post-norm]
Weight tying: [yes / no]
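Several of the checklist entries can be read mechanically off the config object. A minimal sketch, using a stand-in namespace for the object returned by AutoConfig.from_pretrained; the field names follow common HF conventions and the values are illustrative, not from any real model:

```python
from types import SimpleNamespace

# Stand-in for AutoConfig.from_pretrained("<hf_model_id>"); values are
# illustrative only.
cfg = SimpleNamespace(
    model_type="llama",
    num_attention_heads=32,
    num_key_value_heads=8,
    tie_word_embeddings=False,
)

def summarize(cfg):
    """Derive the checklist entries that come straight from config.json."""
    n_heads = cfg.num_attention_heads
    # Fewer KV heads than query heads means grouped-query attention.
    n_kv = getattr(cfg, "num_key_value_heads", n_heads)
    return {
        "model_type": cfg.model_type,
        "attention": "GQA" if n_kv < n_heads else "MHA",
        "weight_tying": getattr(cfg, "tie_word_embeddings", False),
    }
```

The attention/FFN/RoPE *variants* still require reading the modeling source; only the scalar hyperparameters fall out of the config this easily.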
Choose the closest existing implementation as a starting point:
| Target characteristics | Reference | Why |
|---|---|---|
| Dense-only, standard GQA, no QK norm | qwen2 | Simplest baseline, pure dense |
| Has QK norm, or has MoE support | qwen3 | Supports QK norm + MoE + shared experts |
Action: Copy the reference model directory as the starting point:
areal/experimental/models/archon/<model>/
__init__.py
spec.py
model/
args.py
model.py
rope.py
state_dict_adapter.py
infra/
parallelize.py
### args.py

Adapt <Model>ModelArgs to match the target model's HuggingFace config fields.
Key changes from reference:
Update the @dataclass fields to match the target model's hyperparameters:
- Core fields (dim, n_layers, n_heads, n_kv_heads, vocab_size, head_dim, hidden_dim, norm_eps, rope_theta, etc.)
- Model-specific flags (attention_bias, qk_norm, sliding_window)

Update from_hf_config() to correctly map HuggingFace config attributes:
- Use getattr(hf_config, "field_name", default) for optional fields

Critical: Verify every field mapping against the HF model's config.json. Incorrect mappings here cause silent errors downstream.
Base class contract (BaseModelArgs):
```python
@dataclass
class <Model>ModelArgs(BaseModelArgs):
    # ... model-specific fields ...

    @classmethod
    def from_hf_config(
        cls,
        hf_config: PretrainedConfig,
        is_critic: bool = False,
        **kwargs,
    ) -> "<Model>ModelArgs":
        # Map HF config fields to Archon model args
        ...
```
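As a concrete (hypothetical) instance of this contract, the sketch below maps a few Llama-style HF config fields and guards the optional ones with getattr. The class name, field subset, and simplified signature are illustrative, not Archon's real code:

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class ExampleModelArgs:
    dim: int
    n_heads: int
    n_kv_heads: int
    attention_bias: bool

    @classmethod
    def from_hf_config(cls, hf_config, **kwargs) -> "ExampleModelArgs":
        return cls(
            dim=hf_config.hidden_size,
            n_heads=hf_config.num_attention_heads,
            # Optional HF fields: guard with getattr so older configs load too.
            n_kv_heads=getattr(hf_config, "num_key_value_heads",
                               hf_config.num_attention_heads),
            attention_bias=getattr(hf_config, "attention_bias", False),
        )

# Stand-in for a real PretrainedConfig; note it lacks the optional fields,
# so the getattr defaults kick in.
hf = SimpleNamespace(hidden_size=4096, num_attention_heads=32)
args = ExampleModelArgs.from_hf_config(hf)
```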
### model.py

Adapt the model architecture to match the target model.
Key components to adapt:
Normalization (RMSNorm or similar):
- Check whether elementwise_affine is configurable
- If the model uses LayerNorm, implement accordingly

Attention module:
- Projection bias (nn.Linear(..., bias=True/False))
- Keep q_norm/k_norm if the model has them, remove them if it doesn't
- n_kv_heads < n_heads for grouped-query attention
- Keep the set_cp_group / _sp_enabled pattern from the reference

FeedForward module:
- SwiGLU: w2(silu(w1(x)) * w3(x)) -- most common for modern LLMs
- A MoE module replaces FeedForward on designated layers

TransformerBlock:
- Pre-norm (most modern LLMs) vs post-norm
- _is_moe_layer() if applicable

Top-level model (<Model>Model(BaseArchonModel)):
- Submodules: tok_embeddings, layers (as ModuleDict), norm, output/score
- init_weights(): Match the initialization scheme from HF
- init_buffers(): RoPE cache + MoE buffers
- forward(): Must follow the BaseArchonModel signature:
  (tokens, positions, cu_seqlens, max_seqlen, tree_attn_meta=None) -> Tensor

Base class contract (BaseArchonModel):
```python
class <Model>Model(BaseArchonModel):
    def forward(self, tokens, positions, cu_seqlens, max_seqlen, tree_attn_meta=None) -> torch.Tensor: ...
    def init_weights(self) -> None: ...
    def init_buffers(self, buffer_device) -> None: ...
```
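The SwiGLU dataflow mentioned above, reduced to scalars so the formula is easy to check by hand. Real code applies nn.Linear layers to tensors; this is only an illustration of w2(silu(w1(x)) * w3(x)):

```python
import math

def silu(x: float) -> float:
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(x: float, w1: float, w2: float, w3: float) -> float:
    # Gate path (w1 then silu) multiplies the linear path (w3),
    # then w2 projects the product back down.
    return w2 * (silu(w1 * x) * (w3 * x))
```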
### rope.py

Handle the rotary position embedding variant.
Options:
Standard RoPE (same as qwen2/qwen3): Re-export from qwen2:
```python
from areal.experimental.models.archon.qwen2.model.rope import (
    apply_rotary_emb,
    precompute_rope_cache,
    repeat_kv,
    reshape_for_broadcast,
    rotate_half,
)
```
Custom RoPE (YaRN, NTK-aware, etc.): Implement custom precompute_rope_cache()
and apply_rotary_emb() functions. The key difference is usually in how inv_freq
is computed (scaling factors, interpolation, etc.).
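To make the "difference is in inv_freq" point concrete, here is a sketch of standard RoPE inverse frequencies next to an NTK-aware variant that rescales the base. The scaling formula base * s**(d/(d-2)) is the commonly cited NTK-aware rule; treat this as an illustration, not Archon's implementation:

```python
def rope_inv_freq(head_dim: int, theta: float = 10000.0) -> list[float]:
    # Standard RoPE: one frequency per pair of dimensions.
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_inv_freq(head_dim: int, theta: float = 10000.0,
                 scale: float = 4.0) -> list[float]:
    # NTK-aware scaling: stretch the base so low frequencies slow down
    # while the highest frequency stays unchanged.
    theta_scaled = theta * scale ** (head_dim / (head_dim - 2))
    return [theta_scaled ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```

A custom precompute_rope_cache() would then build the cos/sin cache from these frequencies exactly as the standard version does.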
### state_dict_adapter.py

Map between HuggingFace and Archon weight key names.
This is the most error-prone step. The adapter must correctly handle:
Key name mapping (from_hf_map dict):
- model.embed_tokens.weight -> tok_embeddings.weight
- model.layers.{}.self_attn.q_proj.weight -> layers.{}.attention.wq.weight
- model.layers.{}.mlp.gate_proj.weight -> layers.{}.feed_forward.w1.weight
- model.layers.{}.input_layernorm.weight -> layers.{}.attention_norm.weight
- lm_head.weight -> output.weight
- Keys mapped to None are dropped: rotary_emb.inv_freq (computed at runtime)

Reverse mapping (to_hf_map): Auto-generated from from_hf_map
MoE expert weights (if applicable): 3D<->2D conversion for expert weights. Copy the MoE handling from qwen3 if the model has MoE.
Weight tying: Skip output.weight during to_hf() if tie_word_embeddings=True
Verification approach: After implementation, the adapter should satisfy:
```python
# Roundtrip: archon -> hf -> archon preserves all keys
hf_sd = adapter.to_hf(archon_sd)
roundtrip_sd = adapter.from_hf(hf_sd)
assert set(roundtrip_sd.keys()) == set(archon_sd.keys())
```
Base class contract (BaseStateDictAdapter):
```python
class <Model>StateDictAdapter(BaseStateDictAdapter):
    def from_hf(self, hf_state_dict) -> dict[str, Any]: ...
    def to_hf(self, archon_state_dict) -> dict[str, Any]: ...
    def convert_single_to_hf(self, name, tensor) -> list[tuple[str, torch.Tensor]]: ...
```
### parallelize.py

Define the parallelization strategy for the model.

The parallelize function applies each parallelism strategy in a fixed order; follow the ordering used by the reference implementation.

Key adaptations by model architecture:
- Models with QK norm: use use_local_output=False (DTensor output for norm) and add SequenceParallel(sequence_dim=2) for q_norm/k_norm
- Models without QK norm: use_local_output=True
- MoE models: use apply_moe_ep_tp() and apply_non_moe_tp()

Function signature (must match the ParallelizeFn protocol):
```python
def parallelize_<model>(
    model: nn.Module,
    parallel_dims: ArchonParallelDims,
    param_dtype: torch.dtype = torch.bfloat16,
    reduce_dtype: torch.dtype = torch.float32,
    loss_parallel: bool = True,
    cpu_offload: bool = False,
    reshard_after_forward_policy: str = "default",
    ac_config: ActivationCheckpointConfig | None = None,
    enable_compile: bool = True,
) -> nn.Module:
    ...
```
### spec.py and registration

Assemble the ModelSpec and register it.
```python
from areal.experimental.models.archon.model_spec import ModelSpec, register_model_spec
from areal.experimental.models.archon.pipeline_parallel import pipeline_llm
from areal.experimental.models.archon.<model>.infra.parallelize import parallelize_<model>
from areal.experimental.models.archon.<model>.model.args import <Model>ModelArgs
from areal.experimental.models.archon.<model>.model.model import <Model>Model
from areal.experimental.models.archon.<model>.model.state_dict_adapter import (
    <Model>StateDictAdapter,
)

<MODEL>_SPEC = ModelSpec(
    name="<Model>",
    model_class=<Model>Model,
    model_args_class=<Model>ModelArgs,
    state_dict_adapter_class=<Model>StateDictAdapter,
    parallelize_fn=parallelize_<model>,
    supported_model_types=frozenset({"<model_type>"}),  # From HF config.json
    pipelining_fn=pipeline_llm,
)

# Auto-register when module is imported
register_model_spec(<MODEL>_SPEC)

__all__ = ["<MODEL>_SPEC"]
```
Note: supported_model_types should include all HF model_type strings that this
implementation handles (e.g., {"qwen3", "qwen3_moe"} for Qwen3).
### __init__.py

Add the import to areal/experimental/models/archon/__init__.py:

```python
from areal.experimental.models.archon.<model> import spec as <model>_spec  # noqa: F401
```
This triggers auto-registration when the module is imported.
Verification should be done in stages, adapting based on available hardware and the test
patterns in tests/experimental/archon/.
Before writing tests, examine the existing test files to understand current