End-to-end guide for implementing, testing, and optimizing neural network operators in the ESP-DL framework. Covers C++ module implementation, C reference kernels, SIMD assembly optimization, esp-ppq quantization strategy integration, Docker-based build/test, and inference result alignment between esp-dl and esp-ppq. Use this skill whenever the user wants to add a new operator, implement an operator, optimize an existing operator with SIMD, add quantization support for an operator, or test/validate operator correctness. Also triggers for "算子实现", "添加算子", "SIMD优化", "量化支持", "算子对齐" and similar phrases.
This skill guides you through the complete lifecycle of implementing a neural network operator in ESP-DL: from C++ module code, through quantization support in esp-ppq, to Docker-based validation that ensures inference results align between the quantization tool and the on-device runtime.
This skill describes a multi-phase pipeline: research → implement → test → optimize → document. The most critical transition is from code modification (Phases 2–5) to testing (Phase 6).
After completing ANY code change — whether it's a new module, a base layer fix, an esp-ppq tweak, or a test config update — immediately proceed to Phase 6 (Docker Build & Test) without stopping to ask the user. The user expects the full implement-then-test cycle to happen as one continuous flow. Pausing after code changes to ask "should I run tests now?" breaks the workflow and forces unnecessary back-and-forth.
The only reasons to pause before testing are:
When multiple code files need modification, complete ALL code changes first (Phases 2–5 as applicable), then run the full test pipeline once. Don't test after each individual file change.
esp-dl/esp-dl/
├── dl/module/include/dl_module_<op>.hpp # Module layer: interface, shape inference, forward dispatch
├── dl/module/include/dl_module_creator.hpp # Operator registry
├── dl/base/dl_base_<op>.hpp/.cpp # Base layer: C reference impl + ISA dispatch
├── dl/base/isa/tie728/dl_tie728_*.S # TIE728 SIMD (ESP32-S3)
├── dl/base/isa/esp32p4/dl_esp32p4_*.S # ESP32-P4 SIMD (RISC-V PIE)
esp-dl/tools/ops_test/
├── config/op_cfg.toml # Test configurations per operator
├── torch_ops_test.py # PyTorch-based test model builders
├── onnx_ops_test.py # ONNX-based test model builders
├── gen_test_cases.py # Generates quantized .espdl test models
esp-dl/test_apps/esp-dl/ # Test application (builds + runs on hardware)
esp-ppq/esp_ppq/
├── quantization/quantizer/EspdlQuantizer.py # Quantization config per op type
├── parser/espdl/espdl_typedef.py # Op set classifications
├── parser/espdl/export_patterns.py # Export pattern rules (LUT, layout, fusion)
├── IR/base/opdef.py # OpSocket definitions for dispatch
├── executor/op/torch/espdl.py # LUT computation backend
Before writing any code, understand what you're building.
Look up the operator at https://onnx.ai/onnx/operators/onnx__<OpName>.html.
Understand its inputs, outputs, attributes, broadcasting rules, and edge cases.
The classification determines which templates and patterns to follow:
| Category | Examples | Module Pattern | Base Pattern |
|---|---|---|---|
| Elementwise binary | Add, Sub, Mul, Div, Mod, Pow | dl_module_add.hpp | dl_base_add.hpp/cpp (elemwiseArgsType) |
| Elementwise unary | Relu, Sigmoid, Exp, Neg, Sqrt | dl_module_relu.hpp | dl_base_relu.hpp/cpp (ArgsType) |
| Convolution-like | Conv, ConvTranspose, DepthwiseConv | dl_module_conv.hpp | dl_base_conv2d.hpp/cpp |
| Pooling | AveragePool, MaxPool, GlobalAveragePool | dl_module_average_pool.hpp | dl_base_avg_pool2d.hpp/cpp |
| Reduce | ReduceSum, ReduceMean, ReduceMax | dl_module_reduce_sum.hpp | dl_base_reduce.hpp/cpp |
| Shape manipulation | Reshape, Transpose, Flatten, Slice | dl_module_reshape.hpp | Typically no base layer needed |
| Sequence/RNN | GRU, LSTM | dl_module_gru.hpp | Complex multi-step base |
| Activation (LUT) | HardSwish, HardSigmoid, Tanh | dl_module_lut.hpp | LUT-based implementation |
Read 2-3 reference implementations from the same category. The file
references/esp-dl-templates.md has annotated templates for each category.
Decide which data types to support: int8, int16, float32.
Default rule: ALL operators MUST implement float32 unless technically impossible. Float32 serves as the high-precision inference path — it preserves full model accuracy without quantization loss and is the baseline for correctness validation. Every operator that can accept float inputs and produce float outputs should support float32, regardless of whether it is "compute-heavy" or "typically run quantized". Conv, ConvTranspose, MatMul, Linear, and all other operators should include float32 support.
When float32 is appropriate (the vast majority of operators):
The only exceptions where float32 may be omitted:
If you are unsure whether an operator should support float32, the answer is yes, it should. Only omit float32 when the operator's semantics make float input/output meaningless.
Float32 implementation is generally simpler than quantized: no scale/rescale, no truncation, no exponent handling, and no SIMD optimization needed.
Create esp-dl/dl/module/include/dl_module_<op_snake>.hpp where <op_snake> is the
snake_case version of the ONNX operator name (e.g., HardSwish → hard_swish).
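As a rough illustration of the naming rule, the conversion can be sketched in Python. The helper names `onnx_op_to_snake` and `module_header_name` are hypothetical and not part of ESP-DL. The sketch only inserts an underscore where a lowercase letter or digit is followed by an uppercase letter, so an all-caps name such as GRU stays one word. Where an existing file in the repository is named differently, the repository is the authority.

```python
import re


def onnx_op_to_snake(op_name: str) -> str:
    """Convert a PascalCase ONNX operator name to snake_case.

    Hypothetical helper for illustration only. An underscore is inserted
    at each lowercase/digit -> uppercase boundary, then the result is
    lowercased, so acronyms such as GRU are kept together.
    """
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", op_name).lower()


def module_header_name(op_name: str) -> str:
    """Build the module header file name following dl_module_<op_snake>.hpp."""
    return f"dl_module_{onnx_op_to_snake(op_name)}.hpp"


if __name__ == "__main__":
    for name in ("HardSwish", "ReduceSum", "GRU"):
        print(name, "->", module_header_name(name))
```

The printed names for HardSwish and ReduceSum agree with the module patterns listed in the category table (dl_module_reduce_sum.hpp, for example).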
Every operator module must:

- Inherit from Module (defined in dl_module_base.hpp)
- get_output_shape(): compute output shape from input shapes
- forward(): dispatch to the correct typed forward_template<T>()
- forward_template<T>(): get tensors from context, prepare args, call base layer
- deserialize(): reconstruct from FlatBuffers model
- forward_args(): for dual-core dispatch support
- print(): debug info

Key conventions:

- Use #pragma once as header guard
- Namespace: dl::module
- Constructor takes name, inplace, quant_type at minimum
- quant_type dispatches to QUANT_TYPE_SYMM_8BIT, QUANT_TYPE_SYMM_16BIT, QUANT_TYPE_FLOAT32

See references/esp-dl-templates.md for full annotated templates.
The deserialize() static method reads attributes from FlatBuffers:
static Module *deserialize(fbs::FbsModel *fbs_model, std::string node_name)
{
    Module *op = nullptr;
    quant_type_t quant_type;
    fbs_model->get_operation_attribute(node_name, "quant_type", quant_type);
    // Read operator-specific attributes
    int some_attr;
    fbs_model->get_operation_attribute(node_name, "some_attr", some_attr);
    op = new MyOp(node_name.c_str(), MODULE_NON_INPLACE, quant_type, some_attr);
    return op;
}
Add the operator to dl_module_creator.hpp in the register_dl_modules() method:
this->register_module("MyOp", MyOp::deserialize);
Also add the #include "dl_module_<op_snake>.hpp" at the top of the creator header.
→ Continue to Phase 3 if a base layer is needed, or skip to Phase 4 (esp-ppq) if this is a shape-only op. After all code phases are done, proceed directly to Phase 6 for testing.
The base layer provides the actual computation kernel. Create:
- esp-dl/dl/base/dl_base_<op_snake>.hpp: declarations
- esp-dl/dl/base/dl_base_<op_snake>.cpp: C reference implementation

Module::forward_template<T>()
  → prepares ArgsType / elemwiseArgsType
  → calls base::<op_function>(args)
  → selects ISA-optimized or C reference impl
  → executes the kernel
For elementwise binary ops, use elemwiseArgsType<T> and the elemwise_loop_*d() helpers.
For unary ops, use ArgsType<T> and the activation_shell() helper.
For other ops, define a custom args struct.
In the .cpp file, the implementation selection follows this pattern:
#if CONFIG_ESP32P4_BOOST
impl_func = dl_esp32p4_s8_<op>_11c; // P4 SIMD
#elif CONFIG_TIE728_BOOST
impl_func = dl_tie728_s8_<op>_11c; // S3 SIMD
#else
impl_func = c_impl_<op>; // C reference (always present)
#endif
The C reference implementation is the fallback and must always exist. SIMD implementations are added as a later optimization step.
Float32 kernels are fundamentally simpler than int8/int16 quantized kernels because there is no quantization overhead. Here are the key differences:
| Aspect | int8 / int16 (quantized) | float32 |
|---|---|---|
| Arithmetic | tool::truncate<int32_t>(result) — clamp to type range | Direct arithmetic, no truncation |
| Scale/Rescale | Uses args->mul_shift, input_scale, output_rescale | Ignores these fields (exponent=0, scale=1.0) |
| SIMD dispatch | ISA-specific implementations (TIE728, ESP32-P4) | C reference only — no SIMD needed |
| Template specialization | Generic template handles quantization math | Explicit template<> specialization for float |
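To make the contrast in the table concrete, here is a minimal Python sketch of an elementwise add in both styles. It assumes a power-of-two scale (real value = integer × 2^exponent) and uses Python's built-in `round`, which may not match ESP-DL's rounding mode. It also goes through floating point instead of the integer mul_shift path used on device. All function names are hypothetical; this is a model of the arithmetic, not the ESP-DL kernel.

```python
def saturate(value: int, bits: int = 8) -> int:
    """Clamp an integer to the signed range of the given bit width."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def quantize(x: float, exponent: int, bits: int = 8) -> int:
    """Map a real value to an integer, assuming scale = 2 ** exponent."""
    return saturate(round(x / 2.0 ** exponent), bits)


def dequantize(q: int, exponent: int) -> float:
    """Map a quantized integer back to its real value."""
    return q * 2.0 ** exponent


def add_float(a: float, b: float) -> float:
    """float32 path: direct arithmetic, no rescale and no clamping."""
    return a + b


def add_quantized(qa: int, qb: int, exp_a: int, exp_b: int,
                  exp_out: int, bits: int = 8) -> int:
    """Quantized path: align scales, add, rescale to the output exponent, clamp."""
    real_sum = dequantize(qa, exp_a) + dequantize(qb, exp_b)
    return quantize(real_sum, exp_out, bits)


if __name__ == "__main__":
    print(add_float(6.25, 6.25))
    # Same inputs in int8 with exponent -4: the sum overflows and is clamped.
    print(add_quantized(100, 100, -4, -4, -4))
    # A larger output exponent keeps the sum in range.
    print(add_quantized(100, 100, -4, -4, -3))
```

The quantized path needs an output exponent chosen so that the result fits the integer range, which is exactly the bookkeeping the float32 path avoids.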
Two patterns for float implementation:
Pattern A — Base layer with float specialization (recommended for binary/complex ops):
The base layer .cpp provides a template<> ... <float> specialization that does direct
arithmetic. The module calls the same base::op(args) function for all types. This pattern
keeps the module layer clean and uniform. See dl_base_add.cpp for example.
Pattern B — Module-level inline implementation (acceptable for simple unary ops):
Some simple unary ops (like ReLU) implement float32 directly in the module's forward()
method without calling the base layer. This avoids creating a base-layer float overload for
trivial operations. The float path is a simple loop over elements.
// Pattern B example: float implemented directly in forward()
void forward(ModelContext *context, runtime_mode_t mode)
{
    if (quant_type == QUANT_TYPE_SYMM_8BIT) {
        forward_template<int8_t>(context, mode);
    } else if (quant_type == QUANT_TYPE_SYMM_16BIT) {
        forward_template<int16_t>(context, mode);
    } else if (quant_type == QUANT_TYPE_FLOAT32) {
        TensorBase *input = context->get_tensor(m_inputs_index[0]);
        TensorBase *output = context->get_tensor(m_outputs_index[0]);
        float *input_ptr = (float *)input->get_element_ptr();
        float *output_ptr = (float *)output->get_element_ptr();
        for (size_t i = 0; i < input->size; i++) {
            output_ptr[i] = /* direct float operation on input_ptr[i] */;
        }
    }
}
Use Pattern A when the operation has multiple broadcast variants, multi-dimensional looping, or dual-core dispatch. Use Pattern B only for straightforward element-by-element operations with a single input.
See references/esp-dl-templates.md for complete float32 template examples.
→ Continue to Phase 4 (esp-ppq checks). Do not stop here — esp-ppq modifications and test configuration (Phases 4–5) are prerequisites for testing.
Every new operator needs at least TWO checks in esp-ppq, because the export pipeline has two independent systems that must both recognize the operator:
- quant_operation_types determines if an op gets quantized
- layout_patterns in layout_patterns.py handles NCHW→NHWC transformation. Every operator in the graph MUST be in one of the layout pattern op sets, otherwise reset_graph_layout() will error with "Can not reset {op_type} layout"

Important: Float32 and esp-ppq. When float=True is passed to the quantization API,
the entire graph uses TargetPlatform.FP32 and skips quantization entirely: the model is
loaded from ONNX and exported directly without calling EspdlQuantizer. This means:

- InsertQuantTypePattern sets quant_type = EspQuantType.F32 for all ops, and patterns like ResetParamLayoutPattern and AddLUTPattern short-circuit when they see quant_type == F32
- reset_graph_layout() runs for both quantized and float32 exports

Check EspdlQuantizer.quant_operation_types in
esp-ppq/esp_ppq/quantization/quantizer/EspdlQuantizer.py.
If the operator is NOT listed → add it.
Check esp-ppq/esp_ppq/parser/espdl/espdl_typedef.py — the operator MUST be in one of
these op sets, which map to layout transformation patterns in layout_patterns.py:
| Op Set in espdl_typedef.py | Layout Pattern | When to Use |
|---|---|---|
| CONV_LAYOUT_OP_SET | ResetConvLayoutPattern | Conv, Pool, DepthToSpace: ops with spatial layout |
| PASSIVE_LAYOUT_OP_SET | BypassPassiveLayoutPattern | Activations (Relu, Sigmoid...) + Math (Exp, Log...): pass through layout |
| ADD_LIKE_OP_SET | BypassAddLikePattern | Binary elementwise (Add, Sub, Mul, Div, Mod, Pow...): handles shape broadcasting between two inputs |
| AXIS_TRANSFORM_OP_SET | AxisTransformPattern | Softmax, Split, Reduce ops: transforms axis attributes |
| OTHER_OP_SET | RestoreOriginLayoutPattern | Reshape, Transpose, Gather, GRU...: restores to original layout |
The BypassAddLikePattern is particularly important for binary elementwise ops: it ensures
that when the two inputs have different permutations (due to upstream layout changes), the
pattern either propagates the permutation consistently or inserts a transpose to fix the
mismatch. Without this, binary ops will produce incorrect results after layout transformation.
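The permutation mismatch described above can be sketched with plain axis lists. This is not the BypassAddLikePattern implementation from esp-ppq; the function names are hypothetical, and the sketch only shows how a corrective transpose can be derived when two inputs of a binary op carry different permutations of the same original layout.

```python
from typing import List, Optional

NCHW_TO_NHWC = [0, 2, 3, 1]
IDENTITY = [0, 1, 2, 3]


def apply_perm(shape: List[int], perm: List[int]) -> List[int]:
    """Transpose semantics: output axis i takes input axis perm[i]."""
    return [shape[axis] for axis in perm]


def invert_perm(perm: List[int]) -> List[int]:
    """Return the permutation that undoes `perm`."""
    inverse = [0] * len(perm)
    for new_axis, old_axis in enumerate(perm):
        inverse[old_axis] = new_axis
    return inverse


def transpose_to_match(perm_a: List[int], perm_b: List[int]) -> Optional[List[int]]:
    """Permutation to apply to input B so its layout matches input A.

    Both perms are expressed relative to the same original layout.
    Returns None when the layouts already agree and no transpose is needed.
    """
    if perm_a == perm_b:
        return None
    inverse_b = invert_perm(perm_b)
    return [inverse_b[axis] for axis in perm_a]


if __name__ == "__main__":
    original = [1, 16, 32, 32]                     # N, C, H, W
    a_shape = apply_perm(original, NCHW_TO_NHWC)   # input A was moved to NHWC
    b_shape = apply_perm(original, IDENTITY)       # input B is still NCHW
    fix = transpose_to_match(NCHW_TO_NHWC, IDENTITY)
    print(a_shape, b_shape, fix, apply_perm(b_shape, fix))
```

After the corrective transpose both inputs have the same shape, so the elementwise kernel pairs up the right elements.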
If the operator is NOT in any op set → add it to the correct one. Even if the operator
is already in quant_operation_types, a missing op set entry will cause export failure.
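The two independent checks can be pictured as two separate lookups. In the sketch below, the set contents are only the example operators named in the table above, not the real contents of espdl_typedef.py, and `find_layout_op_set` and `check_operator` are hypothetical helpers. In practice the authoritative sets must be read from the esp-ppq sources.

```python
from typing import List, Optional, Set

# Illustrative stand-ins built from the examples in the table above;
# the real sets live in esp_ppq/parser/espdl/espdl_typedef.py.
LAYOUT_OP_SETS = {
    "CONV_LAYOUT_OP_SET": {"Conv", "AveragePool", "MaxPool", "DepthToSpace"},
    "PASSIVE_LAYOUT_OP_SET": {"Relu", "Sigmoid", "Exp", "Log"},
    "ADD_LIKE_OP_SET": {"Add", "Sub", "Mul", "Div", "Mod", "Pow"},
    "AXIS_TRANSFORM_OP_SET": {"Softmax", "Split", "ReduceSum", "ReduceMean"},
    "OTHER_OP_SET": {"Reshape", "Transpose", "Gather", "GRU"},
}


def find_layout_op_set(op_type: str) -> Optional[str]:
    """Return the name of the layout op set containing op_type, or None."""
    for set_name, members in LAYOUT_OP_SETS.items():
        if op_type in members:
            return set_name
    return None


def check_operator(op_type: str, quant_operation_types: Set[str]) -> List[str]:
    """Run both checks and return a list of human-readable problems."""
    problems = []
    if op_type not in quant_operation_types:
        problems.append(f"{op_type} is not in quant_operation_types")
    if find_layout_op_set(op_type) is None:
        problems.append(f"{op_type} is not in any layout op set")
    return problems


if __name__ == "__main__":
    # Quantizer knows the op, but the layout side does not: export would fail.
    print(check_operator("MyOp", {"MyOp", "Relu"}))
```

An empty list means both systems recognize the operator; any entry points at the file that still needs an edit.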
Most operators use the default quantization config. Special rules are needed when:
- LUT-based activations: ACTIVATION_OP_SET in espdl_typedef.py and AddLUTPattern in export_patterns.py handles it

Most operators use DEFAULT_SOCKET_CREATOR. Custom sockets are needed when inputs
have different platform requirements (e.g., Gather's index input stays FP32).
Check DEFAULT_SOCKET_TABLE in esp-ppq/esp_ppq/IR/base/opdef.py.
Beyond the layout patterns above, check export_patterns.py for:
- Fusion patterns (FuseReluLikePattern)

| Check | File | Action |
|---|---|---|
| In quant_operation_types? | EspdlQuantizer.py | Add if missing |
| In a layout op set? | espdl_typedef.py | Always verify — add to correct op set |
| Special quant config? | EspdlQuantizer.py | Add rules in create_espdl_quant_config() if needed |
| Custom OpSocket? | IR/base/opdef.py | Add if inputs have heterogeneous platform needs |
| Export patterns? | export_patterns.py | Add if LUT/fusion/weight-layout needed |
| Operator Category | Add to Op Set | Why |
|---|---|---|
| Elementwise binary (Add-like) | ADD_LIKE_OP_SET | BypassAddLikePattern handles input shape broadcasting |
| Elementwise unary (activation) | ACTIVATION_OP_SET | BypassPassiveLayoutPattern passes through layout |
| Elementwise unary (math) | MATH_OP_SET | Also covered by PASSIVE_LAYOUT_OP_SET |
| Convolution-like | CONV_LAYOUT_OP_SET | ResetConvLayoutPattern transforms spatial layout |
| Reduce / Softmax-like | REDUCE_OP_SET or SOFTMAX_LIKE_OP_SET | AxisTransformPattern adjusts axis attrs |
| Shape manipulation | OTHER_OP_SET | RestoreOriginLayoutPattern restores original |
→ Continue to Phase 5 to configure test cases. Test configuration is the last step before the actual build & test pipeline.
If PyTorch has the operator, add a test class in tools/ops_test/torch_ops_test.py:
class MYOP_TEST(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Initialize PyTorch op from config params

    def forward(self, *inputs):
        # Compute forward pass
        return output
If PyTorch doesn't have it (or ONNX-only), add a function in tools/ops_test/onnx_ops_test.py:
def MYOP_TEST(config) -> onnx.ModelProto:
    # Build ONNX graph using onnx.helper
    return model
Add to tools/ops_test/config/op_cfg.toml:
[ops_test.MyOp]
test_func = "MYOP_TEST"
quant_bits = ["int8", "int16", "float32"]
package = "torch_ops_test" # or "onnx_ops_test"
targets = ["esp32s3", "esp32p4"]
[[ops_test.MyOp.cfg]]
input_shape = [1, 16, 32, 32]
export_name_prefix = "myop_basic_test"
# operator-specific parameters...
[[ops_test.MyOp.cfg]]
input_shape = [1, 3, 8, 8]
export_name_prefix = "myop_edge_case"
# different parameters for edge cases...
quant_bits field: Controls which quantization types to generate test cases for.

- "int8" → generates *_s8.espdl test models (quantized to 8-bit)
- "int16" → generates *_s16.espdl test models (quantized to 16-bit)
- "float32" → generates *_f32.espdl test models (no quantization, direct float)

All operators should include all three: ["int8", "int16", "float32"].
Only omit "float32" for the rare ops where float output is meaningless (e.g., comparison ops that output boolean).
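The mapping from quant_bits entries to generated models can be written down as a small table in code. The sketch below turns a quant_bits list into argument lists for gen_test_cases.py, using only the flags and paths that appear in this guide (`--bits 8`, `--bits 16`, `--float`). The helper name `gen_test_case_commands` and the dictionary are hypothetical, and the commands are built but not executed.

```python
from typing import Dict, List, Tuple

# quant_bits entry -> (model file pattern, generator flags)
QUANT_BITS_INFO: Dict[str, Tuple[str, List[str]]] = {
    "int8": ("*_s8.espdl", ["--bits", "8"]),
    "int16": ("*_s16.espdl", ["--bits", "16"]),
    "float32": ("*_f32.espdl", ["--float"]),
}


def gen_test_case_commands(op_type: str, target: str,
                           quant_bits: List[str]) -> List[List[str]]:
    """Build one gen_test_cases.py argument list per quant_bits entry."""
    commands = []
    for entry in quant_bits:
        if entry not in QUANT_BITS_INFO:
            raise ValueError(f"unsupported quant_bits entry: {entry}")
        _, flags = QUANT_BITS_INFO[entry]
        commands.append([
            "python", "tools/ops_test/gen_test_cases.py",
            "--config", "tools/ops_test/config/op_cfg.toml",
            "--ops", op_type,
            "--output-path", f"test_apps/esp-dl/models/{target}",
            "--target", target,
            *flags,
        ])
    return commands


if __name__ == "__main__":
    for cmd in gen_test_case_commands("MyOp", "esp32p4", ["int8", "int16", "float32"]):
        print(" ".join(cmd))
```

Note that float32 is selected with `--float`, not with a bit width.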
Create 3-5 test configurations covering:
→ All code changes are now complete. Proceed IMMEDIATELY to Phase 6 to build and test. Do not stop to ask the user — go straight into the Docker build & test pipeline.
This phase should run automatically after any code modification — do not wait for user confirmation to start. If you just completed Phases 2–5 (or any subset of them), execute the steps below immediately. The user expects code changes to be validated, not left untested.
All build and test commands run inside a Docker container. The Docker image
espdl/idf-ppq:latest contains ESP-IDF v5.4.3, PyTorch, and esp-ppq.
Every Docker command uses the same base template. Define these variables first,
then build every command from DOCKER_BASE and DOCKER_PREAMBLE:
# ========== Configuration (set these once) ==========
OP_TYPE="MyOp" # ONNX operator name (PascalCase)
TARGET="esp32p4" # Target chip: esp32p4, esp32s3, esp32
ESP_DL_ROOT="/path/to/esp-dl" # esp-dl project root
ESP_PPQ_ROOT="/path/to/esp-ppq" # esp-ppq root (optional, for editable mode)
ESP_DL_IMAGE="espdl/idf-ppq:latest" # Docker image name
SKILL_DIR="/path/to/skills/espdl-operator" # Skill base directory (the directory containing SKILL.md).
# The agent should resolve this from its skill load path.
# ========== Auto-build Docker image if missing ==========
# The Dockerfile lives at assets/docker/Dockerfile inside this skill's directory.
# Building takes 20-30 min (downloads ESP-IDF + PyTorch). Only runs once.
if ! docker image inspect "${ESP_DL_IMAGE}" > /dev/null 2>&1; then
    echo "Docker image ${ESP_DL_IMAGE} not found. Building (this may take 20-30 minutes)..."
    docker build -t "${ESP_DL_IMAGE}" "${SKILL_DIR}/assets/docker"
    if [ $? -ne 0 ]; then
        echo "ERROR: Failed to build Docker image ${ESP_DL_IMAGE}. Fix Dockerfile issues and retry."
        return 1 2>/dev/null || exit 1
    fi
    echo "Docker image ${ESP_DL_IMAGE} built successfully."
fi
# ========== Docker base command builder ==========
DOCKER_BASE="docker run --rm -i -v ${ESP_DL_ROOT}:/esp-dl -w /esp-dl"
if [ -n "${ESP_PPQ_ROOT}" ] && [ -d "${ESP_PPQ_ROOT}" ]; then
    DOCKER_BASE="${DOCKER_BASE} -v ${ESP_PPQ_ROOT}:/esp-ppq"
    PPQ_INSTALL="pip install -e /esp-ppq[cpu] > /dev/null 2>&1"
else
    PPQ_INSTALL="pip install esp-ppq > /dev/null 2>&1"
fi
DOCKER_PREAMBLE=". \$IDF_PATH/export.sh && ${PPQ_INSTALL}"
Important: The auto-build block above MUST be included whenever Phase 6 commands are
executed. It is idempotent — if the image already exists, docker image inspect succeeds
instantly and the build is skipped. On first run, the build pulls espressif/idf:v5.4.3 as
the base image and installs PyTorch + other dependencies, which takes 20-30 minutes.
If the build fails (e.g., network issues), the script exits early with an error message
so subsequent Docker commands don't fail with a confusing "image not found" error.
Generates .espdl model files with embedded test values:
# Generate int8 test cases (quantized, produces *_s8.espdl)
${DOCKER_BASE} ${ESP_DL_IMAGE} bash -c "${DOCKER_PREAMBLE} && \
python tools/ops_test/gen_test_cases.py \
--config tools/ops_test/config/op_cfg.toml \
--ops ${OP_TYPE} \
--output-path test_apps/esp-dl/models/${TARGET} \
--target ${TARGET} \
--bits 8"
# Generate int16 test cases (quantized, produces *_s16.espdl)
${DOCKER_BASE} ${ESP_DL_IMAGE} bash -c "${DOCKER_PREAMBLE} && \
python tools/ops_test/gen_test_cases.py \
--config tools/ops_test/config/op_cfg.toml \
--ops ${OP_TYPE} \
--output-path test_apps/esp-dl/models/${TARGET} \
--target ${TARGET} \
--bits 16"
# Generate float32 test cases (no quantization, produces *_f32.espdl)
${DOCKER_BASE} ${ESP_DL_IMAGE} bash -c "${DOCKER_PREAMBLE} && \
python tools/ops_test/gen_test_cases.py \
--config tools/ops_test/config/op_cfg.toml \
--ops ${OP_TYPE} \
--output-path test_apps/esp-dl/models/${TARGET} \
--target ${TARGET} \
--float"
Note: --bits 8 and --bits 16 control quantized test generation. The --float flag
(not --bits 32) triggers float32 test generation. Float32 test cases are only generated
when "float32" is present in the operator's quant_bits in op_cfg.toml.
Compiles the esp-dl test app with the operator's model data:
${DOCKER_BASE} ${ESP_DL_IMAGE} bash -c "${DOCKER_PREAMBLE} && \
python test_apps/build_apps.py test_apps/esp-dl \
-op ${OP_TYPE} -t ${TARGET} -vv"
Creates the pytest file for the specific operator:
${DOCKER_BASE} ${ESP_DL_IMAGE} bash -c "${DOCKER_PREAMBLE} && \
python test_apps/esp-dl/gen_op_test.py \
--target ${TARGET} --env ${TARGET} \
--op_type ${OP_TYPE} \
--pytest_file test_apps/esp-dl/pytest_espdl_op.py"
Before flashing, always detect the serial port programmatically — never assume the device
is disconnected or ask the user without checking first. Run the detection command below
and inspect the output. If it returns one or more device paths (e.g. /dev/ttyUSB0), the
device IS connected — proceed directly to flashing. Only if the command returns empty output
should you inform the user that no device was found.
# Step A: Detect serial port (ALWAYS run this first, don't skip)
ls /dev/ttyUSB* /dev/ttyACM* 2>/dev/null
# Step B: Set the port and flash (only after Step A confirms a device exists)
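SERIAL_PORT=$(ls /dev/ttyUSB* /dev/ttyACM* 2>/dev/null | head -1)
# Stop early if Step A found nothing, so docker is never given an empty --device
[ -n "${SERIAL_PORT}" ] || { echo "ERROR: no serial device found" >&2; return 1 2>/dev/null || exit 1; }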
${DOCKER_BASE} --device ${SERIAL_PORT} --group-add dialout \
${ESP_DL_IMAGE} bash -c "${DOCKER_PREAMBLE} && \
pytest test_apps/esp-dl/pytest_espdl_op.py \
--target ${TARGET} --env ${TARGET} \
--model ${OP_TYPE} -v"
Common device paths on Linux: /dev/ttyUSB0, /dev/ttyUSB1, /dev/ttyACM0.
If multiple ports exist, the first one (head -1) is usually correct for single-board setups.
For JTAG+UART dual-port setups (common on ESP32-P4 devkits), both /dev/ttyUSB0 and
/dev/ttyUSB1 may appear — the higher-numbered port is typically the UART console.
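The port selection rule above can also be expressed as a small function. The sketch below is a hypothetical helper, not part of the test tooling. It sorts ports with a natural ordering so that /dev/ttyUSB2 comes before /dev/ttyUSB10, returns the first port by default, and returns the last one when `dual_port` is set. Treating "last in sorted order" as "the UART console" is an assumption that only holds for the dual-port layout described above.

```python
import glob
import re
from typing import List, Optional


def natural_key(path: str) -> list:
    """Sort key that compares embedded numbers numerically."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", path)]


def list_serial_ports() -> List[str]:
    """Return candidate serial ports on Linux, naturally sorted."""
    ports = glob.glob("/dev/ttyUSB*") + glob.glob("/dev/ttyACM*")
    return sorted(ports, key=natural_key)


def pick_port(ports: List[str], dual_port: bool = False) -> Optional[str]:
    """Pick a port: first one by default, last one for dual-port boards.

    Returns None when no port is present, so the caller can report
    that no device was found instead of guessing.
    """
    if not ports:
        return None
    ordered = sorted(ports, key=natural_key)
    return ordered[-1] if dual_port else ordered[0]


if __name__ == "__main__":
    found = list_serial_ports()
    print("detected:", found)
    print("selected:", pick_port(found))
```

Returning None for an empty list mirrors the rule that the user is only told about a missing device after detection has actually been run.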
For convenience, you can chain Steps 1-3 in a single Docker run (excluding hardware test which needs device access):
${DOCKER_BASE} ${ESP_DL_IMAGE} bash -c "${DOCKER_PREAMBLE} && \
python tools/ops_test/gen_test_cases.py \
--config tools/ops_test/config/op_cfg.toml \
--ops ${OP_TYPE} \
--output-path test_apps/esp-dl/models/${TARGET} \
--target ${TARGET} --bits 8 && \
python tools/ops_test/gen_test_cases.py \
--config tools/ops_test/config/op_cfg.toml \
--ops ${OP_TYPE} \
--output-path test_apps/esp-dl/models/${TARGET} \
--target ${TARGET} --bits 16 && \
python tools/ops_test/gen_test_cases.py \
--config tools/ops_test/config/op_cfg.toml \
--ops ${OP_TYPE} \
--output-path test_apps/esp-dl/models/${TARGET} \
--target ${TARGET} --float && \
python test_apps/build_apps.py test_apps/esp-dl \
-op ${OP_TYPE} -t ${TARGET} -vv"
The test framework works by:
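1. Generating .espdl models with export_test_values=True
2. Loading the embedded test values from the .espdl file and running inference
3. Comparing results with equal(output, expected, tolerance=2e-5); int16 allows ±1 error

Differences by quantization type: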
| Type | Model suffix | Tolerance | Common failure causes |
|---|---|---|---|
| int8 | *_s8.espdl | Strict (2e-5) | Quantization config mismatch, rounding, exponent calculation |
| int16 | *_s16.espdl | ±1 allowed | Similar to int8, but wider range means fewer edge cases |
| float32 | *_f32.espdl | 2e-5 | Usually data layout (NCHW vs NHWC), or missing float specialization |
Float32 tests are the easiest to debug because there's no quantization involved — if a float32 test fails, the issue is almost certainly in the computation logic itself or the data layout, not in scale/exponent handling. Start debugging with float32 tests first.
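The tolerance rules in the table can be modelled in a few lines. This is a sketch of the comparison policy only, not the `equal()` routine used by the test app: it assumes the 2e-5 float tolerance is absolute, that int8 results must match exactly, and that int16 results may differ by at most 1. The function name `outputs_match` is hypothetical.

```python
import math
from typing import Sequence

FLOAT_TOLERANCE = 2e-5   # from the table above; assumed to be an absolute tolerance
INT16_MAX_ERROR = 1      # int16 allows a difference of 1


def outputs_match(output: Sequence[float], expected: Sequence[float],
                  quant_type: str) -> bool:
    """Compare device output against expected values for one quantization type."""
    if len(output) != len(expected):
        return False
    pairs = list(zip(output, expected))
    if quant_type == "float32":
        return all(math.isclose(o, e, rel_tol=0.0, abs_tol=FLOAT_TOLERANCE)
                   for o, e in pairs)
    if quant_type == "int16":
        return all(abs(o - e) <= INT16_MAX_ERROR for o, e in pairs)
    if quant_type == "int8":
        return all(o == e for o, e in pairs)
    raise ValueError(f"unknown quant_type: {quant_type}")


if __name__ == "__main__":
    print(outputs_match([1.0, 2.0], [1.00001, 2.0], "float32"))
    print(outputs_match([100, -7], [101, -7], "int16"))
    print(outputs_match([100, -7], [101, -7], "int8"))
```

Running the same output through all three policies is a quick way to see whether a failure is a genuine logic error or a one-step rounding difference.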
When all tests pass on all targets: proceed to Phase 9 to update operator_support_state.md.
This is not optional — the operator documentation must stay in sync with the test config.
Do not consider the task complete until Phase 9 is done.
If tests fail, check:
- float template specialization (float32 only)

After the C reference implementation passes all tests, you can add SIMD-optimized kernels. This is optional and should only be done when performance matters.
SIMD optimization is worthwhile for:
Skip SIMD for:
TIE728 (ESP32-S3) SIMD instructions include EE.VLD.128.IP, EE.VRELU.S8, EE.VSMULAS.S8.QACC.

Assembly files go in:

- dl/base/isa/tie728/dl_tie728_<dtype>_<op>.S
- dl/base/isa/esp32p4/dl_esp32p4_<dtype>_<op>.S

Function naming:

- dl_tie728_s8_<op>_11c: aligned, int8, TIE728
- dl_tie728_s8_unaligned_<op>_11c: unaligned variant
- dl_esp32p4_s8_<op>_11c: aligned, int8, ESP32-P4

Integration:

- Declare the assembly functions with extern "C"
- Update the ISA dispatch in dl_base_<op>.cpp to use the SIMD function
- Do not use .section .iram1: this forces the function into IRAM, which is a scarce
resource on ESP chips. IRAM is needed for interrupt handlers and critical system code.
Let the linker place functions in flash by default; use .text section only.
(You may see .section .iram1 in some older files, but it's being phased out.)
- Arguments are passed in registers: a2=output_ptr, a3=input_ptr, a4=args_struct

See references/esp-dl-templates.md for SIMD template examples.
The alignment between esp-dl and esp-ppq is verified through the test framework:
- export_test_values=True in gen_test_cases.py causes the quantized forward pass results to be embedded in the .espdl model file
- Model::test() in dl_model_base.cpp loads these values, runs inference, and compares outputs

If alignment fails after all individual steps pass:

- Check that the espdl_typedef.py op set classification matches esp-dl's behavior
- Check that the computation in executor/op/torch/espdl.py matches

This phase is mandatory: execute it immediately after all tests pass. The operator is
not considered fully delivered until operator_support_state.md reflects the new operator.
Skipping this step leaves the public documentation out of sync with the actual capabilities.
The script tools/ops_test/gen_ops_markdown.py reads op_cfg.toml and produces a markdown
table listing each operator with its supported quantization types and restrictions. The generated
file operator_support_state.md lives in the esp-dl root directory and serves as the public
reference for which operators are available.
Run from the esp-dl project root (this can run outside Docker — it only reads op_cfg.toml):
cd ${ESP_DL_ROOT}
uv run --with toml --with tabulate \
python tools/ops_test/gen_ops_markdown.py \
-c tools/ops_test/config/op_cfg.toml \
-o .
After running, verify the diff looks correct — the new operator should appear in the table
with the right quantization type checkmarks and any restrictions you configured in op_cfg.toml.
Show the user the relevant diff so they can confirm the documentation update.
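To see what kind of table to expect in the diff, the following sketch renders a support table from a dictionary shaped like the parsed op_cfg.toml (`ops_test.<Op>.quant_bits`). It is not the gen_ops_markdown.py implementation; the function name, the column layout, and the yes/no markers are assumptions, and the operator entries are made-up examples.

```python
from typing import Dict, List

COLUMNS: List[str] = ["int8", "int16", "float32"]


def render_support_table(config: Dict) -> str:
    """Render one markdown row per operator from an op_cfg-like dictionary."""
    lines = [
        "| Operator | " + " | ".join(COLUMNS) + " |",
        "|---|" + "---|" * len(COLUMNS),
    ]
    for op_name in sorted(config["ops_test"]):
        quant_bits = config["ops_test"][op_name].get("quant_bits", [])
        marks = ["yes" if column in quant_bits else "no" for column in COLUMNS]
        lines.append(f"| {op_name} | " + " | ".join(marks) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical entries mirroring the structure of op_cfg.toml.
    example = {
        "ops_test": {
            "MyOp": {"quant_bits": ["int8", "int16", "float32"]},
            "Equal": {"quant_bits": ["int8", "int16"]},
        }
    }
    print(render_support_table(example))
```

When reviewing the real diff, the row for the new operator should show the same set of types as its quant_bits entry.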
For a new operator MyOp:
- esp-dl/dl/module/include/dl_module_<op>.hpp: Module class (NEW)
- esp-dl/dl/base/dl_base_<op>.hpp: Base layer header (NEW, if computation needed)
- esp-dl/dl/base/dl_base_<op>.cpp: Base layer impl (NEW, if computation needed)
- esp-dl/dl/module/include/dl_module_creator.hpp: Register deserialize (MODIFY)
- tools/ops_test/torch_ops_test.py or onnx_ops_test.py: Test builder (MODIFY)
- tools/ops_test/config/op_cfg.toml: Test config (MODIFY)
- esp_ppq/quantization/quantizer/EspdlQuantizer.py: Verify op is in quant_operation_types (add if missing)
- esp_ppq/parser/espdl/espdl_typedef.py: Verify op is in correct layout op set (add if missing; export WILL fail otherwise)
- esp_ppq/quantization/quantizer/EspdlQuantizer.py: Special quant rules in create_espdl_quant_config() (if needed)
- esp_ppq/parser/espdl/export_patterns.py: Export patterns: LUT, fusion, weight layout (if needed)
- esp_ppq/IR/base/opdef.py: Custom OpSocket (if needed)
- esp-dl/dl/base/isa/tie728/dl_tie728_<dtype>_<op>.S: TIE728 assembly
- esp-dl/dl/base/isa/esp32p4/dl_esp32p4_<dtype>_<op>.S: P4 assembly
- esp-dl/dl/base/dl_base_<op>.cpp: Update ISA dispatch
- Generate int8 test cases (--bits 8)
- Generate int16 test cases (--bits 16)
- Generate float32 test cases (--float)
- Run gen_ops_markdown.py to regenerate operator_support_state.md (uv run --with toml --with tabulate python tools/ops_test/gen_ops_markdown.py -c tools/ops_test/config/op_cfg.toml -o .)

PyTorch deep learning patterns and best practices for building robust, efficient, and reproducible training pipelines, model architectures, and data loading.