Compile-time optimization guidance for HIP/C++ kernels using opus.hpp. Use when writing or reviewing OPUS kernels, analyzing compile time, reducing template instantiation overhead, or optimizing hipcc build performance.
Techniques for reducing HIP/C++ kernel compile time when using opus.hpp. These patterns were developed while optimizing a GQA flash attention kernel from 4.8s to 1.5s (70% reduction) in device-only compilation.
For kernel development with OPUS, use these headers from csrc/include/:
opus/opus.hpp — the OPUS template library + device intrinsic wrappers. This is the only include needed for device code. Provides opus::thread_id_x(), opus::block_id_x(), opus::sync_threads(), opus::warp_all(), etc.opus/hip_minimal.hpp — minimal HIP host-side only declarations (dim3, hipMalloc, hipLaunchKernelGGL, etc.). Use on the host pass instead of <hip/hip_runtime.h>.hipcc my_kernel.cu -I<aiter_root>/csrc/include -D__HIPCC_RTC__ -std=c++20 -O3 --offload-arch=gfx950
| HIP runtime | opus:: wrapper | LLVM builtin |
|---|---|---|
threadIdx.x | opus::thread_id_x() | __builtin_amdgcn_workitem_id_x() |
blockIdx.x | opus::block_id_x() | __builtin_amdgcn_workgroup_id_x() |
blockDim.x | opus::block_size_x() | __builtin_amdgcn_workgroup_size_x() |
gridDim.x * blockDim.x | opus::grid_size_x() | __builtin_amdgcn_grid_size_x() |
__syncthreads() | opus::sync_threads() | __builtin_amdgcn_s_barrier() |
__all(pred) | opus::warp_all(pred) | — |
If anything is missing, contact the maintainer ([email protected]) for adding support.
This is the single most impactful technique. hipcc always performs two compilation passes on every .hip/.cu file — one for the host (x86_64) and one for the device (AMDGPU). The heavy opus.hpp template library is only needed on the device side, but without a guard, hipcc parses it on BOTH passes, doubling the frontend cost.
Always structure your kernel files like this:
// my_kernel.cu
#ifdef __HIP_DEVICE_COMPILE__
// ── Device pass: include opus.hpp and define kernels ──
#include "opus/opus.hpp"
__global__ __launch_bounds__(256, 2)
void my_kernel(const float* src, float* dst, int n) {
// ... opus layout, load, store, MMA, etc.
}
#else
// ── Host pass: minimal declarations + launcher only ──
#include "opus/hip_minimal.hpp"
__global__ void my_kernel(const float* src, float* dst, int n); // declaration only
extern "C" void run_my_kernel(const void* d_src, void* d_dst, int n) {
dim3 grid((n + 255) / 256), block(256);
hipLaunchKernelGGL(my_kernel, grid, block, 0, 0,
(const float*)d_src, (float*)d_dst, n);
hipDeviceSynchronize();
}
#endif
Why this works:
opus.hpp + kernel definitions — full template expansionopus/hip_minimal.hpp (~70 lines) + kernel declaration + launch wrapperextern "C" launcher can be called from Python via ctypes.CDLL — no pybind11/torch extension neededCompile flags:
hipcc my_kernel.cu \
-I<aiter_root>/csrc/include \
-D__HIPCC_RTC__ \
-std=c++20 -O3 -ffast-math \
--offload-arch=gfx950 \
-fPIC -shared -o my_kernel.so
<hip/hip_runtime.h> with opus/hip_minimal.hppStandard <hip/hip_runtime.h> expands to ~190K preprocessed lines. The aiter-provided opus/hip_minimal.hpp (~80 lines) declares only what's needed — dim3, hipLaunchKernelGGL, hipMalloc/hipFree, __launch_bounds__, __shared__/__device__/__global__, and __all(). Use AMDGCN compiler builtins for device intrinsics:
int tid = __builtin_amdgcn_workitem_id_x(); // threadIdx.x
int bid = __builtin_amdgcn_workgroup_id_x(); // blockIdx.x
int bsz = __builtin_amdgcn_workgroup_size_x(); // blockDim.x
__builtin_amdgcn_s_barrier(); // __syncthreads()
-D__HIPCC_RTC__ to suppress implicit includesEven with minimal headers, hipcc's implicit __clang_hip_runtime_wrapper.h pulls in <cmath>, <cstdlib>, etc. The -D__HIPCC_RTC__ flag skips these. Provide #define INFINITY __builtin_huge_valf() if needed.
The C++ binding layer is often the biggest compile cost. The extern "C" + ctypes.CDLL pattern from Section 0 eliminates it entirely:
| Binding | Compile time |
|---|---|
torch CUDAExtension | ~21s |
| pybind11 + Ninja | ~4.2s |
ctypes (extern "C", see Section 0) | ~0.4s |
static_for where compile-time indices aren't neededEach iteration of static_for<N>([&](auto I){...}) creates a unique lambda instantiation. For large N, this dominates compile time. Replace with plain for loops when the loop body doesn't need compile-time I:
// SLOW: N unique lambda instantiations
static_for<N>([&](auto I) {
r[I.value] = load<vec>(offsets[I.value]);
});
// FAST: 1 instantiation, compiler unrolls identically
for (index_t i = 0; i < N; i++) {
r[i] = load<vec>(offsets[i]);
}
When you still need static_for: If the body uses I as a template argument (e.g., number<I.value>{} for set_slice, slice, or immediate-offset _tr_load<vec, off>), you must keep static_for.
flat_to_coords instead of compile-time multi-index decompositionlayout_to_offsets converts a layout into a precomputed offset array using a runtime loop with flat_to_coords, which produces tuple<index_t, ...> (one type for all iterations) instead of tuple<number<a>, number<b>, ...> (unique type per iteration):
// SLOW: N unique coord_to_linear instantiations (one per multi-index combination)
static_ford(issue_space_vec, [&](auto... ids) {
offsets[u_linear(ids...)] = u(ids...);
});
// FAST: 1 coord_to_linear instantiation (all iterations use tuple<index_t, ...>)
for (index_t i = 0; i < num_issues; i++) {
offsets[i] = u(flat_to_coords(i, make_index_seq<ndim>{}, issue_space_vec));
}
Repeated constexpr evaluations in multiple methods trigger re-evaluation in each:
// SLOW: y_shape_a() + reduce_tuple_mul evaluated in every operator()/step_k() overload
constexpr auto a_len = get<0>(reduce_tuple_mul(MMA::y_shape_a()));
// FAST: cached once as class member
static constexpr index_t mma_a_len = get<0>(reduce_tuple_mul(MMA::y_shape_a())).value;
__builtin_convertvector for type conversionReplaces N-element element-by-element cast_impl pack expansion with a single LLVM intrinsic:
// SLOW: 64-element pack expansion
return vector_return_type<D, decltype(cast<D>(get<Is>(s)))...>{cast<D>(get<Is>(s))...};
// FAST: single builtin call
return __builtin_convertvector(s, vector_t<D, size<S>()>);
__builtin_shufflevector for vector slice/concatReplaces element-by-element make_vector(get<Is>(c)...) with a single shuffle:
// SLOW: N-element braced init
return make_vector(get<Is>(c)...);
// FAST: single shuffle (returns GCC-style vector, bit_cast to ext_vector_type)
using R = vector_t<scalar_type, sizeof...(Is)>;
return __builtin_bit_cast(R, __builtin_shufflevector(c, c, Is...));
concat_tuple with direct indexingconcat_tuple creates intermediate tuple types when concatenating >4 tuples. Replace with direct per-element computation:
// unfold_x_stride: instead of concat_tuple(per_group_results...)
// compute each element's stride directly via unfold_x_stride_at<J>()
// pickup_shape: instead of concat_tuple(conditional<match, tuple<T>, tuple<>>{}...)
// build a filtered index sequence, then make_tuple(get<filtered_indices>(Shape{})...)
// flatten_tuple: instead of concat_tuple(explode_tuple(get<Is>(t))...)
// directly index as get<local>(get<group>(t)) via flatten_at<T, J, GS>()
std::common_type// SLOW: triggers recursive std::common_type<D, D, D, ..., D> with 64 types
return vector_return_type<void, decltype(cast<D>(get<Is>(s)))...>{...};
// FAST: D is already known, skip common_type entirely
return vector_return_type<D, decltype(cast<D>(get<Is>(s)))...>{...};
// reduce_tuple_mul for tuple<number<>...>: fold expression instead of recursive reduction
template<typename... Ns, std::enable_if_t<(is_constant_v<Ns> && ...), bool> = true>
constexpr auto reduce_tuple_mul(const tuple<Ns...>&) { return tuple<number<(Ns::value * ...)>>{}; }
One file with 14 MFMA template instantiations (~3.9s) bottlenecks parallel builds. Split into per-type files (f16/f32/f8) to balance workload:
test_mfma.cu (3.9s) -> test_mfma_f16.cu (0.9s) + test_mfma_f32.cu (0.5s) + test_mfma_f8.cu (0.9s)
hipcc --genco for device-only compilation when launching from PythonEliminates the host pass entirely. Python loads the .hsaco via hipModuleLoad and launches with hipModuleLaunchKernel (HIP driver API).
-ftime-trace for profilinghipcc kernel.cc --cuda-device-only -c -o /dev/null \
-Xclang -ftime-trace=trace.json
Analyze with chrome://tracing or a script:
import json
with open('trace.json') as f: data = json.load(f)
events = data.get('traceEvents', data)
inst = [(e['dur'], e['args']['detail']) for e in events
if e.get('name') == 'InstantiateFunction' and 'dur' in e]
inst.sort(key=lambda x: -x[0])
for dur, name in inst[:20]:
print(f"{dur/1000:8.1f}ms {name[:100]}")
| Technique | Typical savings | Where applied |
|---|---|---|
Separate device/host code (__HIP_DEVICE_COMPILE__ guard) | ~50% total | All .cu/.hip files — always do this first |
Runtime for loops in load/store/MMA | 30-60% frontend | buffer_view::load/store, tiled_mma_adaptor::operator() |
Runtime flat_to_coords | 40-50% frontend | layout_to_offsets |
__builtin_convertvector | 5-10% frontend | cast for vectors >16 elements |
__builtin_shufflevector | 3-5% frontend | slice_impl for vectors |
| Cache constexpr members | 10-15% frontend | layout_load_traits, mma_a/b/c_len |
| Direct indexing (bypass concat_tuple) | 5-10% frontend | unfold_x_stride, pickup_shape, flatten_tuple |
-D__HIPCC_RTC__ | ~25% per-file | Compiler flags |
hipcc --genco | ~15% per-file | Python-launched kernels |
| Split large TU files | Better parallelism | Test suites, multi-kernel builds |