Why is a template parameter beneficial for the compiler and local memory?

A friend told me:

“The template line specifies that each thread processes 8 data items through a compile-time constant, avoiding the Local Memory issue caused by the inability to use registers with runtime constants (since registers cannot be addressed).”

Just like this:

template <int kNumElemPerThread = 8>
__global__ void vector_add_local_tile_multi_elem_per_thread_half(
    half *z, int num, const half *x, const half *y, const half a, const half b, const half c) {
  using namespace cute;

Hmm, why? I do not understand. What does local memory have to do with this?

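Local memory is memory that is local to the thread. If a per-thread array is indexed with values the compiler cannot resolve at compile time, it cannot be kept in registers (registers are not addressable) and is placed in local memory instead, which is backed by device memory and is much slower. With a compile-time element count the loops can be fully unrolled, so every index becomes a constant and the array can stay in registers. See the CUDA C++ Best Practices Guide.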
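To make the difference concrete, here is a minimal sketch, not the original kernel. It uses float instead of half and drops the cute dependency to stay self-contained. The names add_static, add_dynamic, and kMaxElemPerThread are made up for illustration. Whether the runtime-count version really ends up in local memory depends on the compiler and architecture, so treat that as something to check rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compile-time count: both loops can be fully unrolled, every buf[i] index
// becomes a constant, so buf is a candidate for living entirely in registers.
template <int kNumElemPerThread = 8>
__global__ void add_static(float *z, const float *x, const float *y, int num) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * kNumElemPerThread;
  float buf[kNumElemPerThread];
#pragma unroll
  for (int i = 0; i < kNumElemPerThread; ++i) {
    int idx = base + i;
    buf[i] = (idx < num) ? x[idx] + y[idx] : 0.0f;
  }
#pragma unroll
  for (int i = 0; i < kNumElemPerThread; ++i) {
    int idx = base + i;
    if (idx < num) z[idx] = buf[i];
  }
}

// Hypothetical upper bound for the runtime variant (an assumption of this sketch).
constexpr int kMaxElemPerThread = 8;

// Runtime count: the trip count is unknown at compile time, so buf[i] is
// dynamically indexed and the compiler may have to place buf in local memory.
__global__ void add_dynamic(float *z, const float *x, const float *y, int num,
                            int num_elem_per_thread) {
  if (num_elem_per_thread > kMaxElemPerThread) return;  // buf would overflow
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * num_elem_per_thread;
  float buf[kMaxElemPerThread];
  for (int i = 0; i < num_elem_per_thread; ++i) {
    int idx = base + i;
    buf[i] = (idx < num) ? x[idx] + y[idx] : 0.0f;
  }
  for (int i = 0; i < num_elem_per_thread; ++i) {
    int idx = base + i;
    if (idx < num) z[idx] = buf[i];
  }
}

// Count elements where z[i] differs from the expected sum x[i] + y[i] = 3 * i.
static int count_mismatches(const float *z, int num) {
  int bad = 0;
  for (int i = 0; i < num; ++i) {
    if (z[i] != 3.0f * i) ++bad;
  }
  return bad;
}

int main() {
  const int num = 1000;
  const int elems = 8;
  float *x, *y, *z;
  cudaMallocManaged(&x, num * sizeof(float));
  cudaMallocManaged(&y, num * sizeof(float));
  cudaMallocManaged(&z, num * sizeof(float));
  for (int i = 0; i < num; ++i) {
    x[i] = 1.0f * i;
    y[i] = 2.0f * i;
    z[i] = -1.0f;
  }

  const int threads = 64;
  const int per_block = threads * elems;
  const int blocks = (num + per_block - 1) / per_block;

  add_static<elems><<<blocks, threads>>>(z, x, y, num);
  cudaDeviceSynchronize();
  int bad_static = count_mismatches(z, num);

  for (int i = 0; i < num; ++i) z[i] = -1.0f;
  add_dynamic<<<blocks, threads>>>(z, x, y, num, elems);
  cudaDeviceSynchronize();
  int bad_dynamic = count_mismatches(z, num);

  printf("static mismatches: %d, dynamic mismatches: %d\n", bad_static, bad_dynamic);

  cudaFree(x);
  cudaFree(y);
  cudaFree(z);
  return (bad_static == 0 && bad_dynamic == 0) ? 0 : 1;
}
```

Both kernels compute the same result. The difference is only in where buf can be stored. Compiling with nvcc -Xptxas -v prints per-kernel resource usage, including stack frame and spill sizes, which is one way to see whether a kernel uses local memory on your target architecture.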
