Variable in Kernel

If I define a variable or an array inside a kernel without declaring it as __shared__, where will it be placed? In local memory, global memory, registers, or the L1 cache?

Variables declared without a storage modifier will be kept in registers except when:

  • Not enough registers are available
  • Multi-element data types are accessed with offsets computed at runtime (usually arrays)

In those cases, the data is placed in (or spilled from registers into) local memory. Local memory is just global memory where private storage has been allocated for each thread. On devices that have them, the standard L1 and L2 caches are still used to access local memory.
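To make the two cases concrete, here is a minimal sketch (not taken from any particular codebase) with two kernels that differ only in how they index a small per-thread array. The kernel names and the host helper run_both are hypothetical, made up for this illustration. Whether v[] really ends up in registers or in local memory is the compiler's decision, so treat the comments as the typical outcome rather than a guarantee. Compiling with `nvcc -Xptxas -v` prints the per-kernel resource usage (registers, stack frame or lmem, spill loads and stores; the exact wording depends on the toolkit version), which lets you check what happened for your target architecture.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// v[] is only indexed with constants once the loop is unrolled, so the
// compiler can usually map each element to its own register.
__global__ void static_index(const float *in, float *out)
{
    float v[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        v[i] = in[4 * threadIdx.x + i];
    out[threadIdx.x] = v[0] + v[1] + v[2] + v[3];
}

// v[] is indexed with a value that is only known at run time. Registers
// cannot be addressed indirectly, so the compiler typically places v[]
// in local memory instead.
__global__ void dynamic_index(const float *in, const int *idx, float *out)
{
    float v[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        v[i] = in[4 * threadIdx.x + i];
    out[threadIdx.x] = v[idx[threadIdx.x] & 3];  // "& 3" keeps the index in 0..3
}

// Hypothetical host helper: runs both kernels with a single block of n
// threads and copies the results back to the host. Returns the last error
// reported by the CUDA runtime.
cudaError_t run_both(const float *h_in, const int *h_idx,
                     float *h_sum, float *h_pick, int n)
{
    assert(n > 0 && n <= 1024);  // one block only, to keep the sketch short

    float *d_in = nullptr, *d_out = nullptr;
    int *d_idx = nullptr;
    cudaMalloc(&d_in, 4 * n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, 4 * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);

    static_index<<<1, n>>>(d_in, d_out);
    cudaMemcpy(h_sum, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    dynamic_index<<<1, n>>>(d_in, d_idx, d_out);
    cudaMemcpy(h_pick, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_idx);
    cudaFree(d_out);
    return cudaGetLastError();
}
```

Both kernels compute correct results either way; the difference only shows up in the ptxas report and in performance.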

It sounds like local memory is as slow as global memory for the kernel. Then why do we need such a memory type?

And is there a way to verify that the L1 is being used?

Thank you

Yes, local memory is as slow as global memory, because it is the same physical memory. The reason for local memory is to deal with the finite number of registers available to each thread. CUDA devices (much like CPUs) need scratch space to swap out intermediate values in a complex calculation. “Local memory” is just a label for the global memory locations used to handle this per-thread scratch space. It is allocated by the driver when the kernel starts, based on metadata that the compiler places in the .cubin.
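If you want to see that per-thread scratch size from your own program, the runtime exposes it through cudaFuncGetAttributes, which fills a cudaFuncAttributes struct with fields such as localSizeBytes and numRegs. The sketch below is only an illustration: the kernel uses_scratch and the helper local_bytes_per_thread are hypothetical names, and the value you get back depends on the compiler version, the optimization flags and the target architecture.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: the run-time index into scratch[] typically forces
// the array into local memory rather than registers.
__global__ void uses_scratch(const int *idx, float *out)
{
    float scratch[32];
    for (int i = 0; i < 32; ++i)
        scratch[i] = 0.5f * i + threadIdx.x;
    out[threadIdx.x] = scratch[idx[threadIdx.x] & 31];  // keep index in 0..31
}

// Hypothetical helper: reads the per-thread local memory size (in bytes)
// and the register count that the compiler recorded for uses_scratch.
// num_regs may be null if the caller does not need it.
cudaError_t local_bytes_per_thread(size_t *local_bytes, int *num_regs)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, uses_scratch);
    if (err != cudaSuccess)
        return err;
    *local_bytes = attr.localSizeBytes;
    if (num_regs)
        *num_regs = attr.numRegs;
    return cudaSuccess;
}
```

A non-zero local_bytes means each thread of that kernel gets that much local memory; zero means everything fit in registers for this build.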

I’m not sure if you can force local memory access to bypass the L1. Perhaps one of the ptxas options that control the cache modifiers will work. (I don’t have the reference handy…)