Is a thread-private array stored in registers or in local memory?

For example:

__global__ void kernel()
{
   char thread_data[N];
   ...
}

Is thread_data stored in registers or in local memory?
My understanding is:

  • if thread_data is smaller than 8 bytes (N < 8), it will be stored in a register.
  • if N = 16, it is stored in two registers.
  • if the register file is not large enough for all threads, thread_data will be spilled to local memory.

Is my understanding right?

There is not a simple set of rules like this.

It is up to the compiler and it depends on what else is going on in your code.

Registers in GPUs are 32 bits wide, so 4 bytes. If N = 16, there is no way the array could be stored in 2 registers; if it occupied registers at all, it would probably occupy 16 of them, one per element. In my experience, the compiler never packs multiple variables into a single register, except for vector types. Furthermore, registers cannot be indexed, so if you use variable indexing into such a local array (indices not resolvable to constants at compile time), it will not be stored in registers at all, even if there is plenty of register space available.
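For illustration, here is a minimal sketch of that last point (the kernel names, the array size of 4, and the way the runtime index j is computed are assumptions for the example, not anything from the question):

__global__ void static_index(const int *in, int *out)
{
    int thread_data[4];
    // After unrolling, every index into thread_data is a compile-time
    // constant, so the compiler is free to keep the array in registers.
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        thread_data[i] = in[threadIdx.x * 4 + i];
    out[threadIdx.x] = thread_data[0] + thread_data[3];
}

__global__ void dynamic_index(const int *in, int *out)
{
    int thread_data[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        thread_data[i] = in[threadIdx.x * 4 + i];
    // j is only known at run time; registers are not indexable, so this
    // access typically forces the whole array into local memory.
    int j = in[threadIdx.x] & 3;
    out[threadIdx.x] = thread_data[j];
}

Compiling with -Xptxas -v (or inspecting the SASS) will show the difference in register and local-memory usage between the two.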

Hi Robert,

One more question.
In the cub and thrust libraries, there is a usage pattern like this:

template <int BLOCK_THREADS, int N>
__global__ void kernel(const int *in)
{
    __shared__ int smem[BLOCK_THREADS * N];
    int thread_data[N];
    // load block data into shared memory
    for (int i = 0; i < N; ++i)
        smem[i * BLOCK_THREADS + threadIdx.x] =
            in[blockIdx.x * BLOCK_THREADS * N + i * BLOCK_THREADS + threadIdx.x];
    // block sync
    __syncthreads();
    // copy a range of shared memory to thread_data
    for (int i = 0; i < N; ++i)
        thread_data[i] = smem[threadIdx.x * N + i];
    // every thread handles its own thread_data
}

My question:
Since thread_data is stored in local memory, isn't it more efficient for a thread to access its own part of shared memory than to access its thread_data? Why do we need this local array?

Yes, using shared memory can be more efficient than using local memory. Just remember that shared memory is a limited resource and if one block uses all available shared memory on an SM, your kernel might run with low occupancy.

We need local memory because some problems require more resources than are available on-chip.
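For reference, here is a minimal sketch of the cub pattern being discussed, using cub::BlockLoad (BLOCK_THREADS and ITEMS_PER_THREAD are illustrative template parameters, and d_in is assumed to be sized for the whole grid):

#include <cub/cub.cuh>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void kernel(const int *d_in)
{
    // BLOCK_LOAD_TRANSPOSE stages the block's data through shared memory
    // so the global loads are coalesced, then deposits ITEMS_PER_THREAD
    // items into each thread's private array.
    using BlockLoad = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                     cub::BLOCK_LOAD_TRANSPOSE>;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    int thread_data[ITEMS_PER_THREAD];
    BlockLoad(temp_storage).Load(
        d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD, thread_data);

    // each thread now works on its own thread_data; all of cub's accesses
    // use compile-time-constant indices, so it can stay in registers
}

Note that because cub only indexes thread_data with compile-time constants, the per-thread array can live in registers rather than local memory, which is a large part of why the library copies data out of shared memory into per-thread arrays in the first place.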

It is always smart to profile your code to determine where the bottlenecks are.
https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/