Array size upper bound in kernel

#define ARR_LEN (1024*1024*1024)

__global__ void simulated(long long int *arr) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;

  long long int local1[ARR_LEN];
  long long int local2[ARR_LEN];
  long long int local3[ARR_LEN];
  long long int local4[ARR_LEN];
  long long int local5[ARR_LEN];
  long long int local6[ARR_LEN];

  for (int i = 0; i < ARR_LEN; ++i) {
    local1[i] += i*1;
    local2[i] += i*2;
    local3[i] += i*3;
    local4[i] += i*4;
    local5[i] += i*5;
    local6[i] += i*6;
    arr[i] =
        local1[i] + local2[i] + local3[i] + local4[i] + local5[i] + local6[i];
  }
}
I am testing the largest possible array size that I can declare inside a CUDA kernel. However, I am a little confused, because the code snippet above compiles and runs fine even with a very large array size.

In the example shown, my understanding is that each thread declares 48 GB of data (8 GB per array, six arrays). However, none of that shows up in memory usage. So where does that data physically go? Can anyone give me some insight?

When I run your code under compute-sanitizer, I get errors.

The maximum amount of local memory per thread currently cannot exceed 512 KB. Other factors may prevent you from even reaching that upper bound.
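One way to see this at build time is to ask the compiler: ptxas can report each kernel's stack frame and local-memory usage per thread. A sketch, assuming the kernel above is saved as simulated.cu (hypothetical filename):

```shell
# Verbose ptxas output prints per-thread stack frame / local memory usage;
# for this kernel it will far exceed the per-thread local memory limit.
nvcc -Xptxas -v -c simulated.cu

# Running the binary under compute-sanitizer surfaces the runtime errors
# that a plain launch may silently swallow.
compute-sanitizer ./simulated
```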


Thanks for the pointer!

I forgot about the sanity-checking tool.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.