Array size upper bound in kernel

#define ARR_LEN (1024*1024*1024)

__global__ void simulated(long long int *arr) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;

  long long int local1[ARR_LEN];
  long long int local2[ARR_LEN];
  long long int local3[ARR_LEN];
  long long int local4[ARR_LEN];
  long long int local5[ARR_LEN];
  long long int local6[ARR_LEN];

  for (int i = 0; i < ARR_LEN; ++i) {
    local1[i] += i*1;
    local2[i] += i*2;
    local3[i] += i*3;
    local4[i] += i*4;
    local5[i] += i*5;
    local6[i] += i*6;
    arr[i] =
        local1[i] + local2[i] + local3[i] + local4[i] + local5[i] + local6[i];
  }
}
I am testing the largest possible array size that I can declare inside a CUDA kernel. However, I am a little confused, because the code snippet above compiles and runs fine even with a very large array size.

In the example shown, my understanding is that each thread declares 48 GB of data (8 GB per array, six arrays). However, none of that shows up in memory usage. So where does that data physically go? Can anyone give me some insight?

When I run your code under compute-sanitizer, I get errors.

The maximum amount of local memory per thread currently cannot exceed 512 KB. Other factors may prevent you from even reaching that upper bound.
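One way to see this at build time is to ask the compiler: ptxas can report each kernel's stack frame and local-memory usage per thread. A sketch, assuming the kernel above is saved as simulated.cu (hypothetical filename):

```shell
# Verbose ptxas output prints per-thread stack frame / local memory usage;
# for this kernel it will far exceed the per-thread local memory limit.
nvcc -Xptxas -v -c simulated.cu

# Running the binary under compute-sanitizer surfaces the runtime errors
# that a plain launch may silently swallow.
compute-sanitizer ./simulated
```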


Thanks for the pointer!

I forgot about the sanity-checking tool.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.