Suppose I have 1024 blocks on the GPU, each block with 1024 threads, and my GPU has 256 CUDA cores. Here is my code:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
for (int i = -10; i <= 10; i++)
    for (int j = -10; j <= 10; j++)
    {
        int idx = (row + j) * width + (col + i); // width: the image's row pitch
        sum += inout[idx];
    }
int index = row * width + col;
output[index] = sum / 441; // average over the 21 x 21 = 441-sample window
Does each core calculate one “sum” (the variable sum inside the loops)? In other words, does my GPU calculate 256 “sum” values at the same time?
A CUDA core is really not like a CPU core. A CUDA core is basically a single-precision floating-point multiply-add unit. It supports essentially three machine-language instructions: FADD, FMUL, and FMA. It doesn’t do anything else.
A CUDA core is undoubtedly being used to process the FADD instruction associated with this line of code:
sum += inout[idx];
None of the other code you have shown uses a CUDA core (with the exception of the last line).
It can’t be definitively stated that “at the same time my GPU calculates 256 sums”; however, that is a reasonable statement of the peak theoretical throughput of the machine.
I don’t know why that matters. From a high-level perspective, your code is running on all the CUDA cores in your GPU.
Perhaps I don’t understand the question. You might want to learn how to use one of the profilers, or study the CUDA deviceQuery sample code.
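For reference, a minimal deviceQuery-style sketch (host-side CUDA, error checking omitted) that reports the quantities relevant to this question:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device:                %s\n", prop.name);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Clock rate (kHz):      %d\n", prop.clockRate);
    return 0;
}
```

Note that the CUDA-core count per SM is not a field of cudaDeviceProp; the deviceQuery sample derives it from the compute capability via a lookup table.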