Suppose I launch 1024 blocks of 1024 threads each, and my GPU has 256 cores. Given this code:

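int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
// Accumulate a 21 x 21 neighborhood (441 elements) around (row, col).
for (int i = -10; i <= 10; i++) {
    for (int j = -10; j <= 10; j++) {
        // Index arithmetic kept as posted; a row-major 2D array would
        // normally scale the row term by the row width.
        int idx = (row + j) + (col + i);
        sum += inout[idx];
    }
}
int index = row + col;
output[index] = sum / 441.0f;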

Does each core calculate one “sum” (the variable sum inside the loops)? In other words, does my GPU calculate 256 “sum” values at the same time?

A CUDA core is really not like a CPU core. A CUDA core is basically a single-precision floating point multiply-add unit. It supports basically 3 machine language instructions: FADD, FMUL, and FMA. It doesn’t do anything else.

A CUDA core is undoubtedly being used to process the FADD instruction associated with this line of code:

sum += inout[idx];

None of the rest of the code you have shown uses a CUDA core, with the exception of the last line.

It can’t be stated definitively that the GPU calculates 256 “sum” values at the same time; however, that is a reasonable statement of the peak theoretical throughput of the machine.
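To put rough numbers on that peak figure, here is a back-of-the-envelope sketch in host-only code. It uses the launch sizes and core count from the question and assumes each CUDA core retires one FADD per clock at peak. It ignores memory loads, index arithmetic, and scheduling stalls, so the result is a lower bound rather than a prediction. The constant and function names are made up for illustration.

```cuda
#include <cstdio>

// Rough estimate only. Assumption: one FADD per CUDA core per clock at peak.
// Loads, integer index math, and stalls are ignored entirely.
constexpr long long kBlocks          = 1024;     // from the question
constexpr long long kThreadsPerBlock = 1024;     // from the question
constexpr long long kCores           = 256;      // from the question
constexpr long long kAddsPerThread   = 21 * 21;  // i and j each run -10..10

// Total number of threads in the launch.
constexpr long long totalThreads() { return kBlocks * kThreadsPerBlock; }

// Total number of "sum += ..." additions across all threads.
constexpr long long totalAdds() { return totalThreads() * kAddsPerThread; }

// Fewest clocks needed if every core issued an FADD on every clock.
constexpr long long minAddCycles() { return totalAdds() / kCores; }

int main() {
    printf("threads launched         : %lld\n", totalThreads());
    printf("FADDs in total           : %lld\n", totalAdds());
    printf("FADDs per clock (peak)   : %lld\n", kCores);
    printf("lower bound on FADD clocks: %lld\n", minAddCycles());
    return 0;
}
```

Under those assumptions the launch has 1,048,576 threads performing 462,422,016 additions, so even at peak the additions alone need at least 1,806,336 clocks. At any instant only 256 of those additions can be in progress, and they need not belong to 256 distinct “sum” variables in lockstep.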

So how can I find out how many cores are available while my code is running, and how many cores my code runs on?

I don’t know why that matters. From a high-level perspective, your code is running on all the CUDA cores in your GPU.

Perhaps I don’t understand the question. Perhaps you might want to learn how to use one of the profilers. Or perhaps you might want to study the CUDA deviceQuery sample code.
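As a starting point, a stripped-down sketch in the spirit of deviceQuery might look like the following. It uses only the CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties. Note that the runtime does not report a "CUDA cores" number directly; the deviceQuery sample derives that figure from the compute capability with a lookup table, which is left out here. The helper maxResidentThreads is a made-up name for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads resident on the GPU at once.
long long maxResidentThreads(int smCount, int maxThreadsPerSM) {
    return static_cast<long long>(smCount) * maxThreadsPerSM;
}

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability   : %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("  warp size            : %d\n", prop.warpSize);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  max threads per SM   : %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  max resident threads : %lld\n",
               maxResidentThreads(prop.multiProcessorCount,
                                  prop.maxThreadsPerMultiProcessor));
    }
    return 0;
}
```

Something like this can be built with nvcc (for example, nvcc query.cu -o query, where the file name is arbitrary). The SM count multiplied by the cores per SM for your architecture gives the total core count that vendors usually quote.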

I am writing a paper about parallel processing, which is why I wanted to know.