I’m starting to learn about CUDA.
It seems that each thread of a kernel executes with an XYZ index.
Does that mean that if I execute myfunction<<<1,10>>>(parameters) it executes 1000 threads each with its own unique XYZ index?
I guess I’m wrong, because that would be inefficient. For example, if I’m working on a two-dimensional array where I only need to refer to threadIdx.x and threadIdx.y, a 10x10 array needs only 100 threads, not 1000.
With CUDA, you can index threads in 1D, 2D, or 3D.
Calling this:

kernel<<<1, 10>>>();

launches 1 block with 10 threads per block, so only 10 threads will actually run.
Hopefully this will help: Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) - Stack Overflow
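To make the 10x10 case from the question concrete, here is a minimal CUDA sketch (the kernel name `fill` and the array are mine, purely for illustration) that launches a single block of 10x10 threads, so each of the 100 threads handles exactly one element:

```cuda
#include <cstdio>

// Each thread handles one element of a 10x10 array.
// threadIdx.x / threadIdx.y are that thread's unique 2D
// coordinates within its block.
__global__ void fill(int *a, int width)
{
    int x = threadIdx.x;               // column, 0..9
    int y = threadIdx.y;               // row,    0..9
    a[y * width + x] = y * width + x;  // flatten 2D index to 1D offset
}

int main()
{
    const int W = 10;
    int *d_a;
    cudaMalloc(&d_a, W * W * sizeof(int));

    dim3 block(W, W);  // 10 x 10 = 100 threads per block
    dim3 grid(1);      // a single block
    fill<<<grid, block>>>(d_a, W);  // 1 block * 100 threads = 100 threads total

    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}
```

Note that dimensions you leave out of a dim3 default to 1, so `dim3 block(W, W)` is really (10, 10, 1).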
“It seems that each thread of a kernel executes with an XYZ index.”
Almost — all threads share the same grid and block dimensions, but each thread executes with its own unique XYZ index: its threadIdx within the block, plus its block’s blockIdx within the grid.
myfunction<<<1,10>>>(parameters), read in context, implies a grid of (1, 1, 1) and a block of (10, 1, 1).

Grid size multiplied by block size gives 1 × 10 = 10 threads, not 1000.