Simple example: I want to count the number of threads that run.
I launched the kernel like this:
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(Width/dimBlock.x, Height/dimBlock.y, 1);
kernel<<< dimGrid, dimBlock>>>(pSum_g);
Here, pSum_g points to an array in global memory allocated with cudaMalloc; its contents are initialized to 0.
In the kernel, I simply did this:
I expected pSum_g to end up holding the number of threads, i.e. Width*Height, but it did not increase as I expected (the value was 1).
Any ideas about what is going on would be appreciated.
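For reference, here is a minimal sketch of what might be going on. The kernel body is not shown in this post, so the sketch assumes it did a plain increment of a single element such as pSum_g[0]; that is a guess, not the actual code. Under that assumption every thread reads the same old value and writes back old + 1, which is a data race, and a final value of 1 would be consistent with it. The kernel names countRacy and countAtomic and the 64x64 size are made up for illustration; atomicAdd is the standard CUDA intrinsic for a safe read-modify-write.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (assumption): every thread does a plain,
// unsynchronized read-modify-write on the same counter. This is a data
// race, so the final value is not defined and is usually far below the
// thread count.
__global__ void countRacy(int *pSum)
{
    pSum[0] = pSum[0] + 1;
}

// Same idea, but the increment is done atomically, so every thread's
// contribution is counted exactly once.
__global__ void countAtomic(int *pSum)
{
    atomicAdd(&pSum[0], 1);
}

int main()
{
    // Assumed sizes; both are multiples of 16 so the integer division
    // below does not drop any threads.
    const int Width = 64, Height = 64;

    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(Width / dimBlock.x, Height / dimBlock.y, 1);

    int *pSum_g = nullptr;
    cudaMalloc(&pSum_g, sizeof(int));
    int sum = 0;

    // Racy version: result depends on scheduling.
    cudaMemset(pSum_g, 0, sizeof(int));
    countRacy<<<dimGrid, dimBlock>>>(pSum_g);
    cudaMemcpy(&sum, pSum_g, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy count   = %d\n", sum);

    // Atomic version: should match the number of launched threads.
    cudaMemset(pSum_g, 0, sizeof(int));
    countAtomic<<<dimGrid, dimBlock>>>(pSum_g);
    cudaMemcpy(&sum, pSum_g, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic count = %d (launched %d threads)\n", sum, Width * Height);

    cudaFree(pSum_g);
    return 0;
}
```

Two side notes on the sketch: the blocking cudaMemcpy waits for the preceding kernel to finish, so no explicit synchronization call is needed before reading the result, and if Width or Height is not a multiple of 16 the grid computed by integer division launches fewer than Width*Height threads.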