According to this configuration, 473,344 threads should be launched. The right number of threads is launched, but when I inspect the data that each thread is supposed to manipulate, only 118,336 data elements are updated. According to the profiling results, blockIdx.y only ranges from 0 to 10 during execution, and some of the launched threads appear to have the same overall thread ID. I have tried to reproduce this condition with the following test case:
#define GRID_SIZE 43
#define BLOCK_SIZE 16
__global__ void gpu_mult(int *c) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    c[col + row * (GRID_SIZE * BLOCK_SIZE)] = 10;
}
/* Part of host code that launches the kernel */
int size1 = GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE;
dim3 Grid(43, 43);
dim3 Block(16, 16);
gpu_mult<<<Grid, Block>>>(dev_c);
cudaMemcpy(c, dev_c, size1, cudaMemcpyDeviceToHost);
for (int k = 0; k < size1; k++)
{
    if (c[k] != 10)
    {
        printf("\nAlert: %d, %d, %d\n\n", k, c[k], c[k-1]);
        break;
    }
}
The output in this case is Alert: 118336, 0, 10, which shows the same behavior as my real code. Can anyone point out what I am missing?