Threads not launched according to execution configuration

Hi, I am developing a profiling application in which I have to record the thread IDs of all threads. My execution configuration is

dim3 Grid(43, 43); 
dim3 Block(16, 16);
function_call<<<Grid, Block>>>(dev_c);

According to this configuration, 473,344 threads should be launched. The right number of threads is launched, but if I inspect the data that the threads manipulate, only 118,336 data elements are updated. According to the profiling results, blockIdx.y only ranges from 0 to 10 during execution, and some of the launched threads have the same overall thread ID. I have tried to reproduce this condition with the following test case:

#define GRID_SIZE 43
#define BLOCK_SIZE 16

__global__ void gpu_mult(int *c) { 

    int row = blockIdx.y * blockDim.y + threadIdx.y; 
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    c[col + row*(GRID_SIZE*BLOCK_SIZE)] = 10;
} 

    /* Part of host code that launches the kernel */
    int size1 = GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE;
    dim3 Grid(43, 43); 
    dim3 Block(16, 16);
    gpu_mult<<<Grid, Block>>>(dev_c);
    
    cudaMemcpy(c, dev_c, size1, cudaMemcpyDeviceToHost);
    
    for (int k =0; k < size1; k++)
    {
        if (c[k] != 10)
        {
            printf("\nAlert: %d, %d, %d\n\n", k, c[k], c[k-1]);
            break;
        }
    }

The output in this case is Alert: 118336, 0, 10
which shows the same behavior as in my application. Can you guys point out if I am missing something?

The size parameter passed to cudaMemcpy is a size in bytes, not a count of elements, so this is incorrect:

cudaMemcpy(c, dev_c, size1, cudaMemcpyDeviceToHost);

it should be:

cudaMemcpy(c, dev_c, size1*sizeof(int), cudaMemcpyDeviceToHost);

If you made a similar error on any of your allocations for c or dev_c, those should be corrected also.

When I make that change, your code runs correctly for me:

$ cat t395.cu
#define GRID_SIZE 43
#define BLOCK_SIZE 16
#include <stdio.h>
__global__ void gpu_mult(int *c) {

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    c[col + row*(GRID_SIZE*BLOCK_SIZE)] = 10;
}
int main(){

    /* Part of host code that launches the kernel */
    int size1 = GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE;
    int *c, *dev_c;
    c = new int[size1];
    cudaMalloc(&dev_c, size1*sizeof(int));
    dim3 Grid(43, 43);
    dim3 Block(16, 16);
    gpu_mult<<<Grid, Block>>>(dev_c);

    cudaMemcpy(c, dev_c, size1*sizeof(int), cudaMemcpyDeviceToHost);

    for (int k =0; k < size1; k++)
    {
        if (c[k] != 10)
        {
            printf("\nAlert: %d, %d, %d\n\n", k, c[k], c[k-1]);
            break;
        }
    }

}
$ nvcc -o t395 t395.cu
$ cuda-memcheck ./t395
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

Thanks Robert. My bad, I completely missed it.