I have this program:
cudaMemcpy(h_out, d_out, sizeof(short)*64*4, cudaMemcpyDeviceToHost);
for (int i=0;i<4;i++)
and my kernel is:
__global__ void readTexels(short* d_out)
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int j = blockIdx.y*blockDim.y + threadIdx.y;
What I want is 8 blocks with 8 rows and 4 columns each, so that the total thread count is 256.
But I am getting only 76 threads, as it seems that only up to d_out, value -27 is assigned.
I think I have done something wrong while declaring my dimgrid and dimblock dimensions.
Please help me correct this.
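
For reference, here is a minimal sketch of one launch configuration that matches the layout described above (8 blocks, each 8 rows by 4 columns, 256 threads in total). It is not the original program. The kernel name `writeIndex`, the helper `flatIndex`, and the choice to place the 8 blocks side by side along x are assumptions made for illustration. Each thread stores its own row-major index, so the host can count how many threads actually wrote a value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: row-major flat index of a 2D thread coordinate.
__host__ __device__ inline unsigned int flatIndex(unsigned int i, unsigned int j,
                                                  unsigned int width)
{
    return j * width + i;
}

// Hypothetical kernel: every thread stores its own flat index in d_out.
__global__ void writeIndex(short* d_out, unsigned int width, unsigned int height)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    if (i < width && j < height) {
        unsigned int k = flatIndex(i, j, width);
        d_out[k] = (short)k;
    }
}

int main()
{
    const dim3 dimBlock(4, 8);  // x = 4 columns, y = 8 rows: 32 threads per block
    const dim3 dimGrid(8, 1);   // assumption: the 8 blocks sit side by side along x
    const unsigned int width  = dimGrid.x * dimBlock.x;  // 32 columns overall
    const unsigned int height = dimGrid.y * dimBlock.y;  // 8 rows overall
    const unsigned int count  = width * height;          // 256 threads overall

    short h_out[256];
    short* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(short) * count);
    cudaMemset(d_out, 0xFF, sizeof(short) * count);  // every element starts as -1

    writeIndex<<<dimGrid, dimBlock>>>(d_out, width, height);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out, d_out, sizeof(short) * count, cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Count the elements that hold the index a thread should have written.
    unsigned int written = 0;
    for (unsigned int k = 0; k < count; ++k)
        if (h_out[k] == (short)k) ++written;
    printf("threads that wrote a value: %u of %u\n", written, count);
    return 0;
}
```

The point of the sketch is that the output index has to combine both `i` and `j` (here `j * width + i`). Indexing `d_out` with only one of the two coordinates would make several threads write to the same element, which could look like fewer threads having run.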