I declared two 2D unsigned char arrays in device code:

__global__ void fun1(void)
{
    unsigned char block1[16][16];   // current block (curBlock)
    unsigned char block2[19][19];   // search window (prevBlock)

    getBlockDataFromTexMemory(block1, 16, 16);
    getBlockDataFromTexMemory(block2, 19, 19);

    uint sad = 0;

    for (int by = 0; by < 4; by++) {
        for (int bx = 0; bx < 4; bx++) {
            for (int k = 0; k < 16; k++) {
                for (int l = 0; l < 16; l++) {
                    sad += abs(block1[k][l] - block2[k + by][l + bx]);
                }
            }
        }
    }
}

My question is: can all threads in the same block access the same address in device memory?

My data comes from 8-bit images. In this case, should I use the float data type, or is it OK to use unsigned char?

Thanks in advance.

The code you have shown doesn’t do any shared memory loads, and in fact I think in a non-debug build it would be mostly optimized away by the compiler.

Thanks for your reply.
By the way, what do you mean by “in a non-debug build it would be mostly optimised away by the compiler”?

Does that mean there is no problem with my code?

I can’t tell anything about the correctness of your code.

If you compile this code without the debug switch (-G), then I would expect that the compiler would create a d_motionEstimation kernel that has essentially no machine code in it at all. Your kernel definition does not modify any global state, so the compiler is free to eliminate the code you have written for it – it doesn’t do anything that is visible outside of the kernel itself.

I see. Thanks. But I have seen this problem when running in CUDA debugging mode.

Is this an Nsight bug?
I am so confused.

Yes, in debugging mode, the compiler will generate code for the kernel call.

But even in that case, your code that you have shown here does not do any shared memory loads. So I don’t believe a shared memory load error could arise from this code.