I posted this in another section of the forum, but I'm not sure it was the right place, so I'm posting it here again. Sorry for the duplicate.
I ran a very simple test on a Tesla recently and got a very strange result. As you can see in my code, I do a simple matrix add on the GPU using float4, decomposing the whole matrix into 64×64 tiles. When I run it on the Tesla, it gives an incorrect result once or twice out of every ten runs; under emulation mode, it always runs correctly.
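For reference, here is a minimal sketch of the kind of kernel I mean (the names `addKernel` and `TILE` are illustrative and the attached test.cu may differ; it assumes row-major matrices whose width is a multiple of 4 so float4 loads are aligned):

```cuda
#include <cuda_runtime.h>

#define TILE 64  // each block handles one 64x64 tile

// Element-wise add using float4 vector loads/stores.
// width4 = matrix width in float4 elements (i.e. width / 4).
__global__ void addKernel(const float4 *a, const float4 *b,
                          float4 *c, int width4)
{
    // One thread per float4: 16 float4s cover a 64-wide tile row.
    int x = blockIdx.x * (TILE / 4) + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int i = y * width4 + x;

    float4 va = a[i], vb = b[i];
    c[i] = make_float4(va.x + vb.x, va.y + vb.y,
                       va.z + vb.z, va.w + vb.w);
}
```

Note that in a kernel shaped like this, every thread reads and writes only its own element, so no inter-thread synchronization should be required at all.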
At first I thought the problem might be synchronization, so I added __syncthreads() after each statement of the add kernel, but the problem persists. This makes me wonder whether it could be a hardware scheduling bug. Could anybody help?
By the way, what is cudaThreadSynchronize() for? Under what circumstances does it take effect? Shouldn't all the threads have finished their work automatically by the time we return to host code? Thanks. test.cu (2.39 KB)
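For what it's worth, here is a minimal self-contained sketch of when cudaThreadSynchronize() matters (the kernel here is just a placeholder): kernel launches are asynchronous, so control returns to the host immediately while the kernel may still be running. An explicit synchronize is needed before, e.g., stopping a CPU-side timer or picking up asynchronous kernel errors. (In later CUDA releases this call is named cudaDeviceSynchronize().)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to illustrate launch asynchrony.
__global__ void scaleKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    scaleKernel<<<(n + 255) / 256, 256>>>(d, n);
    // The launch above returns at once; the kernel may still be running.

    // Block the host until all preceding device work has finished.
    cudaThreadSynchronize();  // cudaDeviceSynchronize() in newer toolkits
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d);
    return 0;
}
```

One caveat: cudaMemcpy on the default stream implicitly waits for prior kernels to finish, so results copied back that way are already complete even without an explicit synchronize.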
Hi, thanks for your reply. Actually, I can't be sure the hardware is OK, because I'm running this on my college's remote HPC server over PuTTY. I would expect it to be fine, though, since a professional team maintains it. What kinds of hardware problems could cause this?
To be specific, I compiled the code with [font=“Courier New”]nvcc test.cu -I/opt/cuda-sdk/common/inc -L/opt/cuda-sdk/common/lib/linux -L/opt/cuda-sdk/lib64 -lcutil -lcudart -o test.out[/font] and then executed [font=“Courier New”]./test.out[/font] repeatedly (more than ten times), but the problem is still there. I think I may need to repeat the test on another GPU.
I compiled it with [font=“Courier New”]nvcc -arch=sm_13 test.cu -o test[/font] and with plain [font=“Courier New”]nvcc test.cu -o test[/font] (having copied cutil.h into the directory first); both executables ran correctly more than ten times.