Very strange problem on CUDA

Hi all,

I did a very simple test on a Tesla recently and got a very strange result. As you can see in my code, I do a simple matrix add on the GPU using float4, decomposing the whole matrix into 64*64 tiles. When I run it on the Tesla it gives an incorrect result once or twice out of every ten runs, but under emulation mode it always runs correctly.

At first I thought the problem might be synchronization, so I added __syncthreads() after each statement of the add kernel, but the problem is still there. This makes me wonder whether it is a bug in the hardware scheduling. Could anybody help? Thanks.
test.cu (2.39 KB)
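
In case it helps, the add kernel is roughly along the lines of the sketch below (the kernel name, launch geometry and pitch parameter here are only illustrative; the real code is in the attached test.cu):

// One 64*64 tile added with float4 loads: 64 rows of 16 float4s each.
// Launched e.g. as addTile<<<dim3(1, 4), dim3(16, 16)>>>(a, b, c, 16).
__global__ void addTile(const float4 *a, const float4 *b, float4 *c, int pitchInFloat4)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // float4 column inside the tile
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row inside the tile
    int i = y * pitchInFloat4 + x;

    float4 va = a[i];
    float4 vb = b[i];
    c[i] = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
}

Each thread only touches its own element, so no __syncthreads() should be needed for correctness.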

By the way, what is cudaThreadSynchronize() for? Under what circumstances does it take effect? Shouldn't all the threads have finished their work automatically by the time we return to the host code? Thanks.

To answer the second question: all kernel launches are asynchronous - the launch always returns immediately, whether the kernel is finished or not. That is why we have a synchronization primitive.
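
For example (a minimal self-contained sketch, not tied to your code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *p) { p[threadIdx.x] = threadIdx.x; }

int main()
{
    int *d_p;
    cudaMalloc((void **)&d_p, 32 * sizeof(int));

    touch<<<1, 32>>>(d_p);                       // returns to the host immediately

    cudaError_t err = cudaThreadSynchronize();   // blocks until the kernel has actually finished
    if (err != cudaSuccess)                      // and reports any error the kernel raised
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_p);
    return 0;
}

As a side benefit, the return value of cudaThreadSynchronize() is a convenient place to pick up errors from the preceding launch.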

To add to what aviday said: usually you only need it for benchmarking. Right after a kernel launch, most people cudaMemcpy() the results back to the host, and cudaMemcpy() implicitly waits until previous kernels have finished before performing the copy.
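
A rough timing sketch of what I mean (clock() here just stands in for whatever host timer you prefer):

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void scale(float *p, float s) { p[threadIdx.x] *= s; }

int main()
{
    const int n = 256;
    float h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    clock_t t0 = clock();
    scale<<<1, n>>>(d, 2.0f);
    cudaThreadSynchronize();            // without this, t1 - t0 would only measure the launch overhead
    clock_t t1 = clock();
    printf("kernel time: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    // For correctness alone the explicit sync is not needed:
    // cudaMemcpy waits for earlier kernels before it performs the copy.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    return 0;
}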

Nothing looks obviously wrong, and it runs through Ocelot and valgrind with no memory errors or uninitialized values…

Thanks, everyone. I think I'll have to test it on another GPU now.

Are there any issues that valgrind catches and Ocelot misses?

The main thing is that valgrind catches uninitialized values loaded from memory that change the control flow or the addresses of memory operations. Ocelot could potentially detect this as well, but doing it naively has a huge overhead and doing it cleverly is relatively complicated.
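
A plain host-side analogue of what memcheck flags (a made-up minimal example, nothing to do with the posted code):

#include <cstdio>
#include <cstdlib>

int main()
{
    int *flag = (int *)malloc(sizeof(int));  // allocated but never written
    if (*flag)                               // valgrind: "Conditional jump or move depends on uninitialised value(s)"
        printf("taken\n");
    free(flag);
    return 0;
}

Roughly speaking, tracking this means tagging every value with a shadow "defined" bit and propagating it through every instruction, which is where the naive overhead comes from.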