Very strange problem on CUDA

Hi all,

I did a very simple test on a Tesla recently and got a very strange result. As you can see in my code, I do a simple matrix add on the GPU using float4, decomposing the whole matrix into 64*64 tiles. When I run it on the Tesla it gives an incorrect result once or twice out of every ten runs, but under emulation mode it always runs correctly.

At first I thought the problem might be synchronization, so I added __syncthreads() after each statement of the add kernel, but the problem is still there. This makes me wonder whether it is a bug in the hardware scheduling. Could anybody help? Thanks.
test.cu (2.39 KB)
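
In case it helps, the add kernel is roughly along the lines of the sketch below (the kernel name, launch geometry and pitch parameter here are only illustrative; the real code is in the attached test.cu):

// One 64*64 tile added with float4 loads: 64 rows of 16 float4s each.
// Launched e.g. as addTile<<<dim3(1, 4), dim3(16, 16)>>>(a, b, c, 16).
__global__ void addTile(const float4 *a, const float4 *b, float4 *c, int pitchInFloat4)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // float4 column inside the tile
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row inside the tile
    int i = y * pitchInFloat4 + x;

    float4 va = a[i];
    float4 vb = b[i];
    c[i] = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
}

Each thread only touches its own element, so no __syncthreads() should be needed for correctness.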

By the way, what is cudaThreadSynchronize() for? Under what circumstances does it take effect? Shouldn't all the threads have finished their work automatically by the time we return to the host code? Thanks.

To answer the second question: all kernel launches are asynchronous - the launch always returns immediately, whether the kernel is finished or not. That is why we have a synchronization primitive.
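
For example (a minimal self-contained sketch, not tied to your code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *p) { p[threadIdx.x] = threadIdx.x; }

int main()
{
    int *d_p;
    cudaMalloc((void **)&d_p, 32 * sizeof(int));

    touch<<<1, 32>>>(d_p);                       // returns to the host immediately

    cudaError_t err = cudaThreadSynchronize();   // blocks until the kernel has actually finished
    if (err != cudaSuccess)                      // and reports any error the kernel raised
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_p);
    return 0;
}

As a side benefit, the return value of cudaThreadSynchronize() is a convenient place to pick up errors from the preceding launch.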

To add to what aviday said: usually you only need it for benchmarking. Right after a kernel launch, most people cudaMemcpy() the results back to the host, and cudaMemcpy() implicitly waits until previous kernels have finished before performing the copy.
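
A rough timing sketch of what I mean (clock() here just stands in for whatever host timer you prefer):

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void scale(float *p, float s) { p[threadIdx.x] *= s; }

int main()
{
    const int n = 256;
    float h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    clock_t t0 = clock();
    scale<<<1, n>>>(d, 2.0f);
    cudaThreadSynchronize();            // without this, t1 - t0 would only measure the launch overhead
    clock_t t1 = clock();
    printf("kernel time: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    // For correctness alone the explicit sync is not needed:
    // cudaMemcpy waits for earlier kernels before it performs the copy.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    return 0;
}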

Nothing looks obviously wrong, and it runs through Ocelot and valgrind with no memory errors or uninitialized values…

Thanks, everyone. I think I'll have to test it on another GPU now.

Are there any issues that valgrind catches and Ocelot misses?

The main thing is that valgrind catches uninitialized values loaded from memory that change the control flow or the addresses of memory operations. Ocelot could potentially detect this as well, but doing it naively has a huge overhead and doing it cleverly is relatively complicated.
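
A plain host-side analogue of what memcheck flags (a made-up minimal example, nothing to do with the posted code):

#include <cstdio>
#include <cstdlib>

int main()
{
    int *flag = (int *)malloc(sizeof(int));  // allocated but never written
    if (*flag)                               // valgrind: "Conditional jump or move depends on uninitialised value(s)"
        printf("taken\n");
    free(flag);
    return 0;
}

Roughly speaking, tracking this means tagging every value with a shadow "defined" bit and propagating it through every instruction, which is where the naive overhead comes from.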