Indeed, this code was translated from my early version written with Brook (BrookGPU). In Brook, all kernels are launched in a non-blocking way: the call returns immediately whether or not the kernel has completed. One has to call streamWrite() to wait for all threads to complete and to retrieve the results. I don't know how this is implemented in CUDA: does Update_stepx() return immediately as in Brook, or does it only return when it completes? Do I need to worry about thread synchronization in the above loop?
First off, this won't work. The maximum allowed block size is 512 threads, while 128*128 = 16k. Use smaller blocks and lots of them.
In CUDA, kernels are launched asynchronously (without blocking), but they are queued, and each one will not start until the preceding one has finished. Memcopies implicitly block until the last kernel has finished, so you don't need to explicitly synchronize before copying back the results. The loop is fine; no extra synchronization is needed.
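A minimal sketch of that behavior (the kernel name Update_stepx is from the thread; the body, sizes, and loop count are placeholders of my own):

```cuda
#include <cstdio>

// Stand-in for the real update kernel from the thread.
__global__ void Update_stepx(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;  // placeholder for the real work
}

int main()
{
    float *d_data;
    float h_data[4096];
    cudaMalloc((void **)&d_data, sizeof(h_data));
    cudaMemset(d_data, 0, sizeof(h_data));

    for (int step = 0; step < 10; ++step) {
        // Returns immediately; launches are queued in issue order,
        // and each kernel waits for the previous one to finish.
        Update_stepx<<<32, 128>>>(d_data);  // 32*128 = 4096 threads
    }

    // cudaMemcpy blocks until all queued kernels complete, so no
    // explicit cudaThreadSynchronize() is needed before this copy.
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    return 0;
}
```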
Thank you both for your replies. Surprisingly, CUDA did not crash or give any error message for the wrong block size :(. After changing the grid/block sizes to proper values, and going through a series of frustrating index calculations, I finally got correct answers. Unfortunately, I found the CUDA code is about 3 times slower than my original Brook code (which is 40x faster than the CPU with an 8800GT card). I posted more questions at http://forums.nvidia.com/index.php?showtopic=88645
One follow-up question about grid and block sizes: my data is stored on a 2D texture of size (1284, 1284), and now I define
This gave me the correct answer. Does gridsize*blocksize have to be the same as my texture size? I mean, what will happen if I use
to call my kernels?
I played with different sizes in my code; in many cases, CUDA simply skipped execution and returned an empty array. This is very annoying. How can I make it throw an exception or exit with an error message when the block size/grid size are not valid?
If you had checked for errors, you would have found that CUDA reported "too many resources requested for launch".
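One common pattern for surfacing those errors (the macro name is my own invention; adapt as you like). Note that kernel launches themselves return nothing, so you have to check cudaGetLastError() right after the launch:

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call reports an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage (my_kernel, grid, block are placeholders):
//   my_kernel<<<grid, block>>>(d_data);
//   CUDA_CHECK(cudaGetLastError());       // invalid configuration, etc.
//   CUDA_CHECK(cudaThreadSynchronize());  // errors during execution
```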
I would advise you to read the programming guide in detail; some assumptions in Brook may be completely different in CUDA, like your block size of 16. That alone can account for a two-fold slowdown.
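On the earlier grid-size question: a common pattern is to use small 2D blocks (e.g. 16x16 = 256 threads), round the grid up with a ceiling division, and guard out-of-range threads inside the kernel, so gridsize*blocksize does not have to exactly equal the texture size. A sketch under those assumptions (all names and dimensions here are placeholders, not from the original code):

```cuda
// Placeholder dimensions; substitute your actual texture size.
#define WIDTH  1284
#define HEIGHT 1284

__global__ void process(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;                     // skip threads past the edge
    out[y * width + x] = 0.0f;      // real work goes here
}

void launch(float *d_out)
{
    dim3 block(16, 16);             // 256 threads, under the 512 limit
    // Ceiling division so partial tiles at the edges are covered.
    dim3 grid((WIDTH  + block.x - 1) / block.x,
              (HEIGHT + block.y - 1) / block.y);
    process<<<grid, block>>>(d_out, WIDTH, HEIGHT);
}
```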