when does a kernel return?

I am new to CUDA, and here is a naive question: I wrote a finite-difference code to do some EM modeling using CUDA. The code looks like this:

[font="Courier New"]dim3 gridsize(4,4);

dim3 blocksize(128,128);

__global__ void Update_step1();

__global__ void Update_step2();

cudaMemcpy(…);

for(i=0;i<max_iter;i++){

Update_step1<<<gridsize,blocksize>>>(data1,data2);

Update_step2<<<gridsize,blocksize>>>(data1,data2);

}

cudaMemcpy(…);

[/font]

Indeed, this code was translated from an earlier version written in Brook (BrookGPU). In Brook, all kernels are launched in a non-blocking way: a kernel call returns immediately, whether or not it has completed. One has to use streamWrite() to wait for all threads to complete and to retrieve the results. I don't know how this is implemented in CUDA: does Update_stepx() return immediately, as in Brook, or does it only return when it completes? Do I need to worry about thread synchronization in the above loop?

thanks

The maximum number of threads per block is 512 (and often fewer, because of the number of registers used).

You are asking for 16k threads per block.

First off, this won't work. The maximum allowed block size is 512 threads, while 128*128 = 16k. Use smaller blocks and lots of them.

In CUDA, kernels are launched asynchronously (the call returns without blocking), but they are queued, and each one will not start until the preceding one has finished. Memcpys implicitly block until the last kernel has finished, so you don't need to synchronize explicitly before copying back the results. Your loop is fine; no extra synchronization is needed.
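To make this concrete, here is a sketch of that pattern (the kernel bodies, pointer names, and sizes are placeholders, not the poster's actual code):

```cuda
// Kernel launches return immediately, but the GPU executes them in issue
// order, so step2 of iteration i never overlaps step1 of iteration i.
__global__ void Update_step1(float *data1, float *data2) { /* ... */ }
__global__ void Update_step2(float *data1, float *data2) { /* ... */ }

void run(float *d_data1, float *d_data2, float *h_result,
         size_t bytes, int max_iter)
{
    dim3 gridsize(32, 32);
    dim3 blocksize(16, 16);  // 256 threads, within the 512-thread limit

    for (int i = 0; i < max_iter; i++) {
        Update_step1<<<gridsize, blocksize>>>(d_data1, d_data2); // returns at once
        Update_step2<<<gridsize, blocksize>>>(d_data1, d_data2); // queued after step1
    }

    // Blocking copy: it implicitly waits until every queued kernel has
    // finished, so no explicit cudaThreadSynchronize() is needed here.
    cudaMemcpy(h_result, d_data1, bytes, cudaMemcpyDeviceToHost);
}
```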

Thank you both for your replies. Surprisingly, CUDA did not crash or give any error message for the wrong block size :(. After changing the grid/block size to proper values, and going through a series of frustrating index calculations, I finally got correct answers. Unfortunately, I found the CUDA code is about 3 times slower than my original Brook code (which is 40x faster than the CPU with an 8800GT card). I posted more questions at http://forums.nvidia.com/index.php?showtopic=88645

One follow-up question about grid and block size: my data is stored in a 2D texture of size (128*4, 128*4), and I define
dim3 gridsize(128,128);
dim3 blocksize(4,4);
This gives me the correct answer. Does gridsize*blocksize have to be the same as my texture size? I mean, what will happen if I use
dim3 gridsize(128,128);
dim3 blocksize(1,1);
to call my kernels?

I played with different sizes in my code, and in many cases CUDA simply skipped execution and returned an empty array. This is very annoying. How can I make it throw an exception or exit with an error message when the blocksize/gridsize are not valid?

4x4 is not a very good size for a block either, and 1x1 is even worse: that's fewer than 32 threads, the size of one warp. You want your blocks to be 32*N threads in size (N >= 1). 8x8 seems natural, or 16x16.
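A common way to pick the launch configuration is to fix a warp-friendly block size and derive the grid from the data size. Here is a sketch along those lines (WIDTH/HEIGHT and the kernel body are placeholders; the ceiling division plus the bounds check also handles sizes that are not an exact multiple of the block size):

```cuda
#define WIDTH  512
#define HEIGHT 512

__global__ void Update(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {      // guard the padded threads
        data[y * width + x] += 1.0f;    // placeholder update
    }
}

void launch(float *d_data)
{
    dim3 blocksize(16, 16);             // 256 threads = 8 full warps
    dim3 gridsize((WIDTH  + blocksize.x - 1) / blocksize.x,
                  (HEIGHT + blocksize.y - 1) / blocksize.y);
    Update<<<gridsize, blocksize>>>(d_data, WIDTH, HEIGHT);
}
```

With these numbers the grid is 32x32 blocks, covering the full 512x512 array with one thread per element.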

As for speed, care to post your kernel code? The slow speed and the weird behaviour with empty output suggest you have problems within the kernels.

If you had checked for errors, you would have found that CUDA reported "too many resources requested for launch".
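A minimal error-checking pattern looks something like this (the macro name is my own, not from any library). A launch with an invalid configuration shows up in cudaGetLastError() right after the launch; errors that occur during execution surface at the next synchronizing call:

```cuda
#include <cstdio>
#include <cstdlib>

// Check the result of a CUDA runtime call and abort with a message on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Typical use after a kernel launch:
//   Update_step1<<<gridsize, blocksize>>>(data1, data2);
//   CUDA_CHECK(cudaGetLastError());        // catches an invalid launch config
//   CUDA_CHECK(cudaThreadSynchronize());   // catches errors during execution
```

With an oversized block such as 128x128 the cudaGetLastError() check would fail immediately instead of silently returning untouched output.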

I would advise you to read the programming guide in detail; some assumptions in Brook may be completely different in CUDA, like your block size of 16 threads. That alone can account for a two-fold slowdown.