Strange behavior of kernel calls

Hello,

I have written a C program (depth-first search) for solving a puzzle. I used the CUDA profiler to test the program, and the profiler shows that only the first cudaMalloc calls are executed and nothing more (no kernels …). In reality, the program executes the cudaMalloc calls and
the first kernel call at point 6 (pseudo-code below), and after it tries to execute kernel calls 2, 3, 4 at point 9, the profiler shows
no kernel execution info. If I remove kernel calls 2, 3, 4, then the profiler shows the execution of kernel call 1.

The pseudo-code of the problematic section (a real-code sketch of points 1-8 follows right after the list):

  1. pointer1 = cudaMalloc(500 KB of global GPU RAM)
  2. pointer2 = cudaMalloc(4 bytes of global GPU RAM, for an unsigned int used to transfer a parameter to and from GPU memory)
  3. jump to a subfunction
  4. unsigned int a = 1
  5. cudaMemcpy(pointer2, &a, 4, HostToDevice)
  6. kernel call bla<<<16,9>>>(x, y, pointer1, pointer2) // tests something; if a thread detects an error, it sets *pointer2 = 0
    7. cudaMemcpy(&a, pointer2, 4, DeviceToHost)
    8. only continue if a == 1, else return
  9. 3 kernel calls like this (the program reaches this position):
    bla2<<<16,350>>>(x, y, pointer1, start)
    recalculate new "start"
    bla2<<<16,350>>>(x, y, pointer1, start)
    recalculate new "start"
    bla2<<<16,350>>>(x, y, pointer1, start)
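
In real code, points 1-8 would look roughly like this (a simplified sketch; bla, x, y, and the test itself are placeholders from my pseudo-code, and error checking is left out):

#include <cuda_runtime.h>

__global__ void bla(int x, int y, unsigned char *buf, unsigned int *ok)
{
    int error_found = 0;   // placeholder for the real test
    if (error_found)
        *ok = 0;           // any failing thread clears the flag
}

int subfunction(int x, int y, unsigned char *pointer1, unsigned int *pointer2)
{
    unsigned int a = 1;
    cudaMemcpy(pointer2, &a, sizeof(a), cudaMemcpyHostToDevice);   // point 5

    bla<<<16, 9>>>(x, y, pointer1, pointer2);                      // point 6

    // point 7: this blocking copy waits for bla to finish first
    cudaMemcpy(&a, pointer2, sizeof(a), cudaMemcpyDeviceToHost);

    if (a != 1)                                                    // point 8
        return 1;

    // point 9: the three bla2 calls would follow here
    return 0;
}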

None of the 3 kernels at point 9 is executed. But if I write one of the 3 calls alone on the first line of the subfunction and return right after it,
it is called!

This works:

subfunction start:
bla2<<<16,350>>>(x, y, pointer1, start)
return 0

Is it true that the host program, after calling a kernel, waits until ALL threads inside it are finished? If not, how can I achieve this?
It is important, because the code should only continue if the test at point 6 (pseudo-code) is passed, and the test result is
only final when all threads have finished.

How can I calculate the memory usage of a kernel? Is the kernel code stored in global GPU memory, and are its variables
stored in the shared memory of the multiprocessor the kernel runs on?

Is it necessary to encapsulate every cudaMalloc and cudaMemcpy in this CUDA_SAFE_CALL macro?
And why cudaThreadSynchronize?

My system:
Athlon 64 X2 with 2 GB RAM
OS: Fedora 8, 64-bit
CUDA driver 177.73 (64-bit), CUDA SDK 2.0, profiler 1.0
GPU: GeForce 9800 GTX, 512 MB

Sorry, I'm new to GPU coding, and the tutorials don't always answer every question.

OK, solved. It was a simple coding bug.
Now I have to find out why it freezes after a couple of iterations…

No, a kernel launch is asynchronous.
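
The launch returns immediately while the kernel runs in the background. Your flag check still works, though, because an ordinary (blocking) cudaMemcpy issued after the launch waits for the kernel to finish before copying. A minimal sketch of both ways to wait (check_kernel is a placeholder):

__global__ void check_kernel(unsigned int *flag) { /* placeholder test */ }

void wait_for_kernel(unsigned int *d_flag)
{
    unsigned int a;

    check_kernel<<<16, 9>>>(d_flag);   // returns immediately

    cudaThreadSynchronize();           // option 1: block until the kernel is done

    // option 2: a blocking copy implicitly waits for the kernel as well
    cudaMemcpy(&a, d_flag, sizeof(a), cudaMemcpyDeviceToHost);
}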

Kernel variables are stored in registers unless you explicitly place them in shared memory. If you use too many, some spill into local memory. Arguments to kernels are passed through shared memory. Nominal values for local and shared memory usage can be obtained by looking at the respective kernel's entry in the .cubin file.
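
As an illustration, a register variable versus an explicitly shared one (an arbitrary sketch kernel, assuming 64 threads per block); compiling with nvcc --ptxas-options=-v should also print the per-kernel register, shared, and local memory usage:

__global__ void example(float *out)
{
    float r = out[threadIdx.x];   // scalar, lives in a register
    __shared__ float s[64];       // explicitly placed in shared memory

    s[threadIdx.x] = r * 2.0f;
    __syncthreads();
    out[threadIdx.x] = s[threadIdx.x];
}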

If something is fishy, build in debug mode and these macros will report the possible errors.
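
CUDA_SAFE_CALL from the SDK's cutil.h is roughly a wrapper like this (a sketch, not the SDK's exact definition):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define MY_SAFE_CALL(call)                                          \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// usage: MY_SAFE_CALL(cudaMalloc((void **)&pointer1, 500 * 1024));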

…in case you want to make sure a kernel has executed before your CPU code can advance.
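
Since the launch itself is asynchronous, this is also the usual way to catch kernel errors (a sketch; my_kernel is a placeholder):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void my_kernel(void) { /* placeholder */ }

void run_checked(void)
{
    my_kernel<<<16, 350>>>();
    cudaThreadSynchronize();                 // wait for the kernel to finish
    cudaError_t err = cudaGetLastError();    // launch/execution errors show up here
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
}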

It takes time, but they do answer most of them. Besides, I found this forum invaluable. There are some people here who should be decorated by NVidia.