My CUDA program contains multiple kernels. After each kernel, I call cudaThreadSynchronize(). The program compiles without errors, but when it reaches cudaThreadSynchronize() it cannot get past it. If I remove all cudaThreadSynchronize() calls, the program passes through the kernels and stops at cudaMemcpy. I think this is because cudaMemcpy synchronizes with all previous calls by default.
If it is a kernel that causes the problem, then why can the program pass through successive kernels? Does the next kernel launch also ensure that all previous kernel calls have finished? I searched the forums for a long time but could not find any solution.
Thanks for any help.
No, it does not. You can queue hundreds of kernels for launch before the first one finishes.
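For example, something along these lines returns from the loop almost immediately on the host, even though the kernels take much longer to run (the kernel and launch configuration are just placeholders):

__global__ void myKernel(float *data) { /* ... */ }

void run(float *d_data)
{
    for (int i = 0; i < 100; ++i)
        myKernel<<<128, 256>>>(d_data);   // each launch only queues work and returns
    cudaThreadSynchronize();              // the host does not block until here
}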
Could it be a lack of registers vs. block size in some specific kernel?
What is the output when compiling with --ptxas-options=-v?
best regards
Cristobal
In that case cudaThreadSynchronize() would return with an error code instead of just hanging.
More likely it is an endless loop in the kernel, or a kernel that just takes very long to finish.
[quote name='neoideo' date='29 January 2011 - 04:11 AM' timestamp='1296254486' post='1185277']
Could it be a lack of registers vs. block size in some specific kernel?
What is the output when compiling with --ptxas-options=-v?
[/quote]
Here is the compilation output with --ptxas-options=-v:
ptxas info : Used 21 registers, 60+16 bytes smem, 24 bytes cmem[0], 32 bytes cmem[1], 8 bytes cmem[14]
ptxas info : Compiling entry function ‘_Z43Compute_currentsP14current_paramsPfS1_Pif’ for ‘sm_10’
ptxas info : Used 21 registers, 36+16 bytes smem, 24 bytes cmem[0], 460 bytes cmem[1], 8 bytes cmem[14]
ptxas info : Compiling entry function ‘_Z19All_initializationsP14current_paramsPfPiS1_S1_S2_S2_S2’ for ‘sm_10’
ptxas info : Used 19 registers, 56+0 bytes lmem, 64+16 bytes smem, 24 bytes cmem[0], 132 bytes cmem[1], 8 bytes cmem[14]
ptxas info : Compiling entry function ‘_Z25Boundary_ConditionsPfS_S’ for ‘sm_10’
ptxas info : Used 11 registers, 24+16 bytes smem, 24 bytes cmem[0], 16 bytes cmem[1], 8 bytes cmem[14]
When I reduce the grid size, keeping the block dimensions the same, the program runs.
[quote]
No, it does not. You can queue hundreds of kernels for launch before the first one finishes.
[/quote]
Thank you for this information. I didn't know that. So, should I give cudaThreadSynchronize() in between kernel launches all the time, since I cannot assume the first kernel is finished before the second one is launched?
No, that is a bad idea. Kernels are queued, which means they will always run in order. cudaThreadSynchronize() is only for when you need to sync the HOST with the GPU (e.g. after an asynchronous memcpy).
As others in this thread have said, you probably have an infinite loop in a kernel. The host doesn't lock up until cudaThreadSynchronize() or an implicit sync in a cudaMemcpy. For error-checking purposes only, it can be helpful to add cudaThreadSynchronize() after every kernel call to narrow down the location of the root cause.
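For what it is worth, the pattern I mean is roughly this (the kernel name and launch configuration are just placeholders for your own):

#include <cstdio>

__global__ void myKernel(float *data) { /* ... */ }

void launchAndCheck(float *d_data)
{
    myKernel<<<64, 256>>>(d_data);

    // Catches launch/configuration errors (invalid grid size, too many resources, ...)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Blocks the host until the kernel has finished; a hang or an execution
    // error (like the one discussed here) shows up at this point.
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
}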
Yes, I had a pretty long loop inside the kernel, which caused all the trouble. But the interesting thing is that when I removed that loop from the kernel and ran it on the host instead, it executed within one second, whereas inside the kernel it took hours to complete.
Thank you for all replies.
I have the same problem. I use an atomic add in my loop. If I comment it out, it no longer hangs.
I use this for my atomic add:

__device__ float pbatomicAdd(float *address, float value)
{
    int oldval, newval, readback;

    // Read the current value and compute the desired new value.
    oldval = __float_as_int(*address);
    newval = __float_as_int(__int_as_float(oldval) + value);

    // Retry with compare-and-swap until no other thread has modified
    // the value between our read and our write.
    while ((readback = atomicCAS((int *)address, oldval, newval)) != oldval) {
        oldval = readback;
        newval = __float_as_int(__int_as_float(oldval) + value);
    }
    return __int_as_float(oldval);
}
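The loop basically calls it like this (a simplified sketch, not my actual kernel; all threads add into the same global variable):

__global__ void accumulate(const float *in, float *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pbatomicAdd(sum, in[i]);   // every thread contends for the same address
}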
Are you sure the program hangs, or does it just need a LONG time to complete because of the penalty of the atomic operation?
I experimented a little bit with what was added in the atomic add, and I also had cases (on Windows 7) where my whole driver was restarted.
Now I have changed my homegrown atomicAdd(float*, float) to the official compute capability 2.0 one, and that works fine. That did make me discover that one of my input numbers is NaN; maybe that has something to do with it. The 2.0 version is also quite a bit faster than the one using CAS. Do you know if the CAS-based one is always guaranteed to work, also if several threads add the same number? Could they get confirmation of their success from another thread's add, making some threads stay forever in the while loop?
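For reference, one way to keep both versions around is to pick between them at compile time (a sketch, assuming the pbatomicAdd() from my earlier post):

__device__ float atomicAddFloat(float *address, float value)
{
#if __CUDA_ARCH__ >= 200
    return atomicAdd(address, value);     // hardware float atomicAdd on compute capability 2.0+
#else
    return pbatomicAdd(address, value);   // CAS-based fallback for older GPUs
#endif
}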
The pbatomicAdd() routine you give is guaranteed to always make progress: from the perspective of one specific thread, on every iteration of the loop at least one other thread will have successfully finished its addition. So deadlocks are not possible. However, there is no fairness, so newly incoming threads that also update the same variable can potentially stall a thread forever (or for however long new updates keep arriving from other threads). This situation is called a livelock.
You should use the above routine only when there is light contention for the variable. The same probably applies to the built-in atomicAdd, but on a different scale since it is so much faster. If there is strong contention, you should change your algorithm to use a reduction, or at least distribute the contention between multiple variables that are summed later (e.g., use one variable per block in shared memory, and add it back to the global variable only at the end of the block).
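To illustrate the per-block variant, here is a rough sketch (assuming a compute capability 2.0 device, so that atomicAdd() also works on floats in shared memory; the names are made up):

__global__ void accumulate(const float *in, float *globalSum, int n)
{
    __shared__ float blockSum;
    if (threadIdx.x == 0)
        blockSum = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&blockSum, in[i]);       // contention stays within one block, in shared memory
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(globalSum, blockSum);    // a single global update per block
}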
Contention is quite light, but for this specific problem I could probably solve it in another way using a reduction. For some other problems, reducing later on gives you a memory bottleneck: many threads would output mostly zeros, requiring a lot of memory before they are all summed.