Program hangs at cudaThreadSynchronize()

My CUDA program contains multiple kernels. After each kernel, I call cudaThreadSynchronize(). The program compiles without errors, but when it reaches cudaThreadSynchronize() it cannot get past it. If I remove all cudaThreadSynchronize() calls, the program passes through the kernels and then stops at cudaMemcpy. I think that is because cudaMemcpy synchronizes with all previous calls by default.
If it is a kernel that causes the problem, then why can the program pass through the successive kernel launches? Does the next kernel launch also ensure that all previous kernel calls are finished? I searched the forums for a long time, but could not find any solution.
Thanks for any help.

No, it does not. You can queue hundreds of kernels for launch before the first one finishes.
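For illustration, a minimal sketch of what that means (kernelA/kernelB/kernelC and the buffers are made-up names):

// Launches return immediately; the driver queues the kernels and runs
// them one after another, in the order they were issued.
kernelA<<<grid, block>>>(d_data);
kernelB<<<grid, block>>>(d_data);
kernelC<<<grid, block>>>(d_data);

// The host only blocks here: a plain (blocking) cudaMemcpy implicitly
// waits for all previously queued device work to finish first.
cudaMemcpy(h_result, d_data, nbytes, cudaMemcpyDeviceToHost);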

Could it be a lack of registers versus block size in some specific kernel?

What is the output when compiling with --ptxas-options=-v?
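For reference, an invocation along these lines (file name made up) prints those per-kernel statistics:

nvcc -arch=sm_10 --ptxas-options=-v -c mykernels.cu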

best regards

Cristobal

In that case cudaThreadSynchronize() would return with an error code instead of just hanging.

It is more likely an endless loop in the kernel, or a kernel that simply takes very long to finish.

[quote name='neoideo' date='29 January 2011 - 04:11 AM' timestamp='1296254486' post='1185277']

Could it be a lack of registers versus block size in some specific kernel?

What is the output when compiling with --ptxas-options=-v?

[/quote]

Here is the compilation output with --ptxas-options=-v:

ptxas info : Used 21 registers, 60+16 bytes smem, 24 bytes cmem[0], 32 bytes cmem[1], 8 bytes cmem[14]

ptxas info : Compiling entry function '_Z43Compute_currentsP14current_paramsPfS1_Pif' for 'sm_10'

ptxas info : Used 21 registers, 36+16 bytes smem, 24 bytes cmem[0], 460 bytes cmem[1], 8 bytes cmem[14]

ptxas info : Compiling entry function '_Z19All_initializationsP14current_paramsPfPiS1_S1_S2_S2_S2_' for 'sm_10'

ptxas info : Used 19 registers, 56+0 bytes lmem, 64+16 bytes smem, 24 bytes cmem[0], 132 bytes cmem[1], 8 bytes cmem[14]

ptxas info : Compiling entry function '_Z25Boundary_ConditionsPfS_S_' for 'sm_10'

ptxas info : Used 11 registers, 24+16 bytes smem, 24 bytes cmem[0], 16 bytes cmem[1], 8 bytes cmem[14]

When I reduce the grid size, keeping the block dimensions the same, the program runs.

[quote]

No, it does not. You can queue hundreds of kernels for launch before the first one finishes.

[/quote]

Thank you for this information, I didn't know that. So far I haven't been putting cudaThreadSynchronize() between kernel launches, assuming the first kernel would be finished before the second was launched. Should I be adding it between every launch then?

No, that is a bad idea. Kernels are queued, which means that kernels will always run in order. cudaThreadSynchronize() is only for when you need to sync the HOST with the GPU (e.g., after an asynchronous memcpy).
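For example, a rough sketch of that case (h_out/d_out are made-up names; for the copy to really be asynchronous, h_out must be pinned memory allocated with cudaMallocHost):

// The async copy returns immediately, so the host has to synchronize
// before it is safe to read h_out.
cudaMemcpyAsync(h_out, d_out, nbytes, cudaMemcpyDeviceToHost, 0);
cudaThreadSynchronize();   // h_out is valid from here on
// (in later CUDA releases, cudaDeviceSynchronize() replaces this call)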

As others in this thread have said, you probably have an infinite loop in a kernel. The host doesn't lock up until cudaThreadSynchronize() or an implicit sync in a cudaMemcpy. For error-checking purposes only, it can be helpful to add cudaThreadSynchronize() after every kernel call to narrow down the location of the root cause.
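Something along these lines (the macro and kernel names are just placeholders) can narrow down which launch is the culprit:

// Debug-only helper: sync after a launch and report where an error
// first shows up (needs <stdio.h> for printf). Remove it for release runs.
#define CHECK_LAUNCH(msg)                                              \
    do {                                                               \
        cudaError_t err = cudaThreadSynchronize();                     \
        if (err != cudaSuccess)                                        \
            printf("%s failed: %s\n", msg, cudaGetErrorString(err));   \
    } while (0)

my_kernel<<<grid, block>>>(args);
CHECK_LAUNCH("my_kernel");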

Yes, I had a pretty long loop inside the kernel, which caused all the trouble. The interesting thing is that when I removed that loop from the kernel and ran it on the host, it executed within one second, whereas inside the kernel it took hours to complete.
Thank you for all replies.

I have the same problem. I use an atomic add in my loop; if I comment it out, it no longer hangs.
This is what I use for my atomic add:

__device__ float pbatomicAdd(float *address, float value)
{
    // Emulate a float atomic add with a compare-and-swap loop:
    // reinterpret the float bits as int so that atomicCAS can be used.
    int oldval = __float_as_int(*address);
    int newval = __float_as_int(__int_as_float(oldval) + value);
    int readback;

    // Retry until no other thread changed *address between our read and our CAS.
    while ((readback = atomicCAS((int *)address, oldval, newval)) != oldval) {
        oldval = readback;
        newval = __float_as_int(__int_as_float(oldval) + value);
    }
    return __int_as_float(oldval);   // value before our addition
}

Are you sure the program hangs, or does it just need a LONG time to complete because of the penalty of the atomic operation?

I experimented a little bit with what gets added in the atomic add, and I also found that (on Windows 7) my whole driver gets restarted.

Now I have changed my homegrown atomicAdd(float*, float) to the official one for compute capability 2.0, and that works fine. That did make me discover that one of my input numbers is NaN; maybe that has something to do with it. The built-in one is also quite a bit faster than the one using CAS. Do you know if the CAS-based one is always guaranteed to work, even if several threads add the same number? Could they get confirmation of their success from another thread's add, making some threads stay forever in the while loop?

The pbatomicAdd() routine you give is guaranteed to always make progress: from the perspective of one specific thread, during every iteration of the loop at least one other thread successfully finishes its addition. So deadlocks are not possible. However, there is no fairness, so newly incoming threads that also update the same variable can potentially stall a thread forever (or for as long as new updates keep coming from other threads). This situation is called a livelock.

You should use the above routine only when there is light contention for the variable. The same probably applies to the built-in atomicAdd, but on a different scale, since it is so much faster. If there is strong contention, you should change your algorithm to use a reduction operation, or at least distribute the contention between multiple variables that are summed up later (e.g., use one variable per block in shared memory and add it back to the global variable only at the end of the block, as sketched below).
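A rough sketch of that per-block idea (my own example names; it assumes a compute capability 2.0 device so that the built-in float atomicAdd works on both shared and global memory):

__global__ void sum_kernel(const float *in, float *total, int n)
{
    __shared__ float blockSum;            // one partial sum per block
    if (threadIdx.x == 0)
        blockSum = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&blockSum, in[i]);      // contention stays within one block
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(total, blockSum);       // a single global update per block
}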

Contention is quite light, but for this specific problem I could probably solve it another way using a reduction. For some other problems, reducing later on gives you a memory bottleneck: many threads output mostly zeros, which requires a lot of memory before they are all summed.