Double For Loop Very Slow

Hi All,

for (int i = 0; i < 100; i++)
{
    kernel1<<<...>>>(...);
    kernel2<<<...>>>(...);
    for (int j = 0; j < 1000; q++)
    {
        kernel3<<<...>>>(...);
    }
}

I am trying something like the above code. kernel1 and kernel2 compute some data that kernel3 will use. I experience slow performance once execution enters the second for loop.

Is this expected?

You should provide more details on your case, but one question right away: is it intentional that you loop over j but increment q in the second loop? If j is never modified inside the loop, it will loop forever.

Also, do you sync threads after the kernel calls?

How are you determining that it’s very slow?

Hi,

There is a mistake; it should be (j++) instead of (q++).

I don’t call __syncthreads() after the kernel calls.

I am comparing my code with Intel IPP. My kernels are written so that they should produce the same output as the Intel IPP functions.

If I run the IPP version, the functions complete very quickly. However, if I use my code, it is very slow.

I start timing at the start of the first for loop and stop once all the loops finish.
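As a side note on timing: because kernel launches return immediately, a reliable way to time GPU work is with CUDA events, synchronizing on the stop event before reading the elapsed time. Below is a minimal sketch of that pattern; the kernel body is a placeholder standing in for the real kernel3, and the grid/block sizes are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real kernel3.
__global__ void kernel3(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] += 1.0f;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int j = 0; j < 1000; j++)
        kernel3<<<1, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until all launched work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Timing with host clocks but without a final synchronization only measures launch overhead, not the actual kernel execution time.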

__syncthreads() is an instruction called from a kernel and it means “synchronize all threads in this block”.

I am talking about cudaThreadSynchronize(), a function called in the host code that stands for “synchronize all threads across the grid”.

This is needed because kernels are launched asynchronously: control returns to the host right after the kernel is launched, not after it finishes. If you want to run many kernels in sequence and be sure the previous iteration has completed, you should call cudaThreadSynchronize().

for (int i = 0; i < 100; i++)
{
    kernel1<<<...>>>(...);
    cudaThreadSynchronize();
    kernel2<<<...>>>(...);
    cudaThreadSynchronize();
    for (int j = 0; j < 1000; j++)
    {
        kernel3<<<...>>>(...);
        cudaThreadSynchronize();
    }
}

This isn’t needed if the kernels are independent, but in your case you need to be sure kernel1 and kernel2 finish, because they compute data that kernel3 uses. And I presume you also want to wait for kernel3 to finish before giving the GPU another launch of kernel3, because otherwise your data might get mixed up.

Could you not just call cudaThreadSynchronize() after the kernel2 call? That way kernel1 and kernel2 can run in parallel and then synchronize once they are complete, before kernel3.

Even though the kernel calls return asynchronously, they still execute in order on the device (unless you are using the stream API and assign kernels to different streams). As far as I know, you never need cudaThreadSynchronize() to ensure correct execution. The function is primarily interesting for benchmarking purposes.

For similar reasons, memory copies (again, assuming you aren’t assigning the kernel and copy operation to different streams) also run sequentially, which means a cudaMemcpy() will block until the previous kernels are finished.

If some hypothetical future CUDA device can run kernels in parallel, it will only be able to do so if the kernels have different stream numbers.
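For illustration, assigning kernels to different streams looks like the sketch below. kernelA and kernelB are placeholders; the fourth launch parameter is the stream, and work in different streams is allowed to overlap on hardware that supports it, while work within one stream stays ordered.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels; each stream gets independent data.
__global__ void kernelA(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] += 1.0f; }

int main()
{
    float *d_x, *d_y;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMalloc(&d_y, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launches in different streams may overlap; within one
    // stream, launches execute in issue order.
    kernelA<<<1, 256, 0, s1>>>(d_x);
    kernelB<<<1, 256, 0, s2>>>(d_y);

    // Wait for each stream's work to finish.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```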

I didn’t know that :)