Double For Loop Very Slow

Hi All,

for (int i = 0; i < 100; i++)
{
    kernel1<<<...>>>(...);
    kernel2<<<...>>>(...);
    for (int j = 0; j < 1000; q++)
    {
        kernel3<<<...>>>(...);
    }
}

I am trying something like the above code. kernel1 and kernel2 compute some data that kernel3 will use. I experience slow performance once execution enters the second for loop.

Is this expected?

You should provide more details on your case, but one question right away: is it intentional that you loop over j but increment q in the second loop? If j is never modified inside the loop, it will loop forever.

Also, do you sync threads after the kernel calls?

How are you determining that it’s very slow?

Hi,

There is a mistake; it should be (j++) instead of (q++).

I don’t call __syncthreads() after the kernel calls.

I am comparing my code with Intel IPP. My kernels are written so that they should produce the same output as the Intel IPP functions.

If I run the IPP version, the functions complete very quickly. However, if I use my code, it is very slow.

I start timing at the start of the first for loop and stop once all the loops finish.
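As a side note on timing: because kernel launches return immediately, a reliable way to time GPU work is with CUDA events, synchronizing on the stop event before reading the elapsed time. Below is a minimal sketch of that pattern; the kernel body is a placeholder standing in for the real kernel3, and the grid/block sizes are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real kernel3.
__global__ void kernel3(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] += 1.0f;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int j = 0; j < 1000; j++)
        kernel3<<<1, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until all launched work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Timing with host clocks but without a final synchronization only measures launch overhead, not the actual kernel execution time.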

__syncthreads() is an instruction called from a kernel and it means “synchronize all threads in this block”.

I am talking about cudaThreadSynchronize(), a function called in the host code that stands for “synchronize all threads across the grid”.

This is needed because kernels are launched asynchronously: control returns to the host right after the kernel is launched, not after it finishes. If you want to run many kernels in sequence and be sure the previous iteration has completed, you should call cudaThreadSynchronize().

for (int i = 0; i < 100; i++)
{
    kernel1<<<...>>>(...);
    cudaThreadSynchronize();
    kernel2<<<...>>>(...);
    cudaThreadSynchronize();
    for (int j = 0; j < 1000; j++)
    {
        kernel3<<<...>>>(...);
        cudaThreadSynchronize();
    }
}

This isn’t needed if the kernels are independent, but in your case you need to be sure kernel1 and kernel2 finish, because they compute data that kernel3 uses. And I presume you also want to wait for kernel3 to finish before giving the GPU another launch of kernel3, because otherwise your data might get mixed up.

Could you not just call cudaThreadSynchronize() after the kernel2 call? That way kernel1 and kernel2 can run in parallel and then synchronize once they are complete, before kernel3.

Even though the kernel calls return asynchronously, they still execute in order on the device (unless you are using the stream API and assign kernels to different streams). As far as I know, you never need cudaThreadSynchronize() to ensure correct execution. The function is primarily interesting for benchmarking purposes.

For similar reasons, memory copies (again, assuming you aren’t assigning the kernel and copy operation to different streams) also run sequentially, which means a cudaMemcpy() will block until the previous kernels are finished.

If some hypothetical future CUDA device can run kernels in parallel, it will only be able to do so if the kernels have different stream numbers.
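For illustration, assigning kernels to different streams looks like the sketch below. kernelA and kernelB are placeholders; the fourth launch parameter is the stream, and work in different streams is allowed to overlap on hardware that supports it, while work within one stream stays ordered.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels; each stream gets independent data.
__global__ void kernelA(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] += 1.0f; }

int main()
{
    float *d_x, *d_y;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMalloc(&d_y, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launches in different streams may overlap; within one
    // stream, launches execute in issue order.
    kernelA<<<1, 256, 0, s1>>>(d_x);
    kernelB<<<1, 256, 0, s2>>>(d_y);

    // Wait for each stream's work to finish.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```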

I didn’t know that :)