CUDA, implicit barrier, sequential point? types of barriers in cuda

**** How many types of barrier (sequential point) does CUDA have?

  1. cudaMemcpy (or similar), explicit barrier

  2. cudaThreadSynchronize, cudaDeviceSynchronize, explicit barrier.

  3. end of kernel function, implicit barrier.

**** If we have a normal function F and we call cuda kernels inside this function, can we assume all kernels are finished end of F?


There is no implicit barrier at the end of a kernel function or in a cudaMemcpy(). Instead, kernels, cudaMemcpy()s and cudaMemcpyAsync()s in the same stream are executed sequentially with respect to each other, but (with the exception of cudaMemcpy()) asynchronously to the host code.

So you cannot assume all kernels have finished at the end of a host function (which by the way has not special meaning to CUDA), you need to explicitly put a cudaStreamSynchronize() or cudaDeviceSynchronize() there.

Thanks, tera.

For multi-GPUs applications using OpenMP, how to synchronize among those GPUs? Any sugguestions?




#pragma omp barrier



Yes, that looks like a good barrier assuming you run CUDA 4.1, each device is handled by one OpenMP thread, and you want to sync all devices and all threads. If you just want to sync the GPUs without caring about the host threads, you can use [font=“Courier New”]cudaStreamWaitEvent()[/font].

Thanks, tera.

Sometimes we have to sync all devices and all threads, like a simple dot product. As you said CUDA 4.1, I am using v4.0 for multi-GPUs/OpenMP, are there any features that v4.0 doesn’t support? Thanks.

Sorry, I mixed up versions there. The OpenMP improvements are already in 4.0, no need to upgrade to 4.1 for that.

Thanks, tera.

I see. I found new features in 4.1. I upgraded to 4.1. Thanks again for your help.

I am a little bit confused, but if the there are no implcit barrier when cudaMemcpy is executed it means that I woud have to call some other sync command to be sure that the data arrived on the host. The cudamemcpy is a blocking command, nothing can be executed in the code until this command is done, so it does act as a barrier.

This means it should work somethinglike this:




and it will have all processes stop at the barrier and all the kernels will be executed by the time the cudamemcpy is finished.

The cudaMemcpy blocks also all execution of the code also in streams, so they should never be used with stremas, one should rather use cudaMemcpyAsync() with the streams.