There is no implicit barrier at the end of a kernel function or in a cudaMemcpy(). Instead, kernels, cudaMemcpy()s and cudaMemcpyAsync()s in the same stream are executed sequentially with respect to each other, but (with the exception of cudaMemcpy()) asynchronously to the host code.
So you cannot assume all kernels have finished at the end of a host function (which, by the way, has no special meaning to CUDA); you need to explicitly put a cudaStreamSynchronize() or cudaDeviceSynchronize() there.
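A minimal sketch of that pattern (the kernel name myKernel is made up for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(int *d_out) { *d_out = 42; }

void hostFunction(int *d_out)
{
    // The launch returns immediately; the kernel runs asynchronously.
    myKernel<<<1, 1>>>(d_out);

    // Without this, the host code continues before the kernel has finished.
    cudaDeviceSynchronize();
}
```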
Yes, that looks like a good barrier assuming you run CUDA 4.1, each device is handled by one OpenMP thread, and you want to sync all devices and all threads. If you just want to sync the GPUs without caring about the host threads, you can use cudaStreamWaitEvent().
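A sketch of how cudaStreamWaitEvent() chains two streams without blocking the host (kernel and stream names are illustrative, single device assumed):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[0] = 1.0f; }
__global__ void kernelB(float *d) { d[0] += 1.0f; }

int main(void)
{
    float *d;
    cudaMalloc(&d, sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    kernelA<<<1, 1, 0, s1>>>(d);
    cudaEventRecord(ev, s1);

    // s2 will not start kernelB until ev has completed on s1;
    // the host thread is never blocked by this call.
    cudaStreamWaitEvent(s2, ev, 0);
    kernelB<<<1, 1, 0, s2>>>(d);

    cudaDeviceSynchronize();
    cudaEventDestroy(ev);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d);
    return 0;
}
```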
I am a little bit confused: if there is no implicit barrier when cudaMemcpy() is executed, it means I would have to call some other sync command to be sure that the data has arrived on the host. But cudaMemcpy() is a blocking call; nothing else in the host code can execute until it is done, so it does act as a barrier.
This means it should work something like this:
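(The original snippet is missing from the post; a minimal sketch of the presumed pattern, with a hypothetical kernel myKernel and one OpenMP thread per device:)

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <stdlib.h>

__global__ void myKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    #pragma omp parallel num_threads(numDevices)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        float *h_data = (float *)calloc(n, sizeof(float));
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

        // The blocking cudaMemcpy() waits for the kernel in the same
        // (default) stream to finish before copying back.
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        // All host threads stop here; every device has finished its work.
        #pragma omp barrier

        cudaFree(d_data);
        free(h_data);
    }
    return 0;
}
```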
and it will have all threads stop at the barrier, with all the kernels completed by the time the cudaMemcpy() has finished.
cudaMemcpy() also blocks all execution of the code, even in streams, so it should never be used together with streams; one should rather use cudaMemcpyAsync() with the streams.
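A sketch of the stream-friendly variant (hypothetical kernel scale; note that cudaMemcpyAsync() only overlaps with pinned host memory):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_data, *d_data;

    // Page-locked (pinned) host memory is required for truly
    // asynchronous transfers.
    cudaMallocHost(&h_data, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy, kernel, and copy-back are all queued on the same stream
    // and execute in order, without blocking the host.
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // The host must synchronize before reading h_data.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```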