Synchronization methods?

Hello,

I’m wondering about the performance implications of various synchronization methods. To be precise, say you have two operations that need to be performed repeatedly and in succession, with each one depending on the entire set of results from the previous one, as follows…

void worker(,,,)
{
    int i;
    int n; /* total number of iterations, say 1e6. */

    for (i = 0; i < n; i++) {
        operation1(,,,);
        operation2(,,,);
    }
}

As far as I can understand there are numerous ways to do this using CUDA.

A: Using synchronous kernel launches… [EDIT: Not synchronous, my mistake.]

__global__ void operation1(,,,)
{
    operation1_per_element(,,,);
}

__global__ void operation2(,,,)
{
    operation2_per_element(,,,);
}

void worker(,,,)
{
    int i;
    int n; /* total number of iterations, say 1e6. */

    for (i = 0; i < n; i++) {
        operation1<<<dimGrid, dimBlock>>>(,,,);
        operation2<<<dimGrid, dimBlock>>>(,,,);
    }
}

B: Using async kernel launches. Useful if you need to overlap device-host transfers.

__global__ void operation1(,,,)
{
    operation1_per_element(,,,);
}

__global__ void operation2(,,,)
{
    operation2_per_element(,,,);
}

void worker(,,,)
{
    int i;
    int n; /* total number of iterations, say 1e6. */
    cudaStream_t stream[N_STREAMS]; /* arbitrary number of streams. */

    for (i = 0; i < N_STREAMS; i++)
        cudaStreamCreate(&stream[i]);

    for (i = 0; i < n; i++) {
        operation1<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation2<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
    }
}
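As a hedged illustration of the transfer overlap mentioned for B (a sketch, not part of the original code: the names h_out, d_prev and bytes are hypothetical, and h_out is assumed to be pinned via cudaMallocHost so the copy can actually be asynchronous):

```cuda
/* Sketch only: overlap a device-to-host copy with compute by issuing them
   in different streams. The copy of the previous iteration's results runs
   while operation1/operation2 work on the current data. */
operation1<<<dimGrid, dimBlock, 0, stream[0]>>>(,,,);
cudaMemcpyAsync(h_out, d_prev, bytes, cudaMemcpyDeviceToHost, stream[1]);
operation2<<<dimGrid, dimBlock, 0, stream[0]>>>(,,,);
cudaStreamSynchronize(stream[1]); /* wait for the copy only, not the kernels */
```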

C: As B, but with cudaStreamSynchronize to keep the for loop from running ahead on the CPU.

__global__ void operation1(,,,)
{
    operation1_per_element(,,,);
}

__global__ void operation2(,,,)
{
    operation2_per_element(,,,);
}

void worker(,,,)
{
    int i;
    int n; /* total number of iterations, say 1e6. */
    cudaStream_t stream[N_STREAMS]; /* arbitrary number of streams. */

    for (i = 0; i < N_STREAMS; i++)
        cudaStreamCreate(&stream[i]);

    for (i = 0; i < n; i++) {
        operation1<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation2<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        cudaStreamSynchronize(stream[1]);
    }
}

D: As C, but with fewer cudaStreamSynchronize calls overall…

__global__ void operation1(,,,)
{
    operation1_per_element(,,,);
}

__global__ void operation2(,,,)
{
    operation2_per_element(,,,);
}

void worker(,,,)
{
    int i;
    int n; /* total number of iterations, say 1e6. */
    cudaStream_t stream[N_STREAMS]; /* arbitrary number of streams. */

    for (i = 0; i < N_STREAMS; i++)
        cudaStreamCreate(&stream[i]);

    for (i = 0; i < (n/4); i++) {
        operation1<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation2<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation1<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation2<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation1<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation2<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation1<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        operation2<<<dimGrid, dimBlock, 0, stream[1]>>>(,,,);
        cudaStreamSynchronize(stream[1]);
    }
}

E: Looping within the kernel instead: some sort of monolithic kernel using __threadfence… [EDIT: Doesn’t work. See replies.]

__global__ void monolithic_kernel(int n,,,)
{
    int i;

    for (i = 0; i < n; i++) {
        operation1_per_element(,,,);
        __threadfence();
        operation2_per_element(,,,);
        __threadfence();
    }
}

void worker(,,,)
{
    int n; /* total number of iterations, say 1e6. */

    monolithic_kernel<<<dimGrid, dimBlock>>>(n,,,);
}

NOTE: Obviously all incomplete code…
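For what it’s worth, here is a slightly more complete sketch of option C (single stream, basic error checking; the kernel bodies and arguments are still placeholders, so this is illustrative rather than a working implementation):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void operation1(float *d_data) { /* operation1_per_element(...) */ }
__global__ void operation2(float *d_data) { /* operation2_per_element(...) */ }

void worker(float *d_data, dim3 dimGrid, dim3 dimBlock)
{
    const int n = 1000000; /* total number of iterations */
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < n; i++) {
        operation1<<<dimGrid, dimBlock, 0, stream>>>(d_data);
        operation2<<<dimGrid, dimBlock, 0, stream>>>(d_data);
        cudaStreamSynchronize(stream); /* host blocks until both kernels finish */
    }

    cudaError_t err = cudaGetLastError(); /* pick up any deferred launch errors */
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
}
```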

Section 2.1.1 of “CUDA C Best Practices Guide Version 3.2”

I understand that synchronization points such as these cause the thread scheduler to stall, so it can no longer hide memory-access latency.

MY QUESTION IS THIS:

Does synchronizing to the host (CPU) using cudaStreamSynchronize, cudaThreadSynchronize or synchronous launches have any more performance impact than launching a sequence of kernels in the same stream asynchronously? In other words, what is better: B, C or D?

Rephrasing yet again: is it any worse to synchronize to the host AND launch a new kernel asynchronously than to just launch the new kernel without syncing to the host?

Looking forward to hearing your responses…


hey, your E implementation may well be buggy… it depends on the kind of data dependence between operation1 and operation2.

__threadfence() does not imply a global synchronization of the GPU, which by the way may be impossible in certain combinations of grid/block sizes and device model!


There’s no difference between letting the for loop run and the cudaThreadSynchronize()/cudaStreamSynchronize() cases.

Also global sync within the kernel isn’t supported.


I appreciate the response and I’m not disagreeing with you, but can you elaborate further? I have to admit I have yet to use __threadfence, but I don’t understand what I’ve got wrong.

CUDA Programming Guide 3.2 (Appendix B.5)


__threadfence() is a memory barrier, not an execution barrier. It guarantees that once your thread passes that line, all pending memory transactions have been completed and flushed far enough up the memory hierarchy to be visible to other threads. (In the case of compute capability 2.x, that would be up to the L2 cache level.) It does not, however, guarantee that any other thread has also reached the __threadfence() line at the same time. So if you write to a global memory location in block 0 and then call __threadfence(), you can still have a race condition if a thread in block 2 tries to read that same location because that thread could have gotten past the __threadfence() first, before the write from block 0 was finished.

__syncthreads() is an execution barrier, because it forces all threads in a block to stop and wait for the rest to arrive at the same sync point before any thread can progress.

Note that your case “A” is still asynchronous, unless you have the CUDA profiler or debugging turned on. Kernel launches return as soon as they are queued by the driver. If the queue is full (something like 24 launches deep), the host will wait until a spot opens in the queue before returning. Unless your kernels are very short, the launch overhead is usually negligible (a function of the number of blocks, but on the order of tens of microseconds), and it is faster on Fermi GPUs. Unless you have some special need to keep computing on the host while the GPU works, firing off a sequence of kernels and forgetting about them is the least painful way to get a series of global execution barriers.
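In other words, the fix for the broken option E is to move the loop back to the host and let each kernel boundary act as the grid-wide barrier (a sketch; no host sync is needed inside the loop, because kernels issued to the same stream are serialized):

```cuda
/* Sketch: the kernel boundary is the global execution barrier.
   Kernels issued to the same (default) stream run back-to-back in order,
   and all global-memory writes of one kernel are visible to the next. */
for (int i = 0; i < n; i++) {
    operation1<<<dimGrid, dimBlock>>>(,,,);
    operation2<<<dimGrid, dimBlock>>>(,,,);
}
cudaThreadSynchronize(); /* one host sync at the end, when results are needed */
```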


Thanks x10 for your informative reply seibert.

Indeed, I can’t remember what gave me the incorrect idea that kernels launched with the default stream were synchronous. Also, thanks for your explanation of why it is impossible to synchronize global memory access across all threads within a kernel.

I’m still not 100% clear on the original question, though. To complicate matters, I’ll give you a few details about my code. I have 3 streams which do the same operation on unconnected data. Each operation consists of around 6 kernel launches. The reason I have split the domain into 3 streams is to overlap some host-device and device-host transfers with the execution of the 2nd and 3rd streams. The final stage in the algorithm is an operation that requires the calculations from all 3 streams (i.e. the whole domain) to be complete, so I use cudaStreamSynchronize(stream[…]) on 2 streams and launch the remaining kernels using the 3rd stream.
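(As an aside, one possible way to express that last cross-stream dependency on the device rather than on the host is cudaStreamWaitEvent, available from CUDA 3.2. A sketch, with hypothetical names — final_kernel, done0/done1, and zero-based stream indices are all illustrative:)

```cuda
/* Sketch: let stream[2] wait for streams [0] and [1] via events,
   so the host never has to block on cudaStreamSynchronize here. */
cudaEvent_t done0, done1;
cudaEventCreate(&done0);
cudaEventCreate(&done1);

/* ... per-stream kernels launched on stream[0] and stream[1] ... */
cudaEventRecord(done0, stream[0]);
cudaEventRecord(done1, stream[1]);

cudaStreamWaitEvent(stream[2], done0, 0); /* device-side wait, host continues */
cudaStreamWaitEvent(stream[2], done1, 0);
final_kernel<<<dimGrid, dimBlock, 0, stream[2]>>>(,,,);
```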

The algorithm above completes one time step of a CFD code that goes on for millions of time steps. At the end of each time step, I need to use cudaStreamSynchronize(stream[3]) because the last step needs to be complete before calculations for the next time step can begin. Basically, if I remove each of the cudaStreamSynchronize calls, as expected, my code doesn’t produce meaningful results anymore, but it runs around 40% faster!!! My initial thought was that this indicated the code was waiting for the device-host and host-device transfers to complete. However, I got a similar result when I also commented out all the transfers, leading to much confusion on my part. Although, at that stage, significant sections of the code were missing and it was totally non-functional.

I know that I can’t expect to get much useful advice regarding my code if I only give sketchy details like the above and neglect the many intricacies of domain size, arithmetic intensity, memory throughput, etc. That’s why I didn’t want to ask a question about my code but instead tried to make a heavily simplified example.

Basically, I’m wondering whether I should totally redo my algorithm to remove as many cudaStreamSynchronize as possible (hopefully all)?

Can simply calling cudaStreamSynchronize in between kernel launches cause a 40% reduction in speed?

Will [C] be 40% faster than [B] due to calling cudaStreamSynchronize alone?

Does using cudaStreamSynchronize add any extra overhead to the next kernel launch?

Why does NVIDIA say using cudaStreamSynchronize will “imply a stall in the GPU’s processing pipeline”? Won’t each kernel launch do the same?

Thanks again for your help so far…
