cuStreamSynchronize killing the performance

Hi all,

I hope this is the right forum for asking this question.

I am comparing two versions of the same application: one in CUDA and the other in OpenACC.

The CUDA one is much faster (4x) than the OpenACC one. Looking at the nvprof output, I can see that the OpenACC version is being hurt by a large number of implicit calls to cuStreamSynchronize. Here is a partial output of nvprof:

==23128== Profiling application: svdOpenAcc …/matrices/300by300.dat
==23128== Profiling result:
Time(%) Time Calls Avg Min Max Name
68.51% 15.8233s 1210950 13.066us 10.144us 17.441us preM_7_gpu
18.58% 4.29080s 403650 10.630us 10.304us 11.776us posM_7_gpu
6.61% 1.52668s 403650 3.7820us 3.6480us 7.9680us diagonal_8_gpu
6.30% 1.45464s 403650 3.6030us 3.4880us 7.7120us symmetric_8_gpu
0.01% 1.2695ms 11 115.40us 114.85us 116.03us [CUDA memcpy DtoH]
0.00% 350.28us 3 116.76us 116.00us 117.22us [CUDA memcpy HtoD]

==23128== API calls:
Time(%) Time Calls Avg Min Max Name
81.60% 35.8985s 2421911 14.822us 579ns 567.15ms cuStreamSynchronize
18.17% 7.99463s 2421900 3.3000us 2.7120us 4.0235ms cuLaunchKernel
0.10% 46.055ms 1 46.055ms 46.055ms 46.055ms cuDevicePrimaryCtxRetain
0.08% 33.490ms 1 33.490ms 33.490ms 33.490ms cuDevicePrimaryCtxRelease
0.03% 12.822ms 1 12.822ms 12.822ms 12.822ms cuMemHostAlloc
0.02% 8.2991ms 1 8.2991ms 8.2991ms 8.2991ms cuMemFreeHost

If I subtract the 35.9 seconds taken by cuStreamSynchronize, the OpenACC version matches the time taken by the CUDA version. The CUDA version of the application does not call cuStreamSynchronize.

My question is: how can I turn off this implicit call to cuStreamSynchronize? Is this possible?

Thanks in advance for your answers.

Hi efblack2,

The cuStreamSynchronize time is measured on the host side, where the host blocks while your code is running on the device. In other words, this time is concurrent with what’s running on the GPU and shouldn’t be summed with the GPU time. Instead, compare the times for each kernel as well as the total wall clock time.

This is more evident when looking at a timeline in NVVP rather than nvprof.
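For the wall clock comparison, a minimal timing helper could look like the sketch below. The function names are hypothetical, and it assumes a C11 library that provides timespec_get. Take the first reading before the solver starts and the second only after all device work has completed, so that both the CUDA and the OpenACC builds are measured the same way.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, read with C11 timespec_get.
   Hypothetical helper, not part of either application. */
double wall_seconds(void)
{
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);
    assert(base == TIME_UTC);   /* timespec_get returns the base on success */
    (void)base;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Seconds elapsed since an earlier wall_seconds() reading. */
double elapsed_since(double t0)
{
    return wall_seconds() - t0;
}
```

Typical use would be `double t0 = wall_seconds();` before the solver loop and `elapsed_since(t0)` after it.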

To remove cuStreamSynchronize, you could use the OpenACC “async” clauses so that the host doesn’t block waiting for the device code to finish.
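As a rough sketch of that idea, assuming two small hypothetical kernels that work on the same array (the function and its arguments are made up for illustration): both compute regions are queued on the same async queue, and the host blocks only once at the end instead of after every launch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for two small kernels launched back to back.
   Without an OpenACC compiler the pragmas are ignored and the loops
   simply run on the host. */
void scale_then_shift(float *restrict a, int n, float s, float d)
{
    assert(a != NULL && n >= 0);

    /* Keep the array resident on the device across both kernels. */
    #pragma acc data copy(a[0:n])
    {
        /* Queue the first kernel on async queue 1; the host does not block. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] *= s;

        /* Same queue, so this kernel runs after the first one. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] += d;

        /* Block once, before the data region copies the array back. */
        #pragma acc wait(1)
    }
}
```

The single wait before the end of the data region is what replaces the per-kernel synchronization.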

Hope this helps,

Hi Mat,

Thanks. I read about async and used it. It helps, but the OpenACC version is still slower than the CUDA version by a factor of about 3.

The number and size of the memory transfers are equal in both versions.

Currently I am trying to understand why this is happening.

Thanks again.

Are you able to compare the times of the individual kernels?

Some things to look at:

  • Register usage, which impacts occupancy (see the output from -ta=tesla:ptxinfo; it can be adjusted with -ta=tesla:maxregcount:)
  • Schedule (blocks/threads, which translate to gang/worker/vector)
  • Memory access in a kernel (non-contiguous data access across the threads in a warp can kill performance; make sure the stride-1 dimension corresponds to the “vector” loop)
  • Constant memory isn’t used in PGI OpenACC, so CUDA C has the advantage there.
  • Texture memory is used by PGI OpenACC when “-ta=tesla:cc35” is used and the compiler can determine that the data is read-only. (In Fortran use “INTENT(IN)”, and in C use “restrict”, to help.)
  • If you are using “routine worker/vector”, there’s a known performance issue in 16.1 which was addressed in 16.3.
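To illustrate the schedule and memory access points, here is a small sketch on a hypothetical row-major matrix routine (not taken from your code). The outer loop over rows is mapped to gangs, the inner stride-1 loop over columns is mapped to the vector dimension, and the input pointers are declared const and restrict so the compiler can see that they are read-only.

```c
#include <assert.h>

/* Hypothetical example: scale every row of a row-major matrix.
   Without an OpenACC compiler the pragmas are ignored. */
void scale_rows(int rows, int cols,
                const float *restrict in,      /* read-only, row-major */
                const float *restrict factor,  /* one factor per row */
                float *restrict out)
{
    assert(rows >= 0 && cols >= 0);

    #pragma acc parallel loop gang \
        copyin(in[0:rows*cols], factor[0:rows]) copyout(out[0:rows*cols])
    for (int i = 0; i < rows; ++i) {
        /* j is the stride-1 index of a row-major array, so neighbouring
           threads in the vector loop touch neighbouring addresses. */
        #pragma acc loop vector
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = factor[i] * in[i * cols + j];
    }
}
```

Swapping the two loop directives would make adjacent threads access elements cols apart, which is the non-contiguous pattern described in the list.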

Note when using “async”, make sure that no data, including a reduction variable, is being copied back to the host at the end of the compute region. This will cause the host to block waiting for the copy.

Let me know if you need help analyzing the profiles.

- Mat

Hi Mat,

Thank you for your advice.

Last Friday, in addition to async, I started playing with num_gangs() and vector_length(), and I was able to improve the OpenACC version of my program a bit. However, it is still slower than the CUDA version by a factor of about 1.5.

I noticed (using NVIDIA X Server Settings) that the GPU utilization is about 65% while the OpenACC version is running, compared with 99% while the CUDA version is running.

I started to look at the register count. The number of registers is consistently larger in the OpenACC version. I am trying to figure out why, and whether I can reduce it.
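For anyone following along, the kind of change I mean looks roughly like the sketch below. The loop is a made-up saxpy rather than one of my kernels, and 256 gangs with a vector length of 128 are placeholder values, not the ones I ended up with.

```c
#include <assert.h>

/* Hypothetical loop used only to show where the tuning clauses go.
   Without an OpenACC compiler the pragma is ignored. */
void saxpy_tuned(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);

    /* num_gangs and vector_length override the compiler's default schedule;
       the values here are placeholders to be tuned per kernel. */
    #pragma acc parallel loop gang vector num_gangs(256) vector_length(128) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```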

FYI, I am using pgcc 16.3-0 64-bit target on x86-64 Linux -tp haswell

Thanks again. I will post any progress.

You can try reducing the register count via the “-ta=tesla:maxregcount:” flag.

- Mat

I played with the register count for a while. It seems to have no effect on performance.

One thing I noticed is that the OpenACC version of my program runs a little faster when it is being profiled with nvprof. Has anyone else noticed this behavior?