Hi all,
I hope this is the right forum for asking this question.
I am comparing two versions of the same application: one in CUDA and the other in OpenACC.
The CUDA one is much faster (4X) than the OpenACC one. Looking at the output of nvprof, I can see that the OpenACC version is being hurt by a large number of (implicit) calls to the cuStreamSynchronize function. Here is a partial output of nvprof:
==23128== Profiling application: svdOpenAcc …/matrices/300by300.dat
==23128== Profiling result:
Time(%) Time Calls Avg Min Max Name
68.51% 15.8233s 1210950 13.066us 10.144us 17.441us preM_7_gpu
18.58% 4.29080s 403650 10.630us 10.304us 11.776us posM_7_gpu
6.61% 1.52668s 403650 3.7820us 3.6480us 7.9680us diagonal_8_gpu
6.30% 1.45464s 403650 3.6030us 3.4880us 7.7120us symmetric_8_gpu
0.01% 1.2695ms 11 115.40us 114.85us 116.03us [CUDA memcpy DtoH]
0.00% 350.28us 3 116.76us 116.00us 117.22us [CUDA memcpy HtoD]
==23128== API calls:
Time(%) Time Calls Avg Min Max Name
81.60% 35.8985s 2421911 14.822us 579ns 567.15ms cuStreamSynchronize
18.17% 7.99463s 2421900 3.3000us 2.7120us 4.0235ms cuLaunchKernel
0.10% 46.055ms 1 46.055ms 46.055ms 46.055ms cuDevicePrimaryCtxRetain
0.08% 33.490ms 1 33.490ms 33.490ms 33.490ms cuDevicePrimaryCtxRelease
0.03% 12.822ms 1 12.822ms 12.822ms 12.822ms cuMemHostAlloc
0.02% 8.2991ms 1 8.2991ms 8.2991ms 8.2991ms cuMemFreeHost
.
.
.
.
If I subtract the 35.9 seconds taken by cuStreamSynchronize, the OpenACC version roughly matches the time taken by the CUDA version. The CUDA version of the application does not call cuStreamSynchronize at all.
My question is: how can I turn off these implicit calls to cuStreamSynchronize? Is that possible?
Thanks in advance for your answers.
Hi efblack2,
The cuStreamSynchronize time is on the host side, where the host is blocking while your code is running on the device. In other words, this time is concurrent with what's running on the GPU and shouldn't be added to the GPU time. Instead, compare the times for each kernel as well as the total wall-clock time.
This is more evident when looking at a timeline in NVVP rather than nvprof.
To remove cuStreamSynchronize, you could use the OpenACC “async” clauses so that the host doesn’t block waiting for the device code to finish.
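For example, a minimal sketch of what this can look like in C (the loop, array names, and queue number here are just placeholders, not from your code):

// Sketch only: "a", "b", "n", and async queue 1 are placeholders.
#pragma acc parallel loop async(1) present(a[0:n], b[0:n])
for (int i = 0; i < n; ++i) {
    b[i] = 2.0f * a[i];
}
// ... launch more work on queue 1 without blocking ...
#pragma acc wait(1)   // synchronize only here, when the host actually needs the results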
Hope this helps,
Mat
Hi Mat,
Thanks. I read about async and used it. It helps, but the OpenACC version is still a lot slower than the CUDA version, by a factor of about 3.
The number and size of the memory transfers are equal in both versions.
Currently I am trying to understand why this is happening.
Thanks again.
Are you able to compare the times of the individual kernels?
Some things to look at:
- Register usage, which impacts occupancy (see the output from -ta=tesla:ptxinfo; register usage can be adjusted with -ta=tesla:maxregcount:)
- Schedule (blocks/threads, which translate to gang/worker/vector)
- Memory access in a kernel (non-contiguous data access across threads in a warp can kill performance; make sure the stride-1 dimension corresponds to the "vector" loop; see the sketch below)
- Constant memory isn't used by PGI OpenACC, so CUDA C has the advantage there.
- Texture memory is used by PGI OpenACC when "-ta=tesla:cc35" is used and the compiler can determine the data is read-only. (In Fortran use "INTENT(IN)", and in C use "restrict", to help.)
- If you are using "routine worker/vector", there's a known performance issue in 16.1 which was addressed in 16.3.
Note when using “async”, make sure that no data, including a reduction variable, is being copied back to the host at the end of the compute region. This will cause the host to block waiting for the copy.
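As a rough sketch of the schedule, stride-1 access, and "restrict" points above (the function and array names are placeholders, not taken from your code):

// The stride-1 (fastest-varying) index should map to the "vector" loop so that
// consecutive threads in a warp touch consecutive memory. "restrict" tells the
// compiler the pointers don't alias, which helps it use read-only/texture loads
// when compiling with -ta=tesla:cc35.
void scale(float *restrict b, const float *restrict a, int n, int m)
{
    #pragma acc parallel loop gang present(a[0:n*m], b[0:n*m])
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < m; ++j) {
            b[i*m + j] = 2.0f * a[i*m + j];   // j is the stride-1 index
        }
    }
}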
Let me know if you need help analyzing the profiles.
Hi Mat,
Thank you for your advice.
Last Friday, in addition to async, I started playing with num_gangs() and vector_length(), and I was able to improve the OpenACC version of my program a bit. However, it is still behind the CUDA version by a factor of about 1.5.
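For reference, this is the kind of tuning I mean (the 1024/256 values and the loop here are just an illustration, not my actual code):

// Illustration only: vector_length(256) is roughly comparable to a CUDA block size of 256.
#pragma acc parallel loop num_gangs(1024) vector_length(256) async(1)
for (int i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
}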
I noticed (using the NVIDIA Server Settings tool) that the GPU utilization is about 65% while the OpenACC version is running, versus about 99% when the CUDA version runs.
I started to look at the register count. The number of registers is consistently larger in the OpenACC version. I am trying to figure out why, and whether I can reduce it.
FYI, I am using pgcc 16.3-0 64-bit target on x86-64 Linux -tp haswell
Thanks again. I will post any progress.
You can try reducing the register count via the “-ta=tesla:maxregcount:” flag.
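For example (the 64 here is just an illustrative value, and the source file name is a placeholder):

pgcc -acc -ta=tesla:cc35,maxregcount:64 -Minfo=accel -o svdOpenAcc svdOpenAcc.c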
I played for a while with the register count. It seems to have no effect on the performance.
One thing I noticed is that the OpenACC version of my program runs a little faster when it is being profiled with nvprof. Has anyone else noticed such behavior?