I have developed an application that uses pthreads to asynchronously launch the same 2 kernels on 8 Volta GPUs simultaneously, built with the CUDA 11.0 SDK. It compiles and runs on both Windows 10 and Ubuntu 18.04 with correct numerical results. Each pair of kernel launches uses its own stream with appropriate cudaStreamSynchronize calls.
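For reference, here is a minimal sketch of the launch pattern I described, one pthread per GPU, each binding its device and using its own stream. This is a reconstruction for illustration, not my actual code; the kernel names, buffer handling, and launch configuration are placeholders.

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

// Placeholder kernels standing in for the two real kernels.
__global__ void kernelA(float *d) { /* ... */ }
__global__ void kernelB(float *d) { /* ... */ }

struct WorkerArgs { int device; float *d_buf; };

static void *worker(void *p) {
    WorkerArgs *a = (WorkerArgs *)p;
    cudaSetDevice(a->device);            // bind this host thread to one GPU
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 grid(256), block(256);          // illustrative launch configuration
    kernelA<<<grid, block, 0, stream>>>(a->d_buf);
    kernelB<<<grid, block, 0, stream>>>(a->d_buf);

    cudaStreamSynchronize(stream);       // wait only for this GPU's pair
    cudaStreamDestroy(stream);
    return nullptr;
}

int main() {
    const int nGpus = 8;
    pthread_t threads[nGpus];
    WorkerArgs args[nGpus];
    for (int i = 0; i < nGpus; ++i) {
        args[i] = { i, nullptr };        // per-GPU buffer allocation omitted
        pthread_create(&threads[i], nullptr, worker, &args[i]);
    }
    for (int i = 0; i < nGpus; ++i)
        pthread_join(threads[i], nullptr);
    return 0;
}
```

Since each thread synchronizes only on its own stream, the 8 GPUs should, in principle, all be busy at the same time.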
When run from the command line, the 8-GPU configuration is actually slightly slower than a single-GPU run. The slowdown is apparently due to serial execution of the 8 GPU kernels: nvprof shows each kernel taking the same amount of time with 8 GPUs as it does with one. However, when I profile the application in Windows with NVVP, it runs about 4x faster (measured with the CUDA clock), and the profile shows the 8 GPUs' kernels executing simultaneously, as designed.
Can anyone point out why this command-line slowdown is happening, or how to get the same parallel-execution performance from a command-line launch?
What is different about executing under NVVP with respect to serial vs. parallel GPU execution?