8x GPU app profiles parallel GPU kernel execution in NVVP, but kernels execute serially from the command line

I have developed an application that uses pthreads to asynchronously launch the same 2 kernels on 8 Volta GPUs simultaneously, built with the CUDA 11.0 toolkit. It compiles and runs on both Windows 10 and Ubuntu 18.04 with correct numerical results. Every pair of kernel launches uses a separate stream with appropriate cudaStreamSynchronize calls.

When run from the command line, the 8-GPU configuration is actually slightly slower than a single-GPU run. That slowdown is apparently due to serial execution of the 8 GPU kernels: nvprof shows those kernels taking the same amount of time with 8 GPUs as with one. When I profile it in Windows with NVVP, however, it runs about 4x faster (by CUDA clock) and the profile shows the 8 GPU kernels executing simultaneously, as designed.

Can anyone point out why this command-line slowdown is happening, or how to get the same parallel-execution performance from a command-line launch?

What is different about executing in NVVP wrt serial vs parallel GPU execution?

This is the code used to launch kernels on multiple GPUs:

	for (int ndevice = 0; ndevice < nDevices; ndevice++) {
		if (pthread_create(&threads[ndevice], NULL, kernelLaunchThread, params[ndevice])) {
			fprintf(stderr, "Error creating thread %d\n", ndevice);
			return ndevice + 1;
		}
	}
	// wait for all GPU devices to finish pulse npulse
	for (int ndevice = 0; ndevice < nDevices; ndevice++) {
		// wait for last kernel thread
		if (pthread_join(threads[ndevice], NULL)) {
			fprintf(stderr, "Error joining thread %d\n", ndevice);
			return ndevice + 1;
		}
		int strm = npulse * nDevices + ndevice; // stream index for this pulse/device
	}
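For context, here is a minimal sketch of what each kernelLaunchThread does conceptually. The KernelParams fields and the kernel names (myKernel1, myKernel2) are assumptions for illustration, not the actual code:

```cuda
// Hypothetical per-device launch thread; KernelParams fields and the
// kernel names are assumptions, not the original code.
struct KernelParams {
    int          device;   // GPU index this thread drives
    cudaStream_t stream;   // stream previously created on that device
    float       *d_data;   // device buffer for this GPU
    int          n;        // element count
};

void *kernelLaunchThread(void *arg)
{
    KernelParams *p = (KernelParams *)arg;

    // The current device is per-host-thread state: this call binds only
    // this thread to the device and does not affect other threads.
    cudaSetDevice(p->device);

    dim3 block(256);
    dim3 grid((p->n + block.x - 1) / block.x);

    myKernel1<<<grid, block, 0, p->stream>>>(p->d_data, p->n);
    myKernel2<<<grid, block, 0, p->stream>>>(p->d_data, p->n);

    // Blocks only this host thread until this device's stream drains.
    cudaStreamSynchronize(p->stream);
    return NULL;
}
```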

@Greg @Robert_Crovella @njuffa

Any idea why an app’s multi-GPU parallelism would not execute as shown by NVVP when launched from a cmd line?

It has occurred to me that one CPU thread setting the CUDA device may be blocking the other CPU threads until the first thread is finished, and that NVVP somehow prevents such cross-thread blocking, or at least does not profile it correctly.

Because the numerical output is accurate, I know that multiple CPU threads switching the GPU device does not corrupt the processing. I just wonder if the system has to block other threads' execution to make that work correctly.

Will setting the device before launching a thread be picked up by the thread for its lifetime? Or does setting a different device in the main thread affect all running threads?

Is there a way for different CPU threads to set different CUDA devices without affecting each other?
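As I understand the runtime API, the current device is per-host-thread state: cudaSetDevice in one thread does not change the current device of any other thread, and a newly created thread starts on the default device (device 0) rather than inheriting the parent's setting, so each thread should call cudaSetDevice itself. A small sketch (error handling omitted, assumes at least 2 GPUs) that checks this isolation:

```cuda
// Sketch: verify that cudaSetDevice only affects the calling thread.
// Error handling omitted; assumes at least 2 GPUs are present.
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

void *worker(void *arg)
{
    int want = *(int *)arg;
    cudaSetDevice(want);        // bind this thread to its own GPU

    int got = -1;
    cudaGetDevice(&got);
    printf("thread for device %d sees current device %d\n", want, got);
    return NULL;
}

int main()
{
    int ids[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    // The main thread's current device is unchanged by the workers.
    int dev = -1;
    cudaGetDevice(&dev);
    printf("main thread still on device %d\n", dev);
    return 0;
}
```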

Removing the CPU pthreads and just launching the kernels sequentially (with the appropriate cudaStreamSynchronize calls) profiles in NVVP as serial kernel execution, with total execution time equal to the command-line launch with or without pthreads. NVVP also shows execution in CUDA contexts 0-7.

It appears that NVVP profiling the pthread launched kernels as executing in parallel may be a mirage and not representative of what’s really happening.

Can anyone here point to an example of parallel launch/execution of kernels in a multi-GPU machine?

I put a CUDA clock on each of the kernel launch threads to measure the time each starts and the elapsed time taken by each of the 2 kernels. All of the devices' launch threads start at ~the same time, but the 1st kernel's elapsed time grows progressively longer as the device number (later launches) grows larger.
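For what it's worth, the per-thread timing can also be done with CUDA events recorded on each device's stream, which measure GPU-side execution directly rather than host wall-clock time. A sketch (dev, stream, grid, block, d_data, n, and myKernel1 are placeholder names assumed to be set up as in the launch thread; error handling omitted):

```cuda
// Sketch: time one kernel on its own device with CUDA events.
// All identifiers here are placeholders, not the original code.
cudaSetDevice(dev);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);              // enqueued before the kernel
myKernel1<<<grid, block, 0, stream>>>(d_data, n);
cudaEventRecord(stop, stream);               // enqueued after the kernel
cudaEventSynchronize(stop);                  // wait for GPU to reach stop

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // GPU-side elapsed time
printf("device %d kernel1: %.3f ms\n", dev, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```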

From this observation, it appears that earlier-launched kernels on other devices somehow slow down later-launched kernels, but only when the app is not running under NVVP.