I guess you are referring to the sample ‘~/NVIDIA_CUDA-8.0_Samples/6_Advanced/concurrentKernels’.
Please remember to launch each kernel in a different stream to allow concurrent execution.
From the picture you shared in #17, the kernels don’t execute concurrently.
Could you share the profiler’s results for the concurrent version?
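For illustration, here is a minimal sketch of launching kernels into distinct streams so they can overlap. The kernel `work`, the sizes, and the stream count are hypothetical, not taken from the sample:

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // placeholder workload
}

int main() {
    const int n = 1 << 16, kStreams = 3;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // Launching into distinct streams allows the kernels to run
    // concurrently; launching them all into the default stream
    // would serialize them.
    for (int s = 0; s < kStreams; ++s)
        work<<<n / 256, 256, 0, streams[s]>>>(d, n);

    cudaDeviceSynchronize();
    for (int s = 0; s < kStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d);
    return 0;
}
```

Whether the kernels actually overlap still depends on resource availability, which is what the profiler timeline shows.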
I guess you are referring to the sample ‘~/NVIDIA_CUDA-8.0_Samples/6_Advanced/concurrentKernels’.
No, I was referring to the array summation above, ~/NVIDIA_CUDA-6.5_Samples/6_Advanced/reduction/.
Please remember to launch each kernel in a different stream to allow concurrent execution.
From the picture you shared in #17, the kernels don’t execute concurrently.
The screenshot I attached in #17 was with a single execution stream only.
Could you share the profiler’s results for the concurrent version?
Attached. You can see the kernels in streams 13, 14, and 15 now run more or less concurrently, presumably depending on what the scheduler decides to do.
However, the real problem is the same as before: there is a random stall at cudaLaunch. Sometimes it happens at cudaEventRecord or cudaSetupArgument instead.
For example, suppose your device can run at most three threads, and each job requires two threads at a time.

Without sync:
[Time]: [Thread1] [Thread2] [Thread3]
T0:     Job1.A    Job1.B    Job2.A
T1:     Job2.B    Job3.A    Job3.B
…
Execution time of Job1 is T0, but Job2 spans both slots, so its execution time is T0+T1 = 2·T0.

With sync:
[Time]: [Thread1] [Thread2] [Thread3]
T0:     Job1.A    Job1.B    IDLE
T1:     Job2.A    Job2.B    IDLE
…
Execution time of Job1 is T0, and execution time of Job2 is T1 = T0, so each job completes within a single slot.
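The two patterns above can be sketched in code as follows. The kernels `jobA`/`jobB` (standing in for Job1/Job2), the launch configuration, and the stream handles are hypothetical:

```cuda
#include <cuda_runtime.h>

__global__ void jobA(float *d) { /* ... */ }
__global__ void jobB(float *d) { /* ... */ }

// "Without sync": the two jobs are launched into different streams,
// so they may run concurrently and compete for threads, stretching
// each job's wall-clock time across slots.
void withoutSync(float *d, cudaStream_t s1, cudaStream_t s2) {
    jobA<<<2, 256, 0, s1>>>(d);
    jobB<<<2, 256, 0, s2>>>(d);
    cudaDeviceSynchronize();
}

// "With sync": a synchronization point between launches gives each
// job the whole device for one slot, at the cost of idle threads.
void withSync(float *d) {
    jobA<<<2, 256>>>(d);
    cudaDeviceSynchronize();
    jobB<<<2, 256>>>(d);
    cudaDeviceSynchronize();
}
```

Which pattern is faster for a given job depends on whether you care about per-job latency (sync helps) or total throughput (concurrency helps).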
I agree with what you say in general, but in the case which I attached, the jobs are already surrounded by synchronisation code:
checkCuda(cudaEventRecord(startEvent, 0));
load(); // GPU jobs submitted here
checkCuda(cudaEventRecord(stopEvent, 0));
checkCuda(cudaEventSynchronize(stopEvent));
The cudaEventSynchronize call will “wait until the completion of all device work preceding the most recent call to cudaEventRecord()”, as per the documentation.
For this reason, the extra cudaDeviceSynchronize calls (surrounded by the FORCE_SYNC_MEM_TRANSFERS block) only affect the memory transfer operations in the source code which I attached. Do you agree?
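For reference, a self-contained sketch of the standard cudaEvent timing pattern that the snippet above follows; `checkCuda` and `load()` are stand-ins from that snippet, and the setup/teardown here is an assumption, not the attached source:

```cuda
cudaEvent_t startEvent, stopEvent;
checkCuda(cudaEventCreate(&startEvent));
checkCuda(cudaEventCreate(&stopEvent));

checkCuda(cudaEventRecord(startEvent, 0));
load();                                    // GPU jobs submitted here
checkCuda(cudaEventRecord(stopEvent, 0));  // recorded in stream 0
// Blocks the host until all device work preceding the most recent
// record of stopEvent has completed.
checkCuda(cudaEventSynchronize(stopEvent));

float ms = 0.0f;
checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));

checkCuda(cudaEventDestroy(startEvent));
checkCuda(cudaEventDestroy(stopEvent));
```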
cudaDeviceSynchronize makes sure all the CUDA threads have already finished their jobs.
If the extra loop is only for the memory transfers, then the answer to your question should be yes.