I have a program that runs up to 6 CPU threads concurrently, repeating the work up to several thousand times, as quickly as possible. Each CPU thread is given a unique cudaStream_t handle through which it submits data, launches kernels and retrieves results. Each cudaStream_t works completely independently of the other streams (there is NO GPU-side synchronization attempted whatsoever); as far as each stream is concerned, it is working in isolation from every other stream.
From the CPU side, everything works great: each CPU thread is definitely working with its own unique stream, and the results are correct. The problem is that the overall performance seems to be lacking.
When viewed in Nsight through Visual Studio, there are huge gaps between kernel invocations. While each kernel does seem to utilize the GPU well while it is running, the fraction of time any kernel is active is very small, so the overall utilization is very disappointing. What is worse, the CPU/GPU data transfers seem to take VERY LITTLE time from the GPU's perspective, but a VERY LONG time on the CPU side.
I always use the async interfaces and explicitly pass the stream to CUDA when available (I never use the default stream, 0).
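For reference, each worker thread follows roughly this pattern (a simplified sketch, not my actual code; myKernel, worker and the buffer names are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(const float* in, float* out, int n) {
    // Placeholder kernel body: copies input to output.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Per-CPU-thread work loop: every CUDA call is given this thread's own
// stream, and nothing ever touches the default stream (0).
void worker(cudaStream_t stream, float* hIn, float* hOut,
            float* dIn, float* dOut, int n) {
    cudaMemcpyAsync(dIn, hIn, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(dIn, dOut, n);
    cudaMemcpyAsync(hOut, dOut, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // waits on this stream only
}
```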
My nvreports show up to 6 concurrent CPU-side API calls, but they NEVER show overlapping kernel activity (apparently this isn't possible) and never show overlapping GPU I/O.
The real problem is that, from the CPU's perspective, there are huge stalls (of up to 500 µs) transferring data to/from the GPU, yet from the GPU's perspective the I/O is almost imperceptible (barely a blip on the chart). Worse still, there are huge gaps between kernel invocations (around 500 µs) that can't be accounted for.
Finally, although I have painstakingly made sure that each CPU thread drives its own unique stream, my nvreports only ever show one stream (Stream 0). My concern, which seems likely given the actual behaviour, is that the multiple streams are being serialized into Stream 0, and all the effort to get concurrent, separate cudaStreams is pointless.
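For completeness, the stream setup is essentially this (simplified; NUM_THREADS and the printf are illustrative, not my exact code). Printing the handles confirms that each thread really does hold a distinct cudaStream_t:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int NUM_THREADS = 6;          // one stream per CPU worker thread
    cudaStream_t streams[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; ++i) {
        // cudaStreamCreate returns a distinct, non-default stream handle.
        cudaStreamCreate(&streams[i]);
        printf("thread %d -> stream handle %p\n", i, (void*)streams[i]);
    }

    // ... hand streams[i] to worker thread i, join threads ...

    for (int i = 0; i < NUM_THREADS; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```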