Multiple CPU threads with multiple cudaStreams

I have a program that runs up to 6 CPU threads concurrently, up to several thousand times over, as quickly as possible. Each CPU thread is given a unique cudaStream_t handle so that CUDA can accept data, run kernels and return results for it. Each cudaStream_t works completely independently of the other streams (there is NO GPU-side synchronization attempted whatsoever). As far as the cudaStreams are concerned, each one is working independently of any other stream.

From the CPU side, everything works great. Each CPU thread is definitely working with a unique stream and the results are correct. The problem is that overall performance seems to be lacking.

When viewed in nSight through Visual Studio, there appear to be huge gaps between kernel invocations. Each kernel does seem to utilize the GPU well while it runs, but the kernels are active for so little of the time that overall utilization is very disappointing. What is worse is that the CPU/GPU data transfers seem to take VERY LITTLE time from the GPU’s perspective, but a VERY LONG time on the CPU side.

I always use the async interfaces and specify the stream whenever a CUDA call accepts one (I never use the default stream, 0).
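Roughly, each CPU thread does something like this (simplified sketch, not the real code; MyKernel and the buffer names are placeholders):

#include <cuda_runtime.h>

__global__ void MyKernel(const float* in, float* out, size_t n);   // placeholder kernel

// One of these runs per CPU thread; every CUDA call names that thread's own stream.
void WorkerThread(const float* h_in, float* h_out, float* d_in, float* d_out, size_t n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);                            // unique stream for this thread

    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);      // async H2D on this stream
    MyKernel<<<256, 256, 0, stream>>>(d_in, d_out, n);    // kernel on the same stream
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);      // async D2H on this stream

    cudaStreamSynchronize(stream);                        // or poll an event recorded on the stream
    cudaStreamDestroy(stream);
}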

The Nsight reports show up to 6 concurrent CPU-side API calls, but NEVER show overlapping kernel activity (apparently this isn’t possible) and never show overlapping GPU IO.
The real problem is that from the CPU’s perspective there are huge stalls (up to 500us) while transferring data to/from the GPU, yet from the GPU’s perspective the IO is almost imperceptible (barely a blip on the chart). What is worse, there are huge gaps between kernel invocations (around 500us) that can’t be accounted for.

Finally, although I have painstakingly made sure that unique streams are taking commands from the CPU, my Nsight reports only ever show one stream (Stream 0). What concerns me (and seems likely from the actual behaviour) is that the multiple streams are being serialized into Stream 0, and all that effort to get concurrent separate cudaStreams is pointless.

Any suggestions?

“I always use async interfaces”

I assume with pinned memory, at least for the D->H transfers?
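(The reason for asking: cudaMemcpyAsync is only asynchronous with respect to the host when the host buffer is page-locked; with ordinary pageable memory the call can block the issuing CPU thread. Allocation would look roughly like this - the names and size are just an example:)

const size_t n = 1 << 20;                         // example element count
float* h_in  = nullptr;
float* h_out = nullptr;

// page-locked (pinned) host buffers - needed for cudaMemcpyAsync to return
// to the CPU thread immediately instead of staging through pageable memory
cudaMallocHost((void**)&h_in,  n * sizeof(float));
cudaMallocHost((void**)&h_out, n * sizeof(float));

// ... issue cudaMemcpyAsync(..., stream) against h_in / h_out here ...

cudaFreeHost(h_in);
cudaFreeHost(h_out);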

how do your cpu threads synchronize on the device work they issue?

You should read through this, since it’ll explain better than I can.

But the order in which commands are issued to the GPU matters even when they are in separate streams, and when multiple threads are issuing commands to one GPU you often can’t control that ordering well enough.

It’s a lot less of a problem if your card has HyperQ (which came out after the previous pdf).
http://docs.nvidia.com/cuda/samples/6_Advanced/simpleHyperQ/doc/HyperQ.pdf
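Roughly what the breadth-first issue order from that doc looks like (sketch only; numStreams, streams, d_in/d_out, MyKernel etc. are placeholders). On pre-HyperQ cards all streams feed a single hardware work queue, so issuing depth-first (one stream’s whole copy/kernel/copy pipeline at a time) can create false dependencies between otherwise independent streams:

// breadth-first issue order: issue the same stage across all streams before
// moving on, instead of issuing one stream's whole pipeline at a time
for (int i = 0; i < numStreams; ++i)
    cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
for (int i = 0; i < numStreams; ++i)
    MyKernel<<<grid, block, 0, streams[i]>>>(d_in[i], d_out[i], n);
for (int i = 0; i < numStreams; ++i)
    cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, streams[i]);

With six independent CPU threads each issuing their own sequence, you can’t guarantee this interleaving, which is the point above.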

@little_jimmy - Yes, pinned transfers. The only synchronization is for the CPU thread to check whether an event has completed and relinquish its (CPU) timeslice if not.

cudaEvent_t CopyDataBackCompleteEvent = GetCopyEvent(...);
// Poll until the copy-back event has completed; cudaEventQuery returns
// cudaErrorNotReady while work queued before the event is still pending.
while( cudaEventQuery( CopyDataBackCompleteEvent ) == cudaErrorNotReady )
{
	SwitchToThread();	// yield this CPU thread's timeslice and try again
}
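A possible alternative to the polling loop, sketched here (not what the code currently does), is to create the event with the blocking-sync flag so the wait puts the CPU thread to sleep instead of spinning:

cudaEvent_t copyDone;
// blocking-sync event: cudaEventSynchronize sleeps this CPU thread instead of
// busy-waiting; timing is disabled since the event is only used as a marker
cudaEventCreateWithFlags(&copyDone, cudaEventBlockingSync | cudaEventDisableTiming);

// ... cudaEventRecord(copyDone, stream) right after the D->H copy ...

cudaEventSynchronize(copyDone);   // blocks until all work before the event is done
cudaEventDestroy(copyDone);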

@mwilkinson - Thanks this is just what I was looking for.

  1. validation that the async* API calls refer to CPU-side behaviour and do not necessarily manipulate the GPU-side queues.
  2. validation that the Visual Profiler should be showing more than 1 stream in its report (leading me to believe that somehow only 1 is actually being used).
  3. a good justification for a hardware upgrade.

I guess the remaining question is what could possibly cause only 1 stream (Stream 0) to show up in the reports when there are definitely WAY more than 1 stream being used to queue commands from the CPU side.

I believe I have solved my problem with nSight only showing 1 stream (Stream 0).

The application I am running takes a LONG time to start up, so my procedure for profiling it is to:

  1. Launch the app for tracing through the Visual Studio Activity page
  2. In the capture control pane, select cancel to stop profiling
  3. Navigate in the application to the point of interest
  4. In the capture control pane, select start
  5. Execute the kernels at the point of interest
  6. In the capture control pane, select stop

My conjecture is that if nSight isn’t actively capturing when the streams are created, it has no idea that they exist, even though the kernels run on them and execute as expected.

I still think this is a problem because (as in my case) the application may have MUCH else going on, and the behaviour that needs profiling may NOT occur immediately after startup. This means the profiler may be running for many seconds or minutes before anything interesting actually happens on the GPU, resulting in large, empty charts of mostly useless data.
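One possible workaround for the long empty stretch (this applies to nvprof / the Visual Profiler started with profiling disabled, e.g. --profile-from-start off; I have not confirmed how nSight’s capture control interacts with it) is to bracket the point of interest in code so the profiler only collects data there. RunKernelsAtPointOfInterest is a placeholder:

#include <cuda_profiler_api.h>

// ... long application startup, stream creation, navigation to the point of interest ...

cudaProfilerStart();              // begin collecting profiler data here
RunKernelsAtPointOfInterest();    // placeholder for the kernels being investigated
cudaProfilerStop();               // stop collecting for the remainder of the run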