Multiple CPU threads with multiple cudaStreams

krazanmp · July 23, 2015, 12:06am

I have a program that runs up to 6 CPU threads concurrently, several thousand times over, as quickly as possible. Each CPU thread is given a unique cudaStream_t handle, which it uses to send data to the GPU, run kernels and return results. Each stream works completely independently of the others (there is NO GPU-side synchronization attempted whatsoever).

From the CPU side, everything works. Each CPU thread is definitely working with a unique stream and the results are correct. The problem is that the overall performance seems to be lacking.

When viewed in Nsight through Visual Studio, there are huge gaps between kernel invocations. Each kernel does seem to utilize the GPU well while it runs, but the total time the kernels are active is very small, so overall utilization is very disappointing. What is worse is that the data transfer between the CPU and GPU seems to take VERY LITTLE time from the GPU’s perspective, but takes a VERY LONG time on the CPU side.

I always use async interfaces and specify the stream to CUDA when available (never use the default stream - 0).
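For readers who want to reproduce the setup described above, here is a minimal sketch of the pattern: one stream per CPU thread, pinned host memory, and only async calls that name the stream explicitly. The kernel `scaleKernel` and the helpers `runWorker` and `runAll` are hypothetical names made up for illustration; they are not the original poster's code.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical kernel standing in for the real per-thread work.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Work done by one CPU thread: private stream, pinned host buffer, device buffer.
// Returns true if every CUDA call succeeded and the data came back scaled.
bool runWorker(int n, float factor)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaStream_t stream = nullptr;
    float* host = nullptr;
    float* dev = nullptr;

    // cudaStreamNonBlocking: no implicit synchronization with the default stream.
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess)
        return false;

    // Pinned (page-locked) host memory is needed for truly asynchronous copies.
    bool ok = cudaMallocHost(reinterpret_cast<void**>(&host), bytes) == cudaSuccess
           && cudaMalloc(reinterpret_cast<void**>(&dev), bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) host[i] = 1.0f;
        // Every operation names this thread's own stream explicitly.
        ok = cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream) == cudaSuccess;
        scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, factor);
        ok = ok && cudaGetLastError() == cudaSuccess;
        ok = ok && cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream) == cudaSuccess;
        // Wait for this stream only, then check the last element.
        ok = ok && cudaStreamSynchronize(stream) == cudaSuccess && host[n - 1] == factor;
    }
    cudaFree(dev);                 // cudaFree on a null pointer is a no-op
    if (host) cudaFreeHost(host);
    cudaStreamDestroy(stream);
    return ok;
}

// Runs numThreads workers concurrently and returns how many succeeded.
int runAll(int numThreads, int n)
{
    std::vector<char> results(numThreads, 0);  // char, not bool: one byte per thread
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back([&results, t, n] {
            results[t] = runWorker(n, 2.0f + t) ? 1 : 0;
        });
    for (auto& th : threads) th.join();
    int succeeded = 0;
    for (char r : results) succeeded += r;
    return succeeded;
}
```

Calling `runAll(6, 1 << 16)` mirrors the 6-thread case. Whether the six streams actually overlap on the GPU depends on the hardware and on the order in which the threads happen to issue their commands, so this sketch only demonstrates the CPU-side structure.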

The nvreports show up to 6 concurrent CPU-side API calls, but NEVER show overlapping kernel activity (apparently this isn’t possible) and never show overlapping GPU IO.
The real problem is that from the CPU’s perspective, there are huge stalls (of up to 500us) transferring data to/from the GPU, but from the GPU’s perspective, the IO is almost imperceptible (barely shows a blip on the chart). What is worse, there are huge gaps between kernel invocations (like 500us), that can’t be accounted for.

Finally, although I have painstakingly made sure there are unique streams taking commands from the CPU, my nvreports always only show one stream (Stream 0). What I am concerned about (and seems likely from the actual behaviour) is that the multiple streams are serialized into Stream 0 and all that effort to get concurrent separate cudaStreams is pointless.

Any suggestions?

little_jimmy · July 23, 2015, 4:57am

“I always use async interfaces”

i assume with pinned memory for at least d->h transfers?

how do your cpu threads synchronize on the device work they issue?

mwilkinson · July 23, 2015, 5:06am

You should read through this, since it’ll explain better than I can.

But the order that commands are issued to the GPU matters even if they are in separate streams, and if you have multiple threads issuing commands to one GPU then you can’t control the ordering well enough in a lot of cases.

It’s a lot less of a problem if your card has HyperQ (which came out after the previous pdf).
http://docs.nvidia.com/cuda/samples/6_Advanced/simpleHyperQ/doc/HyperQ.pdf
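
A quick way to see what a given card can overlap is to query its device properties. The sketch below reads `concurrentKernels` and `asyncEngineCount` from `cudaDeviceProp`. The `likelyHyperQ` flag is an assumption based on compute capability 3.5 or higher, not something the runtime reports directly, and the `OverlapCaps` struct and function names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Summary of the device properties relevant to overlapping work across streams.
struct OverlapCaps {
    bool concurrentKernels;  // device can run kernels from different streams at once
    int  asyncEngineCount;   // number of copy engines (0 means no copy/kernel overlap)
    bool likelyHyperQ;       // ASSUMPTION: inferred from compute capability >= 3.5
};

// Fills `out` for the given device; returns false if the query fails.
bool queryOverlapCaps(int device, OverlapCaps* out)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;
    out->concurrentKernels = prop.concurrentKernels != 0;
    out->asyncEngineCount  = prop.asyncEngineCount;
    out->likelyHyperQ      = prop.major > 3 || (prop.major == 3 && prop.minor >= 5);
    return true;
}

// Prints the summary for one device.
void printOverlapCaps(int device)
{
    OverlapCaps caps;
    if (!queryOverlapCaps(device, &caps)) {
        std::printf("device %d: query failed\n", device);
        return;
    }
    std::printf("device %d: concurrentKernels=%d asyncEngineCount=%d likelyHyperQ=%d\n",
                device, caps.concurrentKernels, caps.asyncEngineCount, caps.likelyHyperQ);
}
```

If `likelyHyperQ` comes back false, the issue-order effects described above are the more probable explanation for streams that never overlap.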

krazanmp · July 23, 2015, 3:58pm

@little_jimmy - Yes pinned transfers. The only synchronization is for the CPU thread to check if an event has completed and relinquish its (CPU) timeslice if not.

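// Event expected to signal once the copy back to the host has completed.
cudaEvent_t CopyDataBackCompleteEvent = GetCopyEvent(...);
// Poll only while the work is still pending (cudaErrorNotReady), so that
// any other error code ends the loop instead of spinning forever.
while( cudaEventQuery( CopyDataBackCompleteEvent ) == cudaErrorNotReady )
{
	SwitchToThread(); // Win32: give up the rest of this CPU timeslice
}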
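
An alternative to polling, sketched here under the assumption that a blocking wait is acceptable for the CPU thread, is an event created with `cudaEventBlockingSync`, which asks `cudaEventSynchronize` to block the calling thread rather than spin. The helper name `waitOnStreamBlocking` is hypothetical, and whether this shortens the CPU-side stalls discussed in this thread is untested.

```cuda
#include <cuda_runtime.h>

// Records a blocking-sync event on `stream` and waits for it.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t waitOnStreamBlocking(cudaStream_t stream)
{
    cudaEvent_t done;
    // cudaEventDisableTiming: the event is used for synchronization only.
    cudaError_t err = cudaEventCreateWithFlags(
        &done, cudaEventBlockingSync | cudaEventDisableTiming);
    if (err != cudaSuccess)
        return err;

    // The event completes once all work queued on `stream` so far has finished.
    err = cudaEventRecord(done, stream);
    if (err == cudaSuccess)
        err = cudaEventSynchronize(done);

    cudaEventDestroy(done);
    return err;
}
```

In a loop that runs thousands of times, the event would more sensibly be created once per thread and reused; it is created inside the function here only to keep the sketch self-contained.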

krazanmp · July 23, 2015, 4:16pm

@mwilkinson - Thanks, this is just what I was looking for:

1. Validation that async* API calls refer to the CPU-side behaviour and not necessarily GPU queue manipulation.
2. Validation that the visual profiler should be showing more than 1 stream in its report (leading me to believe that somehow only 1 is actually being used).
3. A good justification for a hardware upgrade.

I guess the remaining question is why only 1 stream (Stream 0) shows up in the reports when there are definitely WAY more than 1 stream being used to queue commands from the CPU side.

krazanmp · July 23, 2015, 9:45pm

I believe I have solved my problem with Nsight only showing 1 stream (Stream 0).

The application I am running takes a LONG time to start up, so my procedure for profiling it is to:

1. Launch the app for tracing through the Visual Studio Activity page
2. In the capture control pane, select cancel to stop profiling
3. Navigate in the application to the point of interest
4. In the capture control pane, select start
5. Execute the kernels at the point of interest
6. In the capture control pane, select stop

My conjecture is that if Nsight isn’t actively capturing when the streams are created, it has no idea that they exist, even though the kernels still launch and execute as expected.

I still think this is a problem because (as in my case) the application may have a LOT of other stuff going on and the behaviour in need of profiling may NOT be immediately after startup. This means the profiler may be running for many seconds or minutes before anything interesting actually occurs on the GPU, resulting in large, empty charts with mostly useless data.
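
One possible workaround, sketched here as an assumption rather than a confirmed fix, is to mark the region of interest in code with `cudaProfilerStart()` and `cudaProfilerStop()` from `cuda_profiler_api.h`, and to create the streams inside that region so they exist while capture is active. Command-line tools such as nvprof honor these calls when started with `--profile-from-start off`; whether Nsight's capture control pane does the same would need to be verified. The kernel and function names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Hypothetical kernel standing in for the work at the point of interest.
__global__ void pointOfInterestKernel(int* out)
{
    *out = 42;
}

// Brackets the interesting work with profiler start/stop markers.
// Returns 42 on success and -1 if anything failed.
int profileRegion()
{
    int result = -1;
    int* dev = nullptr;
    cudaStream_t stream = nullptr;

    cudaProfilerStart();  // begin capture right before the interesting work

    // The stream is created after capture begins, following the conjecture
    // above that the profiler must be active when streams are created.
    if (cudaStreamCreate(&stream) == cudaSuccess) {
        if (cudaMalloc(reinterpret_cast<void**>(&dev), sizeof(int)) == cudaSuccess) {
            pointOfInterestKernel<<<1, 1, 0, stream>>>(dev);
            cudaMemcpyAsync(&result, dev, sizeof(int), cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);  // result is valid only after this returns
            cudaFree(dev);
        }
        cudaStreamDestroy(stream);
    }

    cudaProfilerStop();   // end capture
    return result;
}
```

This keeps the long startup phase out of the trace without relying on manual start and stop clicks, at the cost of adding profiler calls to the application code.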
