How lightweight are cudaStream_t's?

I’m working on a thin wrapper for chunks of memory on the device (assume 100+ chunks are allocated at any one point). I want to avoid synchronizing the whole GPU and instead give each chunk of memory its own cudaStream_t (inside the class wrapper)… such that for the calculations below

a = b + c;
d = e + f;

we might have something like

CUDA(cudaStreamWaitEvent(stream_a,event_b)); //waits for b to finish computing
CUDA(cudaStreamWaitEvent(stream_a,event_c)); //waits for c to finish computing
a = addvectors(b,c,stream_a);

CUDA(cudaStreamWaitEvent(stream_d,event_e)); //waits for e to finish computing
CUDA(cudaStreamWaitEvent(stream_d,event_f)); //waits for f to finish computing
d = addvectors(e,f,stream_d);
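
For context, here is roughly what I have in mind for the wrapper’s add path (a simplified sketch only: Chunk, add_kernel and the ready event are my own names, CUDA() is my usual error-check macro, and cleanup is omitted):

#include <cuda_runtime.h>

// Simplified sketch of the wrapper (hypothetical names, cleanup omitted).
// Each chunk owns a device buffer, its own stream, and a "data ready" event.
struct Chunk {
    float*       d = nullptr;
    size_t       n = 0;
    cudaStream_t stream;
    cudaEvent_t  ready;   // recorded when the last kernel writing this chunk is enqueued
    explicit Chunk(size_t count) : n(count) {
        CUDA(cudaMalloc(&d, n * sizeof(float)));
        CUDA(cudaStreamCreate(&stream));
        CUDA(cudaEventCreateWithFlags(&ready, cudaEventDisableTiming)); // no timing -> cheaper
    }
};

__global__ void add_kernel(float* out, const float* a, const float* b, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// "a = b + c" without stalling the CPU or the rest of the GPU: the result's stream
// waits (on the device) for both inputs' events, then the kernel and a new event
// are enqueued so later operations can depend on the result in the same way.
Chunk addvectors(const Chunk& b, const Chunk& c) {
    Chunk a(b.n);
    CUDA(cudaStreamWaitEvent(a.stream, b.ready, 0));  // waits for b to finish computing
    CUDA(cudaStreamWaitEvent(a.stream, c.ready, 0));  // waits for c to finish computing
    int threads = 256;
    int blocks  = (int)((b.n + threads - 1) / threads);
    add_kernel<<<blocks, threads, 0, a.stream>>>(a.d, b.d, c.d, b.n);
    CUDA(cudaEventRecord(a.ready, a.stream));         // marks "a is ready"
    return a;
}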

While that sounds good in theory, I wasn’t sure how much of a drain cudaStream_ts are on the CPU or the GPU. Does anyone have any input on this? Is it feasible to create 100+ streams on a given GPU and work among them? The code would work fine reusing the same 100 streams, but I’m not sure I could get the number of streams down to, say, 10 or 20.

On a similar topic, I was also wondering how many unexecuted kernels, or unmet events, can queue up on a stream before the CPU starts blocking execution, though this is by far secondary to the first point.

I would suggest setting up some simple experiments to find the performance impact on your particular platform. Generally speaking, CPU-side overhead in the CUDA software stack will be a function of single-thread execution speed and system memory throughput. This favors CPUs with high baseline clock speeds and DDR4 memory subsystems with many channels (and high throughput per channel). I am not aware of which particular GPU-side (or PCIe) mechanisms might be the ultimate limiters of performance for stream signalling, but based on the minimum kernel launch time of about 5 usec it seems reasonable to assume that any such overheads on that side are of the same order of magnitude.
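
For example, a bare-bones version of such an experiment, measuring only the CPU-side cost of creating streams and of enqueueing event waits, might look like this (the counts are arbitrary; only the per-call times measured on your platform matter):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Times the CPU-side cost of cudaStreamCreate and cudaStreamWaitEvent calls.
// All results are platform-dependent.
int main() {
    const int N = 128;                        // arbitrary; roughly the 100+ chunks mentioned above
    std::vector<cudaStream_t> streams(N);
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);   // also initializes the context
    cudaEventRecord(ev, 0);                   // record on the default stream

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) cudaStreamCreate(&streams[i]);
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) cudaStreamWaitEvent(streams[i], ev, 0);
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("stream create: %.1f us/call, stream wait event: %.1f us/call\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count() / N,
           std::chrono::duration<double, std::micro>(t2 - t1).count() / N);

    for (int i = 0; i < N; ++i) cudaStreamDestroy(streams[i]);
    cudaEventDestroy(ev);
    return 0;
}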

Note that the command queue for accepting kernel launches is likely limited in the number of bytes rather than in the number of kernel launches. How many kernel launches you can stuff into the queue will therefore likely depend on (for example) how many kernel arguments are being passed down. There might also be differences between operating systems or driver versions; I don’t know one way or the other. My memory of past queue-depth experiments is hazy, but I seem to recall the maximum depth is O(1000). By all means run your own experiments, which will give you an accurate idea for your hardware platform and driver stack.
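
One way to probe the queue depth empirically is a sketch along these lines (the spin duration and the "this launch blocked" threshold are arbitrary, and the observed depth will vary with platform and driver):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void spin(long long cycles) {            // long-running kernel to back up the queue
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

__global__ void empty() { }

// Everything is issued to the default stream, so the empty kernels queue up behind
// the spinning kernel. When a launch suddenly takes much longer than the usual few
// microseconds, the CPU has started blocking because the command queue is full.
int main() {
    spin<<<1, 1>>>(2LL * 1000 * 1000 * 1000);       // occupy the GPU for roughly a second
    for (int i = 1; i <= 100000; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        empty<<<1, 1>>>();
        double us = std::chrono::duration<double, std::micro>(
                        std::chrono::high_resolution_clock::now() - t0).count();
        if (us > 1000.0) {                          // arbitrary "this launch blocked" threshold
            printf("launch %d took %.0f us, queue appears to fill up around there\n", i, us);
            break;
        }
    }
    cudaDeviceSynchronize();
    return 0;
}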

I’ve had projects with 20-30 active streams at a time, for hiding memory transfers and performing parameter updates concurrently. It’s a great tool for avoiding synchronization, BUT there is of course also a nice use case in making sure the GPU has higher utilization with concurrently running kernels :)

As far as I remember, a stream waiting for an event to occur on another stream was not a major performance hit (this may vary a lot with the specific use case (#streams + timing) and the CPU/GPU).

Of course, stalling the CPU thread that is issuing kernels in any way to wait for an event on a stream (cudaStreamSynchronize etc.) is generally very bad for performance (assuming you want the GPU to always be working).
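
In other words, keep the dependency on the device, something like this (a sketch with placeholder kernels; the commented-out cudaStreamSynchronize is the kind of call that would put the issuing thread to sleep):

#include <cuda_runtime.h>

__global__ void produce_b(float* b) { if (threadIdx.x == 0) b[0] = 1.0f; }   // placeholder work
__global__ void consume_b(float* b) { if (threadIdx.x == 0) b[0] += 1.0f; }  // placeholder work

int main() {
    float* b;
    cudaMalloc(&b, sizeof(float));
    cudaStream_t stream_a, stream_b;
    cudaStreamCreate(&stream_a);
    cudaStreamCreate(&stream_b);
    cudaEvent_t event_b;
    cudaEventCreateWithFlags(&event_b, cudaEventDisableTiming);

    produce_b<<<1, 32, 0, stream_b>>>(b);
    cudaEventRecord(event_b, stream_b);        // "b is done" marker in stream_b's queue

    // GPU-side dependency: the CPU thread moves on immediately; stream_a waits on the device.
    cudaStreamWaitEvent(stream_a, event_b, 0);
    consume_b<<<1, 32, 0, stream_a>>>(b);

    // CPU-side stall (what to avoid in the hot path): the issuing thread would sleep here,
    // so nothing further gets enqueued and the GPU can run dry in the meantime.
    // cudaStreamSynchronize(stream_b);

    cudaDeviceSynchronize();
    cudaFree(b);
    cudaStreamDestroy(stream_a);
    cudaStreamDestroy(stream_b);
    cudaEventDestroy(event_b);
    return 0;
}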

Regarding the number of streams, there used to be hardware limitations on the number of concurrently active streams on the GPU (something like 16 or 32), but I believe this limit was increased substantially several years ago. As for inactive streams, the limit is likely in the thousands.

Just out of curiosity, in your experience with CUDA streams, have you used shared memory in the process?
I’m asking because I thought I had a problem that would be a good case for streams, but the implementation uses shared memory, and it would probably require additional global memory copies so that subsequent streams know what the previous one came up with, essentially adding another layer of complication.
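
Just to make concrete what I mean (a rough sketch with made-up kernels): shared memory only lives within one thread block for the duration of one kernel launch, so handing a result to a kernel running in another stream means writing it out to global memory first.

#include <cuda_runtime.h>

// A result computed in shared memory has to be staged through global memory before
// a kernel in another stream can pick it up.
__global__ void stage1(const float* in, float* staged, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] * 2.0f : 0.0f;   // placeholder work in shared memory
    __syncthreads();
    if (i < n) staged[i] = tile[threadIdx.x];            // the extra global-memory copy
}

__global__ void stage2(const float* staged, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = staged[i] + 1.0f;                // consumes the staged result
}

int main() {
    const int n = 1 << 20;
    float *in, *staged, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&staged, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t staged_ready;
    cudaEventCreateWithFlags(&staged_ready, cudaEventDisableTiming);

    stage1<<<n / 256, 256, 0, s1>>>(in, staged, n);
    cudaEventRecord(staged_ready, s1);
    cudaStreamWaitEvent(s2, staged_ready, 0);            // second stream waits on the GPU side
    stage2<<<n / 256, 256, 0, s2>>>(staged, out, n);

    cudaDeviceSynchronize();                             // cleanup omitted for brevity
    return 0;
}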
I will probably not reinvent the wheel at the moment and just stick to what is tested and proven.

Addition with streams time taken: 0.114220
Addition without streams time taken: 0.100948

Just to fill everyone in - I did some benchmarking and did not notice -any- improvement in runtimes (at least for my use case). I did have kernels that ran for substantial periods of time at 100% occupancy, but I had hoped that when one kernel started to ‘wind down’, the next kernel could kick in and absorb some of the occupancy.

There’s probably too much of a latency cost in the API calls for it to be of much use.

@sBc-Random:

I’m guessing if you run nvprof you won’t see any overlap occurring?

I wonder what happens if:

(1) you artificially increase the amount of work per data element (e.g. increase FLOPS/byte)
(2) you decrease the problem size for each kernel: would overlap eventually occur?

Kernel overlapping has always been kind of fickle.
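
For instance, something along these lines exposes both knobs (a rough sketch; N_PER_KERNEL and WORK are arbitrary starting points, and the interesting part is the stream timeline in nvprof / Nsight Systems rather than the program output):

#include <cuda_runtime.h>

#define N_PER_KERNEL (64 * 1024)   // knob (2): problem size per kernel; shrink it to leave SMs idle
#define WORK         256           // knob (1): artificial arithmetic per element (FLOPS/byte)

__global__ void busy(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];
    for (int k = 0; k < WORK; ++k) x = x * 1.0000001f + 0.5f;   // padding work
    data[i] = x;
}

int main() {
    const int num_streams = 8;
    float* buf[num_streams];
    cudaStream_t s[num_streams];
    for (int i = 0; i < num_streams; ++i) {
        cudaMalloc(&buf[i], N_PER_KERNEL * sizeof(float));
        cudaMemset(buf[i], 0, N_PER_KERNEL * sizeof(float));
        cudaStreamCreate(&s[i]);
    }
    // launch one kernel per stream back to back; check the profiler timeline for overlap
    for (int i = 0; i < num_streams; ++i)
        busy<<<(N_PER_KERNEL + 255) / 256, 256, 0, s[i]>>>(buf[i], N_PER_KERNEL);
    cudaDeviceSynchronize();
    return 0;
}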

In a situation like that, overlap between kernels is expected to be minimal, provided there is no substantial load imbalance between thread blocks, i.e. they all take roughly the same time from start to finish. Thanks to “zero-overhead” scheduling, the ramp-up and ramp-down at the start and end of a kernel are very steep. For kernels able to utilize all SMs, if there is 50 usec of ramp time on a kernel with 50 ms run time, the potential overlap is on the order of 0.1% and will be invisible in performance measurement noise.

GPUs are designed as throughput machines, not low-latency machines. As long as the processor is filled with work from a single kernel, there is no need to start a different kernel, as the hardware is fully utilized, and that’s all that is needed to achieve good throughput. Noticeable overlap of kernels requires tiny kernels that by themselves are not able to utilize all SMs. I think there is an app demonstrating that among the sample apps that ship with CUDA. It is a best practice to avoid such tiny kernels.
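
The effect is easy to reproduce with kernels that are deliberately too small to fill the machine, e.g. one thread block each (a sketch; the cycle count is arbitrary and the absolute times depend on the GPU’s clock):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Each kernel uses a single block, so by itself it cannot come close to filling the SMs.
// If kernels from different streams overlap, the total elapsed time stays close to that
// of a single kernel instead of growing linearly with the number of kernels.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int num_streams = 8;
    const long long cycles = 100LL * 1000 * 1000;   // tens of ms, depending on clock rate
    cudaStream_t s[num_streams];
    for (int i = 0; i < num_streams; ++i) cudaStreamCreate(&s[i]);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_streams; ++i)
        spin<<<1, 64, 0, s[i]>>>(cycles);           // one block each: plenty of idle SMs
    cudaDeviceSynchronize();
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::high_resolution_clock::now() - t0).count();
    printf("%d single-block kernels took %.1f ms total\n", num_streams, ms);
    return 0;
}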

NOTE: Performance artifacts of various kinds can occur on Windows platforms using the (default) WDDM driver, due to the high overhead of that driver model and the mitigation techniques (in particular, launch batching) used by CUDA to partially get around that.