streams: are they worth the time? your opinions/experience appreciated

Currently, I’m benchmarking kernels by synchronizing the context after each kernel, and I’m seeing something like 56% of the time being spent in the kernels (I’ve done some optimization; it used to be 80%). I’m not doing anything intensive on the CPU whatsoever. Even incredibly small pieces of work are done on the GPU, for the academic sake of making it [theoretically] scalable to any number of processors (given sufficient input data). I’m using Python’s time.time() function, which uses Linux’s gettimeofday(), so it should be fairly accurate (I also see the same values very consistently).
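For reference, the timing pattern looks roughly like this (a simplified sketch; the kernel, sizes, and names below are placeholders, not my actual coder kernels):

```python
import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Placeholder kernel standing in for one of the real coder kernels.
mod = SourceModule("""
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
""")
scale = mod.get_function("scale")

n = 1 << 20
host = np.random.rand(n).astype(np.float32)
dev = cuda.mem_alloc(host.nbytes)
cuda.memcpy_htod(dev, host)

t0 = time.time()
scale(dev, np.float32(2.0), np.int32(n),
      block=(256, 1, 1), grid=((n + 255) // 256, 1))
pycuda.autoinit.context.synchronize()  # block until the kernel finishes before reading the clock
t1 = time.time()
print("kernel wall time: %.6f s" % (t1 - t0))
```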

If it matters, I’m working on a video coder, mostly to learn CUDA. A numpy array is loaded through numpy.frombuffer(), copied to the GPU, and then everything is computed on the GPU. After that, only a few bytes are transferred back so the host knows how many blocks to create for a few of the kernel launches :).
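In rough outline, the data flow is something like this (the file name, dtype, and count buffer are just placeholders for illustration):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

# Hypothetical frame load: raw bytes -> numpy array without an extra copy.
with open("frame.raw", "rb") as f:          # placeholder file name
    frame = np.frombuffer(f.read(), dtype=np.uint8)

# One host-to-device copy; everything after this stays on the GPU.
d_frame = cuda.mem_alloc(frame.nbytes)
cuda.memcpy_htod(d_frame, frame)

# ... kernels run entirely on the device ...

# Only a few bytes come back, e.g. a count (written by a kernel into d_count)
# that the host uses to size a later launch.
count = np.zeros(1, dtype=np.int32)
d_count = cuda.mem_alloc(count.nbytes)
cuda.memcpy_dtoh(count, d_count)
grid = (int(count[0]), 1)                   # host only needs this to launch the next kernel
```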

Particularly for those with experience: are streams worth it? Will they take out some of the overhead? How can I pare down the vague notion of “overhead” into something better?

I’m probably going to be writing an event system (I’m using pycuda)… if anyone knows of existing code that would help (or already does this), that would be great. I imagine Python’s closures will be a godsend, but I don’t want to write something that’s already been done. Threading in Python hasn’t been a lovely experience [fatal “gc object already tracked” errors, interpreter shutdown problems], and pycuda will free an object whenever its refcount drops, which could happen on a non-device thread. Comments on that are of course still welcome, but I’m probably leaning towards an event system.
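For concreteness, a minimal single-threaded, closure-based dispatcher along these lines is roughly what I have in mind (just a sketch; the event and kernel names are made up). Keeping everything on one thread is also what would sidestep the refcount/free problem:

```python
class EventSystem:
    """Minimal single-threaded event dispatcher (sketch only)."""

    def __init__(self):
        self._handlers = {}

    def on(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    def emit(self, name, *args, **kwargs):
        for handler in self._handlers.get(name, []):
            handler(*args, **kwargs)


events = EventSystem()

def make_logger(tag):
    # Closure capturing `tag`, handy for wiring up per-kernel callbacks.
    def log(elapsed_ms):
        print("%s finished in %.3f ms" % (tag, elapsed_ms))
    return log

events.on("kernel_done", make_logger("dct_kernel"))  # names are placeholders
events.emit("kernel_done", 1.234)
```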

Thanks so much!

Anyone? Sorry, I know bumps are annoying, but this wasn’t supposed to be a difficult question. Thanks in advance.

Also, I did measure times with the CUDA profiler; they appear similar to my timers (the profiler’s CPU time was 0.96 of my measurement; its GPU time was 0.91 of its CPU time). Memcopies took a total of 0.03 of GPU time.

I have tested streams in a few implementations, and have found that they are very useful depending on the ratio between the size of the memory transfers and the runtime of the kernel. You should just try it and see what kind of result you get; it isn’t very difficult to implement.
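For anyone reading this later, the basic overlap pattern in PyCUDA looks roughly like the sketch below (the kernel and chunk count are placeholders; note that async copies need page-locked host memory):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Placeholder kernel; the real work would be the coder's kernels.
mod = SourceModule("""
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
""")
scale = mod.get_function("scale")

n, chunks = 1 << 20, 4
chunk = n // chunks

# Async copies require page-locked (pinned) host memory.
host = cuda.pagelocked_empty(n, dtype=np.float32)
host[:] = np.random.rand(n)

streams = [cuda.Stream() for _ in range(chunks)]
dev_bufs = [cuda.mem_alloc(chunk * host.itemsize) for _ in range(chunks)]

for i in range(chunks):
    s = streams[i]
    lo, hi = i * chunk, (i + 1) * chunk
    # Copy and kernel are queued in the same stream; copies in one stream
    # can overlap kernels running in another.
    cuda.memcpy_htod_async(dev_bufs[i], host[lo:hi], stream=s)
    scale(dev_bufs[i], np.float32(2.0), np.int32(chunk),
          block=(256, 1, 1), grid=((chunk + 255) // 256, 1), stream=s)

pycuda.autoinit.context.synchronize()
```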

I’m curious why you find that 56% of the time in kernels is better than 80%? Presumably, kernels are faster than running on the host and also prevent the need to memcpy data back and forth. So shouldn’t 99% of the time in kernels be more optimal, leaving the host to just be the conductor?

Or maybe I misunderstood what you mean by overhead? Are you pulling results from a CUDA_PROFILE run and seeing overhead in the CPU vs. GPU time?

I think you’re right, but what I think he meant was that after optimizing his kernels, they were faster while the CPU overhead remained the same, thus reducing the proportion of time spent running the kernels. By using streams, he may be able to overlap execution of the kernels with the CPU overhead, and thus approach the 99% utilization mark.

Sorry for the latency in replying; yes, 56% is better because the kernels are taking less time (I cut my most expensive one to [less than] half with a complete ~500-line rewrite).

No, I’m not seeing much discrepancy between the CUDA profiler’s GPU time and CPU time (GPU time = 0.91 × CPU time); there seems to be some overhead in the Python code, which is quite odd since I’m not doing anything compute-intensive there. I should probably try running with real-time priority as well. I’m launching an average of around 300 kernels per second.

As mentioned in my reply to Smokey’s post, shutting down X11 didn’t help. It might be a few days before I can give streams a try, but I’ll post any results here.

Thanks for your input,
Nicholas

Yeah, that would seem to indicate that your major bottleneck is now how fast your Python code can submit kernels for launch. All your other overheads are small.
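One quick way to confirm that would be to time a burst of empty-kernel launches and see how much per-launch time the Python side adds. A rough sketch (the no-op kernel is obviously just a placeholder):

```python
import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Empty placeholder kernel: anything measured here is launch overhead.
noop = SourceModule("__global__ void noop() {}").get_function("noop")

launches = 1000
t0 = time.time()
for _ in range(launches):
    noop(block=(1, 1, 1), grid=(1, 1))
pycuda.autoinit.context.synchronize()
t1 = time.time()

print("per-launch overhead: %.1f us" % ((t1 - t0) / launches * 1e6))
```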