streams: are they worth the time? your opinions/experience appreciated

Currently, I’m benchmarking kernels by synchronizing the context after the kernel, and I am seeing something like 56% of the time being spent on the kernels (I’ve done some optimization; it used to be 80%). I’m not doing anything intensive on the CPU whatsoever. Even incredibly small tasks are done on the GPU for the academic sake of making it [theoretically] scalable to any number of processors (given sufficient input data). I’m using Python’s time.time() function, which uses Linux’s gettimeofday(), so it should be fairly accurate (I also see the same values very consistently).
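For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of the "synchronize after each kernel" approach. The names KernelTimer and kernel_share are made up for illustration, and the synchronize argument is assumed to be something like pycuda.driver.Context.synchronize. It uses time.perf_counter() rather than time.time(), but either clock can be dropped in.

```python
import time
from contextlib import contextmanager


class KernelTimer:
    """Accumulates wall-clock time spent inside synchronized kernel launches.

    `synchronize` is any zero-argument callable that blocks until the GPU is
    idle (assumption: with pycuda, pycuda.driver.Context.synchronize).
    """

    def __init__(self, synchronize):
        self._synchronize = synchronize
        self.kernel_seconds = 0.0

    @contextmanager
    def kernel(self):
        # Drain earlier work so it is not charged to this kernel.
        self._synchronize()
        start = time.perf_counter()
        try:
            yield
        finally:
            # Block until the kernel launched inside the `with` body is done.
            self._synchronize()
            self.kernel_seconds += time.perf_counter() - start


def kernel_share(timer, wall_seconds):
    """Fraction of the total wall-clock time that was spent in kernels."""
    return timer.kernel_seconds / wall_seconds
```

Usage would be to wrap each launch in "with timer.kernel(): ..." and divide by the wall-clock time of the whole run. Whatever is left over (1 minus the share) is the host-side time between launches.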

If it matters, I’m working on a video coder, mostly to learn CUDA. A numpy array is loaded through the numpy.frombuffer() function, copied to the GPU, and then everything is computed on the GPU. After that, only a few bytes are transferred so the host can know how many blocks to create for a few of the function calls :).

Particularly for those with experience: are streams worth it? Will they take out some of the overhead? How can I pare down the vague notion of “overhead” into something better?

I’m probably going to be writing an event system (I’m using pycuda)… if anyone knows of existing code that would help (or do it), that would be great. I imagine the closures in Python will be a godsend, but I don’t want to write something that’s already done. Threading in Python hasn’t been a lovely experience [fatal gc object already tracked errors, interpreter shutdown problems] (and pycuda will free something whenever the refcount drops, which could happen on a non-device thread); comments on that are of course still welcome but I’m probably leaning towards an event system.
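To make the event-system idea concrete, here is a rough single-threaded sketch, not existing pycuda code. CompletionQueue is a hypothetical name; the only thing it assumes about an event is a query() method that returns True once the GPU has passed it, which is how pycuda.driver.Event.query() is documented to behave. Callbacks are plain closures and everything runs on the thread that owns the context, which sidesteps the refcount-on-another-thread problem.

```python
import time
from collections import deque


class CompletionQueue:
    """Runs a callback once its GPU event has completed, all on one thread."""

    def __init__(self):
        self._pending = deque()

    def watch(self, event, callback):
        # `event` only needs a query() method that returns True when done.
        self._pending.append((event, callback))

    def poll(self):
        """Fire callbacks for finished events; return how many fired."""
        fired = 0
        # Visit each entry that was pending at the start of this pass once.
        for _ in range(len(self._pending)):
            event, callback = self._pending.popleft()
            if event.query():
                callback()
                fired += 1
            else:
                # Not finished yet: put it back for the next pass.
                self._pending.append((event, callback))
        return fired

    def run(self, idle_sleep=0.0005):
        """Poll until nothing is pending, sleeping briefly when idle."""
        while self._pending:
            if self.poll() == 0:
                time.sleep(idle_sleep)
```

With pycuda the intended use would be roughly: launch a kernel on a stream, create an Event, call its record() on that stream, then watch() it with a closure that launches the next step. Callbacks may call watch() themselves; the new entry is picked up on the next pass.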

Thanks so much!

Anyone? Sorry, I know bumps are annoying, but this wasn’t supposed to be a difficult question. Thanks in advance.

Also, I did measure times with the CUDA profiler; they appear similar to my timers (nv prof cpu time was 0.96 of my measurement; nv prof gpu time was 0.91 of nv prof cpu time). Memcpys took a total of 0.03 of gpu time.

I have tested streams in a few implementations, and have found that how useful they are depends on the ratio between the size of the memory transfer being done and the runtime of the kernel. You should just try it and see what kind of result you get; it isn’t very difficult to implement.
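A back-of-the-envelope model of that ratio, as a sketch only: it assumes the work is split into equal chunks, that the copy for one chunk can fully overlap the kernel for the previous chunk, and it ignores launch overhead. The function names and the numbers are illustrative, not measurements.

```python
def serial_time(n_chunks, t_copy, t_kernel):
    """Copy then compute for each chunk, with nothing overlapped."""
    return n_chunks * (t_copy + t_kernel)


def overlapped_time(n_chunks, t_copy, t_kernel):
    """Two-stage pipeline: copy of chunk i+1 overlaps the kernel of chunk i.

    The first copy and the last kernel cannot be hidden behind anything.
    """
    return t_copy + (n_chunks - 1) * max(t_copy, t_kernel) + t_kernel


# Copies tiny relative to kernels (about the 0.03 ratio mentioned above):
small = (serial_time(10, 3, 100), overlapped_time(10, 3, 100))      # (1030, 1003)

# Copies as long as kernels: streams nearly halve the total.
balanced = (serial_time(10, 100, 100), overlapped_time(10, 100, 100))  # (2000, 1100)
```

Under these assumptions, overlapping copies saves under 3% when memcpys are around 0.03 of the kernel time, versus 45% when copy and kernel times are equal, so copy/compute overlap by itself would not be expected to help much in the original poster's case.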

I’m curious why you find that 56% of the time in kernels is better than 80%? Presumably, kernels are faster than running on the host and also prevent the need to memcpy data back and forth. So shouldn’t 99% of the time in kernels be more optimal, leaving the host to just be the conductor?

Or maybe I misunderstood what you mean by overhead? Are you pulling results from a CUDA_PROFILE run and seeing overhead in the CPU vs GPU time?

I think you’re right, but what I think he meant was that after optimizing his kernels, they were faster while the CPU overhead remained the same, thus reducing the proportion of time spent running the kernels. By using streams, he may be able to overlap execution of the kernels with the CPU overhead, and thus approach the 99% utilization mark.
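To put rough numbers on that argument, here is a small sketch. It assumes an idealized case where host overhead and kernel execution overlap perfectly once launches are asynchronous; gpu_utilization is just an illustrative helper.

```python
def gpu_utilization(kernel_share, overlapped):
    """Fraction of wall-clock time the GPU is busy.

    kernel_share: fraction of the *serialized* run spent in kernels.
    overlapped:   True if host overhead runs concurrently with the kernels.
    """
    host_share = 1.0 - kernel_share
    if overlapped:
        # Ideal overlap: the run takes as long as the slower of the two.
        wall = max(kernel_share, host_share)
    else:
        # Serialized: host work and kernels simply add up.
        wall = kernel_share + host_share
    return kernel_share / wall


before = gpu_utilization(0.56, overlapped=False)  # about 0.56
after = gpu_utilization(0.56, overlapped=True)    # 1.0, GPU becomes the limit
```

As long as kernels account for more than half of the serialized time, ideal overlap would leave the GPU as the bottleneck. In practice the host sometimes has to wait for a result (for example, reading back a few bytes to decide how many blocks to launch), so the real figure would land somewhere between the two.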

Sorry for the latency replying; yes, 56% was better because the kernels are taking less time (I cut my most expensive one in [less than] half with a complete ~500 line rewrite).

No, I’m not seeing much discrepancy in the CUDA Profiler’s GPU time vs CPU time (gpu time = 0.91 * cpu time); there seems to be some overhead in the Python code, which is quite odd as I’m not doing anything compute-intensive. I should probably try running with real-time priority as well. I’m running an average of around 300 kernels per second.
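One way to turn the vague "overhead" into a number is a per-launch time budget. This is only arithmetic on the figures quoted in this thread, and it assumes the 300 kernels per second and the 56% kernel share come from the same run.

```python
launches_per_second = 300
kernel_share = 0.56  # fraction of wall-clock time spent in kernels

per_launch_ms = 1000.0 / launches_per_second  # about 3.33 ms per launch
kernel_ms = per_launch_ms * kernel_share      # about 1.87 ms in the kernel
other_ms = per_launch_ms - kernel_ms          # about 1.47 ms of everything else
```

Roughly a millisecond and a half of non-kernel time per launch is a concrete target to go hunting for in the host code.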

As copied to Smokey’s post, shutting down X11 didn’t help. It might be a few days before I can give streams a try but I’ll copy any results here.

Thanks for your input,

Yeah, that would seem to indicate that your major bottleneck is now how fast your python code is submitting kernels for launch. All your other overheads are small.