Result of a CUBLAS function

Hello !

I’m using some CUBLAS functions that return a value (e.g. cublasDdot). The problem is that I need this value to stay in device memory so I can use it in a kernel that runs just after cublasDdot. Currently the value goes to host memory and I need a cudaMemcpy to get it back onto the device.

Is it possible to store the result of cublasDdot directly in device memory?

Just make the result an argument to the next kernel call? It’s one number, so I don’t see it adding to the kernel launch latency.
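For the record, a minimal sketch of that suggestion. The kernel name, launch shape, and vector sizes are illustrative, not from the original code; cublasDdot here is the legacy CUBLAS API that returns the scalar to the host:

```cuda
#include <cublas.h>

// Hypothetical follow-up kernel: the dot product arrives as a plain
// by-value argument, so no extra device allocation is needed.
__global__ void scaleByDot(double *x, double dot, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= dot;   // use the scalar directly
}

void step(double *d_x, double *d_y, int n)
{
    // Legacy CUBLAS: the result is returned on the host
    double dot = cublasDdot(n, d_x, 1, d_y, 1);
    scaleByDot<<<(n + 255) / 256, 256>>>(d_x, dot, n);
}
```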

I had this idea too. The problem is that I use this result in many kernels, including in another CUBLAS function. I think I’ll go for a memcpy. cudaprof gives me 3 µs for the transfer of the number, but as it’s inside a loop it’s 3 µs × 16384… :(

Too bad there’s no way to return this value in device memory.

If you use it with many kernel launches after the calculation, copy it to a constant memory variable. It will be faster than global memory and won’t use any shared memory, unlike passing it as a kernel argument.
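A sketch of that approach, with illustrative names: the host-side result is copied once into a `__constant__` variable with cudaMemcpyToSymbol, and every subsequent kernel reads it from the constant cache:

```cuda
#include <cublas.h>

__constant__ double c_dot;   // lives in constant memory, cached on-chip

// Any number of later kernels can read c_dot without taking it as an
// argument, so no shared memory is spent on parameter passing.
__global__ void useDot(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= c_dot;
}

void step(double *d_x, double *d_y, int n)
{
    double dot = cublasDdot(n, d_x, 1, d_y, 1);      // result on the host
    cudaMemcpyToSymbol(c_dot, &dot, sizeof(double)); // host -> constant mem
    useDot<<<(n + 255) / 256, 256>>>(d_x, n);
}
```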

The other alternative, of course, is to write your own inner product kernel instead of using BLAS dot. It is a very simple mathematical operation, and there really aren’t many flops in it, so even a naive version probably won’t be much different in performance from the CUBLAS version.
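A naive single-block version of such a kernel might look like this. It leaves its result in device memory, so no host round trip is needed; for large n a multi-block tree reduction would be faster, and the block size is just an illustrative choice:

```cuda
#define THREADS 256

__global__ void dotKernel(const double *x, const double *y,
                          double *result, int n)
{
    __shared__ double cache[THREADS];

    // Each thread accumulates a strided partial sum
    double sum = 0.0;
    for (int i = threadIdx.x; i < n; i += THREADS)
        sum += x[i] * y[i];
    cache[threadIdx.x] = sum;
    __syncthreads();

    // In-block tree reduction
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *result = cache[0];   // result stays in device memory
}

// Launch: dotKernel<<<1, THREADS>>>(d_x, d_y, d_result, n);
```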

Thanks ! Gonna try it now :)

Out of curiosity, what kind of linear algebra are you doing?

I am asking because we are considering a second API routine for xDOT that writes the result to device memory, but we would like to know if it is worth the effort.

This new API would also make more sense with the upcoming CUBLAS stream support.

Now we are talking! Any timeline for streams support? I have several applications that will greatly benefit from exposing streams in CUBLAS…

Coming very soon. It will be there in the 3.1 beta…