I am trying to compute the dot product of two vectors using a CUBLAS function. My problem is that I don't know how to get the calculated result off the device. Can someone tell me how to continue with this code, please?

One thing I don't really understand is how to use multithreading with CUBLAS. How is cublasSdot executed in my program? I guess there is only one thread working on it. Because cublasSdot is a host function, I can't put it in a kernel where I could work with thread operations. So how can multithreading be realised with CUBLAS functions? (I hope I didn't skip this part when reading the CUBLAS manual.)

I think the CUBLAS function executes the operation on the device and then copies the result back to the host.

There is no indication of this in the documentation, however, and the source code is no longer available.

Try using the CUBLAS dot product function on two very large vectors and compare it with a host version. It should be much faster if CUBLAS is using multi-threading.
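A host version for that comparison could look roughly like the sketch below. The names `host_sdot` and `time_host_sdot` are made up for illustration, and the code assumes positive increments only (real BLAS also accepts negative ones).

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Single-threaded reference dot product with the same argument order as
 * BLAS sdot. Assumption: incx and incy are positive. */
float host_sdot(int n, const float *x, int incx, const float *y, int incy)
{
    assert(x != NULL && y != NULL);
    assert(incx > 0 && incy > 0);

    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[(size_t)i * (size_t)incx] * y[(size_t)i * (size_t)incy];
    return sum;
}

/* Runs host_sdot `reps` times on unit-stride vectors, stores the last
 * result in *last, and returns the elapsed CPU time in seconds. */
double time_host_sdot(int n, const float *x, const float *y, int reps, float *last)
{
    assert(last != NULL);

    clock_t start = clock();
    float r = 0.0f;
    for (int k = 0; k < reps; ++k)
        r = host_sdot(n, x, 1, y, 1);
    *last = r;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

When timing the GPU side, it is probably worth calling cudaDeviceSynchronize() before stopping the timer, since many CUDA calls return to the host before the device has finished its work; otherwise the measured time can come out too small.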

Thank you for your hints. I used a 10000x10000 matrix and executed cublasSaxpy() a hundred times in a loop. The elapsed time with CUBLAS was 33 ms, while a self-written function doing the same operations took around 170000 ms, so it seems to be multi-threaded.