The return value of cublasSdot: must it be in CPU memory?

I used cublasSdot to compute the dot product “dp” of two vectors x and y, and tried to store it in device memory. However, when I copied “dp” from device memory to host memory, its value was always 0. If I stored “dp” in host memory directly, the result was correct. But if “dp” is in host memory, I think a saxpy that needs “dp” may be slow, since it must fetch “dp” from host memory. Is my understanding right? Thanks in advance.

I think I have solved this problem. I declared a variable in device memory:

__device__ float dp;

and used dp to store the inner product of the vector p with itself:

dp = cublasSdot(n, p, 1, p, 1);

I could directly use printf to output dp on the screen without calling cudaMemcpy to copy dp from device memory to host memory. :P

Doesn’t this just indicate that the return value is simply a regular host value?
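To illustrate the point about where the result lives, here is a minimal sketch. It assumes the newer cuBLAS v2 API (cublas_v2.h with a handle), which is not what the posts above use; the names d_p, d_y, d_dp and dp_host are made up for the example. In that API the result is written through a pointer, and cublasSetPointerMode selects whether that pointer refers to host or device memory, so a following saxpy can read the scalar straight from the device.

```cuda
// Build (assumed): nvcc dot_example.cu -lcublas
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;
    std::vector<float> h_p = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> h_y(n, 0.0f);

    float *d_p = nullptr, *d_y = nullptr, *d_dp = nullptr;
    cudaMalloc(&d_p, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMalloc(&d_dp, sizeof(float));
    cudaMemcpy(d_p, h_p.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_y, 0, n * sizeof(float));

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // Host pointer mode (the default): the result lands in a host variable.
    float dp_host = 0.0f;
    cublasSdot(handle, n, d_p, 1, d_p, 1, &dp_host);
    printf("dp (host pointer mode)   = %f\n", dp_host);

    // Device pointer mode: scalar arguments are device pointers,
    // so the result stays on the GPU.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasSdot(handle, n, d_p, 1, d_p, 1, d_dp);

    // y = dp * p + y, with the scalar dp read from device memory.
    cublasSaxpy(handle, n, d_dp, d_p, 1, d_y, 1);

    // Copy back only to inspect the results.
    float dp_copy = 0.0f;
    cudaMemcpy(&dp_copy, d_dp, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("dp (device pointer mode) = %f\n", dp_copy);
    printf("y[0] = %f\n", h_y[0]);

    cublasDestroy(handle);
    cudaFree(d_p);
    cudaFree(d_y);
    cudaFree(d_dp);
    return 0;
}
```

For the input above, p·p = 1 + 4 + 9 + 16 = 30, so both printed dot products should be 30 and y[0] should be 30 as well. Whether keeping the scalar on the device is noticeably faster than the host round trip depends on the problem size and is worth measuring.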

Did you call the cuBLAS function from within the kernel code?

No. I called cublas functions before and after the kernel.

I am calling cublasSdot inside the kernel, but the code just hangs. Everything else, such as cublasInit and creating the vectors on the device, is correct, and I can print correct values too. I also tried storing the return value in a __device__ variable, but with no success.

Can someone please help?
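As far as I know, the cuBLAS routines discussed here are host functions that are meant to be called from CPU code, not from inside a kernel. If the dot product really has to be produced in device code, one option is to write the reduction by hand. The following is only a sketch of that idea, not a replacement for cublasSdot: the kernel name dotKernel, the fixed block size of 256 and the launch configuration are assumptions made for the example, and it relies on float atomicAdd (compute capability 2.0 or later).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed; must be a power of two for the reduction below

__device__ float dp;  // the result lives in device memory

// Each block reduces its partial sums in shared memory,
// then one thread per block adds the block total to dp.
__global__ void dotKernel(const float *x, const float *y, int n) {
    __shared__ float partial[BLOCK_SIZE];

    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        sum += x[i] * y[i];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            partial[threadIdx.x] += partial[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        atomicAdd(&dp, partial[0]);
    }
}

int main() {
    const int n = 1024;
    std::vector<float> h_p(n, 1.0f);  // all ones, so p.p equals n

    float *d_p = nullptr;
    cudaMalloc(&d_p, n * sizeof(float));
    cudaMemcpy(d_p, h_p.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // A __device__ variable is set and read from the host via the symbol API.
    float zero = 0.0f;
    cudaMemcpyToSymbol(dp, &zero, sizeof(float));

    dotKernel<<<4, BLOCK_SIZE>>>(d_p, d_p, n);
    cudaDeviceSynchronize();

    float result = 0.0f;
    cudaMemcpyFromSymbol(&result, dp, sizeof(float));
    printf("dp = %f\n", result);

    cudaFree(d_p);
    return 0;
}
```

Note that the host code never assigns to dp directly; it goes through cudaMemcpyToSymbol and cudaMemcpyFromSymbol. Other kernels launched afterwards can read dp without any copy to the host.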