I used cublasSdot to compute the dot product “dp” of two vectors x and y, and tried to store it in device memory. However, when I copied “dp” from device memory back to host memory, its value was always 0. If I stored “dp” in host memory directly, the result was correct. But with “dp” in host memory, I think a saxpy that needs “dp” may be slow, since it has to fetch “dp” from host memory. Is my understanding right? Thanks in advance.
I think I have solved this problem. I declared a variable in device memory:
__device__ float dp;
and used dp to store the inner product of the vector p with itself:
dp = cublasSdot(n, p, 1, p, 1);
I could then print dp directly with printf, without calling cudaMemcpy to copy dp from device memory to host memory. :P
Doesn’t this just indicate that the return value is simply a regular host value?
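For what it's worth, one way to keep the result on the device is the handle-based API in cublas_v2.h, assuming a toolkit recent enough to ship it (the posts above use the older API, where cublasSdot returns a plain host float). With the handle switched to device pointer mode, cublasSdot writes its result through a device pointer and cublasSaxpy reads its scalar from the same place. A minimal sketch only: the vector size and values are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;                               // illustrative size
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);   // illustrative values

    float *x, *y, *dp;                             // all three live in device memory
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMalloc(&dp, sizeof(float));
    cudaMemcpy(x, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // From here on, scalar inputs and results are passed as device pointers.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    cublasSdot(handle, n, x, 1, y, 1, dp);    // dp = x . y, written to device memory
    cublasSaxpy(handle, n, dp, x, 1, y, 1);   // y = dp * x + y, scalar read from device memory

    // Copy back only to inspect the results.
    float hdp = 0.0f;
    cudaMemcpy(&hdp, dp, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hy.data(), y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("dp = %f, y[0] = %f\n", hdp, hy[0]);

    cublasDestroy(handle);
    cudaFree(x);
    cudaFree(y);
    cudaFree(dp);
    return 0;
}
```

In the default host pointer mode the same calls would expect dp to be a host pointer, which matches the behaviour seen with the older API.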
Did you call the cuBLAS function from within the kernel code?
No. I called cublas functions before and after the kernel.
I am calling cublasSdot inside the kernel, but the code just hangs. Everything else (cublasInit, creating the vectors on the device, etc.) is correct, and I can print correct values too. I also tried storing the return value in a __device__ variable, but with no success.
Can someone please help?
Thanks,
Aditi
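The cuBLAS calls used in this thread (cublasInit, cublasSdot) belong to the host API, so they are meant to be called from host code around a kernel launch, not from inside one. If the dot product really has to be computed in device code, one option is a hand-written reduction. A minimal sketch, assuming a single block is enough; the names dotKernel and BLOCK are made up for illustration, and this is not tuned the way cuBLAS is.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two for the reduction below

// Hypothetical kernel: a single block computes sum(x[i] * y[i]) and writes it to *result.
__global__ void dotKernel(int n, const float *x, const float *y, float *result) {
    __shared__ float partial[BLOCK];

    // Each thread accumulates a strided slice of the products.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sum += x[i] * y[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *result = partial[0];   // result stays in device memory
}

int main() {
    const int n = 1000;                    // illustrative size
    std::vector<float> hp(n, 1.0f);        // illustrative values

    float *p, *dp;
    cudaMalloc(&p, n * sizeof(float));
    cudaMalloc(&dp, sizeof(float));
    cudaMemcpy(p, hp.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dotKernel<<<1, BLOCK>>>(n, p, p, dp);  // must be launched with exactly one block

    // Copy back only to inspect the result; other kernels can read dp directly.
    float hdp = 0.0f;
    cudaMemcpy(&hdp, dp, sizeof(float), cudaMemcpyDeviceToHost);
    printf("p . p = %f\n", hdp);

    cudaFree(p);
    cudaFree(dp);
    return 0;
}
```

For large vectors a multi-block reduction, or simply calling cublasSdot from the host between kernels, is likely to be the better choice.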