cublasDdot intolerably slow in host pointer mode on Windows with CUDA 7.5, Tesla K40

gmandrew · November 13, 2015, 11:50pm

I noticed (via the Visual Studio code profiler) that cublasDdot, when called in host pointer mode, is absurdly slow: about 20x slower than calling it in device pointer mode and then manually copying the result back to the host with cudaMemcpy.

I am OK with my solution of transferring the result manually, but I wanted to post here because I presume this is a bug that should be addressed in future versions.
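
For anyone who wants to compare the two call paths, here is a minimal sketch of what I mean. It is not my actual code: the helper name `dot`, the error-check macros, and the tiny 3-element vectors are made up for illustration, and it assumes the default stream and the cuBLAS v2 API (`cublas_v2.h`). It only shows the two pointer modes side by side; it does not by itself measure or reproduce the slowdown.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Illustrative error-check macros (not part of CUDA or cuBLAS).
#define CUDA_CHECK(call) do { cudaError_t e_ = (call); \
    if (e_ != cudaSuccess) { std::fprintf(stderr, "CUDA error: %s\n", \
        cudaGetErrorString(e_)); std::exit(1); } } while (0)
#define CUBLAS_CHECK(call) do { cublasStatus_t s_ = (call); \
    if (s_ != CUBLAS_STATUS_SUCCESS) { std::fprintf(stderr, \
        "cuBLAS error %d\n", (int)s_); std::exit(1); } } while (0)

// Hypothetical helper: dot product of x and y (assumed equal length)
// using cublasDdot in the requested pointer mode.
double dot(const std::vector<double>& x, const std::vector<double>& y,
           cublasPointerMode_t mode) {
    const int n = static_cast<int>(x.size());
    const size_t bytes = n * sizeof(double);

    double *d_x = nullptr, *d_y = nullptr, *d_result = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));
    CUDA_CHECK(cudaMalloc(&d_result, sizeof(double)));
    CUDA_CHECK(cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice));

    cublasHandle_t handle;
    CUBLAS_CHECK(cublasCreate(&handle));
    CUBLAS_CHECK(cublasSetPointerMode(handle, mode));

    double result = 0.0;
    if (mode == CUBLAS_POINTER_MODE_HOST) {
        // Host pointer mode: cuBLAS writes the scalar to host memory
        // and the call returns once the result is available.
        CUBLAS_CHECK(cublasDdot(handle, n, d_x, 1, d_y, 1, &result));
    } else {
        // Device pointer mode: the scalar is written to device memory,
        // then copied back manually (the workaround described above).
        CUBLAS_CHECK(cublasDdot(handle, n, d_x, 1, d_y, 1, d_result));
        CUDA_CHECK(cudaMemcpy(&result, d_result, sizeof(double),
                              cudaMemcpyDeviceToHost));
    }

    CUBLAS_CHECK(cublasDestroy(handle));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));
    CUDA_CHECK(cudaFree(d_result));
    return result;
}

int main() {
    const std::vector<double> x = {1.0, 2.0, 3.0};
    const std::vector<double> y = {4.0, 5.0, 6.0};
    std::printf("host pointer mode:   %f\n",
                dot(x, y, CUBLAS_POINTER_MODE_HOST));
    std::printf("device pointer mode: %f\n",
                dot(x, y, CUBLAS_POINTER_MODE_DEVICE));
    return 0;
}
```

This should build with something like `nvcc dot.cu -lcublas` (the file name is arbitrary). Both calls should return the same value; the only difference is where cuBLAS writes the scalar result and who performs the copy back to the host.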

System properties:
Windows 8.1 OS
Visual Studio 2013
CUDA 7.5
Tesla K40 GPU

Robert_Crovella · November 14, 2015, 12:51am

Bugs should be filed at developer.nvidia.com.