CUBLAS scopy implementation

Hi everyone. I have been playing around with CUDA for a few days now and I seem to have got stuck and was wondering if anyone could help me.

I have been trying to implement my own version of the CUBLAS library’s cublasScopy which, in turn, is nVidia’s implementation of the BLAS scopy (definition). My reason for doing this is to help me understand the language and the idea of massively parallel processing.
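For anyone not familiar with scopy, here is a rough serial sketch of what the routine is expected to do, based on the standard BLAS definition: copy n single-precision values from x to y, stepping through each vector with its own stride. The name referenceScopy is made up for illustration; this is not the BLAS or CUBLAS source and it is not the code in my attachment.

```cpp
#include <cstddef>

// Serial sketch of BLAS scopy semantics (hypothetical name, for illustration only).
// Copies n floats from x to y; incx and incy are the strides between
// consecutive elements of x and y respectively.
void referenceScopy(int n, const float* x, int incx, float* y, int incy)
{
    if (n <= 0) {
        return;  // nothing to copy
    }

    // BLAS convention: a negative stride walks the vector backwards,
    // so the starting offset is (1 - n) * inc instead of 0.
    std::ptrdiff_t ix = (incx < 0) ? static_cast<std::ptrdiff_t>(1 - n) * incx : 0;
    std::ptrdiff_t iy = (incy < 0) ? static_cast<std::ptrdiff_t>(1 - n) * incy : 0;

    for (int i = 0; i < n; ++i) {
        y[iy] = x[ix];
        ix += incx;
        iy += incy;
    }
}
```

With incx = incy = 1 this is just a plain element-by-element copy, which is the case I am timing. A GPU version would presumably hand each loop iteration to a separate thread.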

I have attached the code, which compares the execution times of the same copy done three ways: scopy on the CPU using BLAS, cublasScopy using CUBLAS, and my own customScopy, which is in the attached .cu file (renamed with a .cpp extension so that I could upload it).

I was wondering if someone could please tell me why my implementation is approximately twice as slow as the CUBLAS implementation, and whether this is due to my lacking some fundamental understanding of the entire system.

I have a Quadro FX 4600 and am running Visual Studio 2005 on Windows XP.

Thanks in advance,
Jas (2.06 KB)
CUDA_Initial.cpp (7.79 KB)

cuBLAS source code is available.

Right… didn’t know that. I just assumed it would be closed source.

Just did a quick Google search and it seems it’s only available to registered developers. I suppose I’ll have to try and see if they’ll let me in.