Memory transfer from the CPU to device memory is time consuming. We can use either cuBLAS functions or CUDA memcpy functions.

I tried to transfer about 1 million points from CPU to GPU and observed that the CUDA function performed the copy in ~3 milliseconds, whereas the cuBLAS function took ~0.4 milliseconds.

My question is: cuBLAS is also built on the GPU, so what is so special about these functions, and why is this performance variation observed?

Appreciate your reply.

You mean cudaMemcpy?

I am not sure how you implemented it. But in my opinion, cudaMemcpy is not parallelized while cuBLAS is.

cublasSetVector, cublasGetVector, cublasSetMatrix, cublasGetMatrix
are thin wrappers around cudaMemcpy and cudaMemcpy2D. Therefore, no
significant performance differences are expected between the two sets
of copy functions.
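To illustrate the point, here is a minimal sketch showing the two APIs expressing the same host-to-device transfer; the buffer sizes and names are placeholders, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Both calls below move n floats from host to device.
void copy_both_ways(const float *h_x, float *d_x, int n)
{
    // Plain CUDA runtime copy.
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    // cuBLAS helper: the same transfer expressed as a vector copy
    // with unit strides (incx = incy = 1); internally it calls
    // cudaMemcpy, so the cost should be essentially identical.
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
}
```

If you still see a large timing gap, check your measurement: the first CUDA call in a process pays context-creation overhead, so warm up and time several iterations with cudaEvent timers.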

If you want to increase performance, use pinned memory to allocate the array/matrix on the host.
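A minimal sketch of a pinned-memory allocation, assuming a 1-million-element float buffer; page-locked host memory lets the copy engine use DMA, which typically speeds up host-to-device transfers:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 20;   // ~1 million floats (example size)
    float *h_x, *d_x;

    // Pinned (page-locked) host allocation instead of plain malloc.
    cudaMallocHost((void **)&h_x, n * sizeof(float));
    cudaMalloc((void **)&d_x, n * sizeof(float));

    // Host -> device copy; with pinned memory this uses DMA and is
    // usually noticeably faster than from pageable memory.
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_x);
    cudaFreeHost(h_x);       // pinned memory must be freed with cudaFreeHost
    return 0;
}
```

Note that pinned memory is a limited resource: over-allocating it can degrade overall system performance, so pin only the buffers you actually transfer.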