Transferring memory from the CPU to device memory is time consuming. We can use either CUBLAS functions or CUDA memcpy functions.
I tried transferring about 1 million points from the CPU to the GPU and observed that the CUDA function performed the copy in ~3 milliseconds, whereas the CUBLAS function took ~0.4 milliseconds.
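For anyone who wants to reproduce a comparison like this, here is a minimal sketch of how the two copies could be timed under identical conditions. It is not the code behind the numbers above. It assumes `float` elements, pageable host memory, `cudaMemcpy` and `cublasSetVector` as the two calls being compared, and one untimed warm-up transfer per call so that one-time initialization cost is not charged to either measurement. The buffer names and the `time_ms` helper are hypothetical.

```cuda
// Build (assumed): nvcc compare_copy.cu -lcublas -o compare_copy
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: elapsed milliseconds between two recorded events.
static float time_ms(cudaEvent_t start, cudaEvent_t stop) {
    float ms = 0.0f;
    cudaEventSynchronize(stop);              // wait until 'stop' has been reached
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main() {
    const int n = 1000000;                   // about 1 million points (float is an assumption)
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);  // pageable host buffer
    if (h_data == NULL) return 1;
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return 1;

    // Warm-up: run each call once, untimed, so context and library
    // initialization do not end up inside the timed region.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cublasSetVector(n, sizeof(float), h_data, 1, d_data, 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the CUDA runtime copy.
    cudaEventRecord(start);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    float ms_memcpy = time_ms(start, stop);

    // Time the CUBLAS helper copy of the same buffer.
    cudaEventRecord(start);
    cublasStatus_t st = cublasSetVector(n, sizeof(float), h_data, 1, d_data, 1);
    cudaEventRecord(stop);
    float ms_cublas = time_ms(start, stop);
    if (st != CUBLAS_STATUS_SUCCESS) return 1;

    printf("cudaMemcpy:      %.3f ms\n", ms_memcpy);
    printf("cublasSetVector: %.3f ms\n", ms_cublas);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Swapping the order of the two timed sections, or repeating each copy several times and averaging, would be a reasonable way to check whether the measured gap depends on which call runs first.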
My question is: CUBLAS is also built on top of CUDA, so what is so special about these functions, and why is this performance difference observed?
I would appreciate your reply.