In my CUDA program, I use cuBLAS quite extensively for matrix operations.
When I analyse the program with the NVIDIA Visual Profiler, the least efficient kernel turns out to be:
void swap_kernel<float, int=0>(cublasSwapParams<float>)
This kernel is called more than 50,000 times in my program, and it has a rank of 100 (i.e. the most inefficient) in the "Kernel Optimization Priorities" list.
I guess this is an internal cuBLAS function, but I couldn't find any information about it.
Could someone give me some hints on the purpose of this kernel? Which cuBLAS function calls it?
This is a screenshot of the profiler when I analyze the kernel: