CUBLAS swap_kernel (cublasSwapParams): mysterious most inefficient kernel

Dear all,

In my CUDA program, I use cuBLAS quite extensively for matrix operations.
When I analyze the program with the NVIDIA Visual Profiler, the most inefficient kernel turns out to be:

void swap_kernel<float, int=0>(cublasSwapParams<float>)

This kernel was launched more than 50,000 times in my program, and it has rank 100 (i.e. most inefficient) in the profiler's “Kernel Optimization Priorities” list.

I guess this is an internal function of cuBLAS, but I couldn’t find any information about it.

Could someone give me some hints on the purpose of this kernel? Which cuBLAS function calls it?

This is a screenshot of the profiler when I analyze the kernel:


You might want to find out where swap_kernel() is invoked. Use of the debugger could be helpful in this process.

Is your application code calling cublas[S|D|C|Z]swap() by any chance? I see you are using Thrust; is that calling cublas[S|D|C|Z]swap() somewhere?

Yes, you are right. I use cublasSswap() in my code, and even worse, it’s called inside a loop!
I wrote a new kernel that replaces everything cublasSswap() was doing. It works well now: the program runs much faster, and swap_kernel() no longer shows up in the profiler.
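
For readers with the same problem, here is a minimal sketch of what such a replacement could look like. The actual kernel from this thread was not posted, so everything below is an assumption: the names batched_row_swap and run_batched_swap_demo are hypothetical, the matrix is assumed to be row-major, and the swaps are assumed to be known up front and to touch disjoint rows. The idea is only to do many swaps in one launch instead of calling cublasSswap() once per loop iteration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical batched swap (not the kernel from the thread).
// Block p exchanges row a[p] with row b[p] of a row-major matrix m
// that has `cols` columns. Launch with one block per pair.
// Assumes the pairs touch disjoint rows, so blocks do not race.
__global__ void batched_row_swap(float* m, int cols, const int* a, const int* b)
{
    float* ra = m + (size_t)a[blockIdx.x] * cols;
    float* rb = m + (size_t)b[blockIdx.x] * cols;
    // Threads of the block stride over the columns of the two rows.
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        float t = ra[c];
        ra[c] = rb[c];
        rb[c] = t;
    }
}

// Host-side self-check: swaps rows (0,3) and (1,2) of a 4 x 1000 matrix
// in a single launch and verifies the result on the CPU.
bool run_batched_swap_demo()
{
    const int rows = 4, cols = 1000, pairs = 2;
    std::vector<float> h(rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            h[r * cols + c] = (float)r;   // every element stores its row index

    const int ha[pairs] = {0, 1};
    const int hb[pairs] = {3, 2};

    float* d_m = nullptr;
    int* d_a = nullptr;
    int* d_b = nullptr;
    const size_t bytes = h.size() * sizeof(float);

    bool ok = cudaMalloc(&d_m, bytes) == cudaSuccess
           && cudaMalloc(&d_a, sizeof(ha)) == cudaSuccess
           && cudaMalloc(&d_b, sizeof(hb)) == cudaSuccess
           && cudaMemcpy(d_m, h.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_a, ha, sizeof(ha), cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_b, hb, sizeof(hb), cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        batched_row_swap<<<pairs, 256>>>(d_m, cols, d_a, d_b);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h.data(), d_m, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // After the swaps, rows 0..3 should hold the values 3, 2, 1, 0.
    const float expected[rows] = {3.0f, 2.0f, 1.0f, 0.0f};
    for (int r = 0; ok && r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (h[r * cols + c] != expected[r]) { ok = false; break; }

    cudaFree(d_m);
    cudaFree(d_a);
    cudaFree(d_b);
    return ok;
}
```

With the cuBLAS version, each cublasSswap(handle, n, x, incx, y, incy) call in the loop launches its own small swap_kernel, which is why the profiler showed tens of thousands of launches. Whether a batched launch like this applies depends on whether the swaps in your loop are independent of each other; if each swap depends on the result of the previous iteration, the swapping would have to be folded into the kernel that produces that result instead.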