What's the overhead of calling CUBLAS APIs? In my application it is 9 ms.

Hello All,

I’ve implemented a kernel for a specific code. If I make a cublasSasum call (or any other cublas API call, for that matter) immediately after kernel execution, the call takes around 20 mSec.

But if the call is not made directly after kernel execution, the same cublas API call takes only ~0.15 mSec.

To make it clear with an example:

Case 1:

Kernel call: function<<<grid, block>>>(function parameters);

cublasSasum();

Case 2:

Kernel call: function<<<grid, block>>>(function parameters);

Some other processing OR a Sleep for ~50 mSec

cublasSasum();

In case 1, the time taken by cublasSasum is ~20 mSec, and in case 2 it is ~0.15 mSec.

Can someone please help me understand this behaviour?

Maybe your kernel execution takes about 20 mSec. Please check it.

Haiquan

No. My kernel execution takes only 0.1 milliseconds.

Did you consider the asynchronous nature of kernel calls?

How do you get your 0.1 ms? Do you call cudaThreadSynchronize() after kernel execution?
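
To illustrate the point about asynchronous launches, here is a minimal sketch of how the two cases could be timed. It rests on several assumptions that are not from the original post: busyKernel is a hypothetical stand-in for the poster's kernel, the array size and launch configuration are arbitrary, and it uses the newer cublas_v2 handle-based API together with cudaDeviceSynchronize(), which replaces the now-deprecated cudaThreadSynchronize(). A kernel launch returns to the host almost immediately, so a host timer placed around the launch alone measures only launch overhead. A following cublasSasum call that returns its result to the host has to wait for the kernel to finish first, so in case 1 the kernel's run time is likely being attributed to the cublas call.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical stand-in for the poster's kernel: some busy work per element.
__global__ void busyKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 10000; ++k) v = v * 0.999f + 0.001f;
        x[i] = v;
    }
}

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed on the host since t0.
static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    const int n = 1 << 20;              // arbitrary size, chosen for illustration
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float *d_x = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_x, 0, n * sizeof(float));

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;
    float sum = 0.0f;

    // Warm-up so one-time initialization cost is not included in the timings.
    busyKernel<<<grid, block>>>(d_x, n);
    cublasSasum(handle, n, d_x, 1, &sum);
    cudaDeviceSynchronize();

    // Case 1: cublasSasum immediately after the (asynchronous) launch.
    Clock::time_point t0 = Clock::now();
    busyKernel<<<grid, block>>>(d_x, n);
    double launchMs = msSince(t0);      // launch overhead only, not kernel run time
    t0 = Clock::now();
    cublasSasum(handle, n, d_x, 1, &sum);
    double case1Ms = msSince(t0);       // includes waiting for the kernel to finish

    // Case 2: wait for the kernel explicitly, then time cublasSasum alone.
    t0 = Clock::now();
    busyKernel<<<grid, block>>>(d_x, n);
    cudaDeviceSynchronize();
    double kernelMs = msSince(t0);      // launch plus actual kernel execution
    t0 = Clock::now();
    cublasSasum(handle, n, d_x, 1, &sum);
    double case2Ms = msSince(t0);

    printf("launch only:              %.3f ms\n", launchMs);
    printf("kernel (synchronized):    %.3f ms\n", kernelMs);
    printf("cublasSasum, case 1:      %.3f ms\n", case1Ms);
    printf("cublasSasum, case 2:      %.3f ms\n", case2Ms);

    cublasDestroy(handle);
    cudaFree(d_x);
    return 0;
}
```

This should build with something like nvcc timing.cu -lcublas (the file name is arbitrary). If the "kernel (synchronized)" figure turns out close to the case 1 figure for cublasSasum, that would suggest the 0.1 ms measurement covered only the launch and not the kernel's execution.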