Hello All,
I’ve implemented a kernel for a specific piece of code. If I make a cublasSasum call (or any other cuBLAS API call, for that matter) immediately after launching the kernel, the call takes around 20 ms.
But if the call is not made directly after the kernel launch, the same cuBLAS call takes only ~0.15 ms.
To make this clear with an example:
Case 1:
Kernel Call : function<<<grid, block>>>(function parameters);
cublasSasum();
Case 2:
Kernel Call : function<<<grid, block>>>(function parameters);
Some other processing OR a Sleep for ~50 ms
cublasSasum();
In case 1, cublasSasum takes ~20 ms; in case 2, it takes ~0.15 ms.
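For reference, here is a minimal sketch of how the two cases could be timed on the host. It assumes the cuBLAS v2 API (cublas_v2.h), which may differ from the cuBLAS version the real code uses, and busyKernel and timeSasumAfterKernel are hypothetical names standing in for the real kernel and timing code. Since a kernel launch returns to the host before the kernel has finished, a host timer around cublasSasum in Case 1 may also be counting the remaining kernel run time; calling cudaDeviceSynchronize() before starting the timer should separate the two.

```cuda
#include <chrono>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical stand-in for the real kernel: each thread runs a loop
// so that the kernel takes a noticeable amount of time.
__global__ void busyKernel(float *x, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < iters; ++k)
            v = v * 0.999f + 0.001f;
        x[i] = v;
    }
}

// Launches the kernel, then measures only the cublasSasum call with a
// host clock and returns the elapsed time in milliseconds.
// syncFirst == false mirrors Case 1: the timer starts right after the launch.
// syncFirst == true waits for the kernel to finish before the timer starts.
double timeSasumAfterKernel(cublasHandle_t handle, float *d_x, int n, bool syncFirst)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    busyKernel<<<blocks, threads>>>(d_x, n, 10000);  // returns immediately on the host

    if (syncFirst)
        cudaDeviceSynchronize();  // block the host until the kernel is done

    float sum = 0.0f;
    auto t0 = std::chrono::steady_clock::now();
    // With the default host pointer mode, the call returns only once the
    // result has been written to sum, so it has to wait for earlier GPU work.
    cublasSasum(handle, n, d_x, 1, &sum);
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

If the time measured with syncFirst set to true drops to roughly the Case 2 figure, the extra time in Case 1 is probably the kernel itself rather than cuBLAS.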
Can someone please help me understand this behaviour?