Simultaneous cuBLAS runs


Please advise me how to solve the following problem in an elegant way. I need to run several cublasSgemm calls on large arrays and, at the same time, compute something on the CPU. The algorithm looks like this:

for (i = 0; i < K; i++)
{
    cublasSgemm(… A+i*M, …, A+(i+1)*M, …);
    … run CPU part of my algorithm …
}

I have no estimates of the running times of the CPU and GPU parts, and each call to cublasSgemm requires the data produced by the previous call.

To make these calls correctly I should run something like:

for (i = 0; i < K; i++)
{
    cublasSgemm(… A+i*M, …, A+(i+1)*M, …);
}

However, in this case the process will be blocked.

Am I right that I can run the two parts simultaneously only if I start two threads, one handling only the GPU part of my computation and the other only the CPU part, or is there a cleverer way to do this?
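For concreteness, here is a minimal sketch of the two-thread layout I have in mind. It is plain C++ with std::thread, not real CUDA code: gpu_step and cpu_step are hypothetical placeholders standing in for the cublasSgemm call and for the CPU part, and I assume the CPU part does not touch A while the worker thread is running.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for one cublasSgemm step: block i+1 of A is
// computed from block i. In the real program the cublasSgemm call
// (and any waiting it needs) would go here instead.
void gpu_step(std::vector<float>& A, int i, int M) {
    for (int j = 0; j < M; ++j)
        A[(i + 1) * M + j] = 2.0f * A[i * M + j];
}

// Hypothetical stand-in for the CPU part of iteration i.
long cpu_step(int i) {
    return static_cast<long>(i) * i;
}

// Runs the K chained "GPU" steps on a worker thread while the calling
// thread does the CPU part. A must hold at least (K + 1) * M floats.
long run_overlapped(std::vector<float>& A, int K, int M) {
    std::thread gpu_thread([&A, K, M] {
        // Steps run strictly in order, because step i+1 needs the result of step i.
        for (int i = 0; i < K; ++i)
            gpu_step(A, i, M);
    });

    long cpu_result = 0;
    for (int i = 0; i < K; ++i)
        cpu_result += cpu_step(i);  // overlaps with the worker thread

    gpu_thread.join();  // wait for the whole chain before A is read again
    return cpu_result;
}
```

Calling run_overlapped(A, K, M) from main would return the CPU result once both parts have finished. Only the worker blocks on each step, so the CPU loop never waits for an individual GPU call.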

Thank you in advance!