CUBLAS functions in a kernel

I am writing my first CUDA app (so pardon me if my understanding is off).
I am trying to port a small program to CUDA. It has a couple of core functions that contain loops with calls to BLAS functions. Besides calling the BLAS routines, each loop iteration also accesses the array variables directly. Plain C code linked against the CUBLAS library would therefore need a host/device data transfer in every iteration, which would obviously be wasteful. I thought I should declare the whole core function as __global__, but now nvcc errors out because I am trying to call a __host__ function from a __global__ function.

What am I missing?

I added a declaration for the CUBLAS function with the __device__ keyword and it compiled. Is this the right way to go?

I guess the __device__ function declaration doesn't work. It now gives an error that external calls are not supported (found non-inlined call to cublasIsamax).
Any other ideas?

CUBLAS functions cannot be called from device code because the CUBLAS code needs to set up a thread grid itself to do the work, which it can't do from inside an already running thread grid.

You can write a kernel just to do your custom calculation between CUBLAS calls. This avoids the device->host copy each iteration, but still lets you call CUBLAS from the host for the other parts of the calculation.
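A minimal sketch of that pattern, under some assumptions: it uses the handle-based cublas_v2 API (which may differ from the CUBLAS version discussed in this thread), the kernel name halve_selected and the "halve the largest element" step are made up for illustration, and error checking is left out for brevity. With the pointer mode set to CUBLAS_POINTER_MODE_DEVICE, cublasIsamax writes its result into device memory, so nothing is copied back to the host inside the loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical custom step between CUBLAS calls: halve the element that
// cublasIsamax selected. The index is read straight from device memory.
__global__ void halve_selected(float *x, const int *idx)
{
    x[*idx - 1] *= 0.5f;  // CUBLAS returns 1-based indices
}

int main()
{
    const int n = 4;
    float h_x[n] = {1.0f, -8.0f, 3.0f, 5.0f};

    float *d_x = nullptr;
    int *d_idx = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_idx, sizeof(int));

    // One host->device copy before the loop.
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Ask CUBLAS to write scalar results to device memory.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    for (int iter = 0; iter < 3; ++iter) {
        // CUBLAS call issued from the host, operating on device data.
        cublasIsamax(handle, n, d_x, 1, d_idx);
        // Custom kernel on the same stream consumes the result on the device.
        halve_selected<<<1, 1>>>(d_x, d_idx);
    }

    // One device->host copy after the loop.
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%g %g %g %g\n", h_x[0], h_x[1], h_x[2], h_x[3]);

    cublasDestroy(handle);
    cudaFree(d_idx);
    cudaFree(d_x);
    return 0;
}
```

This should build with something like nvcc demo.cu -lcublas (the file name is arbitrary). Both the CUBLAS call and the kernel run on the default stream here, so they execute in the order they were issued.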

Thanks for the info, seibert. I understand.

So, related to this: if I write my custom computation as another kernel, how do I best pass data between the kernels? A __device__ variable in global memory?

__device__ float temp;    // type assumed to be float for illustration
mykernel<<< >>>();        // kernel sets the temp variable
// Do I have to fetch temp here explicitly from device memory to pass it as an argument in this host function?
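One possible arrangement, sketched under assumptions: a __device__ variable lives in global memory and keeps its value across kernel launches, so a second kernel can read it directly without any host round trip. The host only needs an explicit copy (cudaMemcpyFromSymbol) if host code itself wants the value. The kernel names producer and consumer, the float type, and the sum/normalize computation are made up for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Lives in global memory; persists between kernel launches.
__device__ float temp;

// Hypothetical first kernel: a single thread sums the vector into temp.
__global__ void producer(const float *x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += x[i];
    temp = s;
}

// Hypothetical second kernel: every thread reads temp from device memory.
__global__ void consumer(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] /= temp;
}

int main()
{
    const int n = 4;
    float h_x[n] = {2.0f, 4.0f, 6.0f, 4.0f};

    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    producer<<<1, 1>>>(d_x, n);   // sets temp on the device
    consumer<<<1, n>>>(d_x, n);   // uses temp, no host involvement

    // Only needed if the host wants the value of temp as well.
    float h_temp = 0.0f;
    cudaMemcpyFromSymbol(&h_temp, temp, sizeof(float));

    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("temp=%g x=%g %g %g %g\n", h_temp, h_x[0], h_x[1], h_x[2], h_x[3]);

    cudaFree(d_x);
    return 0;
}
```

An alternative with the same effect is to cudaMalloc a small buffer and pass its device pointer to both kernels as an argument; the pointer is passed from the host, but the data it points to never leaves the device.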