I am writing my first CUDA app (so pardon me if my understanding is off).
I am trying to port a small program to CUDA. It has a couple of core functions containing loops that call BLAS functions. Apart from the BLAS calls, each loop iteration also reads and writes the array variables directly. A plain C version linked against the cuBLAS library would therefore transfer data between host and device in every iteration, which would obviously be a bad idea. I thought I should declare the whole core function as __global__, but now nvcc errors out because I am calling a "__host__" function from a __global__ function.
What am I missing?
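For reference, here is a minimal sketch of the structure commonly suggested for this kind of loop. The arrays are copied to the device once, the cuBLAS routines (which are host functions that launch their own kernels) are called from the host loop with device pointers, and the non-BLAS per-element work moves into a small custom kernel. It assumes the cuBLAS v2 API (cublas_v2.h) and uses a hypothetical saxpy-plus-halving loop in place of the real core function; halve_elements and run_iterations are illustrative names only, and error checking is left out for brevity.

```cuda
// Sketch: keep the data resident on the GPU for the whole loop.
// cuBLAS calls stay in host code (with device pointers); the per-iteration
// element work goes into a custom __global__ kernel.
// Assumes the cuBLAS v2 API; names below are hypothetical placeholders.
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical element-wise update that the original loop did by hand:
// here, halve every element of y.
__global__ void halve_elements(float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] *= 0.5f;
    }
}

// Runs `iters` iterations of: y = alpha * x + y (cuBLAS), then y *= 0.5
// (custom kernel). Data goes to the device once and comes back once.
std::vector<float> run_iterations(const std::vector<float> &x_host,
                                  std::vector<float> y_host,
                                  float alpha, int iters)
{
    int n = static_cast<int>(x_host.size());
    size_t bytes = n * sizeof(float);

    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, bytes);
    cudaMalloc(&y, bytes);
    cudaMemcpy(x, x_host.data(), bytes, cudaMemcpyHostToDevice);  // one upload
    cudaMemcpy(y, y_host.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iters; ++it) {
        // BLAS call issued from the host; x and y are device pointers,
        // so no host/device transfer happens here.
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);
        // Non-BLAS array work as a kernel on the same (default) stream,
        // so it runs after the saxpy has finished.
        halve_elements<<<blocks, threads>>>(y, n);
    }

    // Blocking copy on the default stream: waits for the queued work.
    cudaMemcpy(y_host.data(), y, bytes, cudaMemcpyDeviceToHost);  // one download

    cublasDestroy(handle);
    cudaFree(x);
    cudaFree(y);
    return y_host;
}
```

Built with something like `nvcc file.cu -lcublas`, calling run_iterations with x filled with 1.0f, y with 0.0f, alpha = 2.0f and 3 iterations should leave every element of y at 1.75 (0 → 1 → 1.5 → 1.75). The point of the design is that the host only orchestrates; the arrays never leave the device inside the loop.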