Which approach will give better performance? Can cuFFT already be called from within the device?

Dear All

I have to launch 16 processes and then parallelize all the code. My machine has 24 cores (2 CPUs with 12 cores each) and a K40. Will I get better performance by creating 16 OpenMP threads and calling the K40 from each of them? Or is it better to launch 16 threads on the K40 and then parallelize inside the K40?

In both cases I can allocate the memory on the K40 from the host during program initialization, and after initialization I only need one transfer to the device and one transfer from the device in each iteration.

Another question: has a version of cuFFT that can be called from the device already been released?

Thanks

Luis Gonçalves
There is already

Maybe you should explain your problem in more detail.
What do you mean by

I mean launching a single kernel with 16 threads (with the call made inside the device) versus running 16 OpenMP threads (which then call the device from the host).

Thanks

Luis Gonçalves

As far as I know, calling cuFFT from device code is not yet supported.
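
For the host-side variant from the question (16 OpenMP threads, each calling the device), a minimal sketch could look like the code below. It is only an illustration under assumptions: the names N_WORKERS and FFT_SIZE, the transform length, the complex-to-complex type and the build line are all hypothetical, and error checking is left out for brevity. Each worker gets its own cuFFT plan and CUDA stream, since a plan should not be executed from several host threads at once.

```cuda
// Assumed build line: nvcc -Xcompiler -fopenmp sketch.cu -lcufft
#include <cuda_runtime.h>
#include <cufft.h>
#include <omp.h>

#define N_WORKERS 16    // hypothetical: one worker per "process" in the question
#define FFT_SIZE  1024  // hypothetical transform length

int main() {
    // Allocate all device memory once, during initialization.
    cufftComplex *d_data = NULL;
    size_t bytes = sizeof(cufftComplex) * N_WORKERS * FFT_SIZE;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemset(d_data, 0, bytes);  // placeholder for the real host-to-device copy

    // Each loop iteration is one worker; OpenMP spreads them over host threads.
    #pragma omp parallel for num_threads(N_WORKERS)
    for (int i = 0; i < N_WORKERS; ++i) {
        cudaStream_t stream;
        cufftHandle plan;
        cudaStreamCreate(&stream);

        // Private plan per worker, bound to the worker's own stream.
        cufftPlan1d(&plan, FFT_SIZE, CUFFT_C2C, 1);
        cufftSetStream(plan, stream);

        // In-place forward FFT on this worker's slice of the device buffer.
        cufftComplex *slice = d_data + (size_t)i * FFT_SIZE;
        cufftExecC2C(plan, slice, slice, CUFFT_FORWARD);
        cudaStreamSynchronize(stream);

        cufftDestroy(plan);
        cudaStreamDestroy(stream);
    }

    cudaFree(d_data);
    return 0;
}
```

Whether this beats other arrangements would have to be measured on the actual workload. Note also that the last argument of cufftPlan1d is a batch count, so if the 16 transforms have the same size, a single host thread could issue all of them with one plan and one call.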