Dear All
I have to launch 16 processes and then parallelize all the code. My machine has 24 cores (2 CPUs with 12 cores each) and a K40. Will I get better performance by creating 16 OpenMP threads on the host and having each one call the K40? Or is it better to launch 16 threads on the K40 and then parallelize inside the K40?
In both cases I can allocate the memory on the K40 from the host during program initialization. After initialization, I only need one transfer to the device and one transfer from the device in each iteration.
Another question: has a version of cuFFT that can be called from the device already been released?
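To make the first option concrete, here is a minimal sketch of 16 OpenMP host threads sharing one GPU, each with its own CUDA stream, with device memory allocated once and exactly one host-to-device and one device-to-host copy per iteration. The kernel `scale_kernel`, the function `run_demo`, and the buffer sizes are hypothetical placeholders for the real per-process work. This only illustrates the pattern and does not show which option is faster.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-task work: doubles every element.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs num_tasks independent tasks, one per OpenMP thread and CUDA stream.
// Returns the number of wrong elements, or -1 if a CUDA call failed.
int run_demo(int num_tasks = 16, int n = 1 << 16, int iterations = 4) {
    size_t total = (size_t)num_tasks * n;
    float* h_buf = nullptr;  // pinned host memory, needed for truly async copies
    float* d_buf = nullptr;  // device memory, allocated once at initialization

    if (cudaMallocHost((void**)&h_buf, total * sizeof(float)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void**)&d_buf, total * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }
    for (size_t i = 0; i < total; ++i) h_buf[i] = 1.0f;

    std::vector<cudaStream_t> streams(num_tasks);
    for (auto& s : streams) cudaStreamCreate(&s);

    int failed = 0;
    // One host thread per task; all threads share the same device context.
    #pragma omp parallel for num_threads(num_tasks) reduction(+ : failed)
    for (int t = 0; t < num_tasks; ++t) {
        float* h = h_buf + (size_t)t * n;
        float* d = d_buf + (size_t)t * n;
        for (int it = 0; it < iterations; ++it) {
            // One transfer to the device per iteration.
            cudaMemcpyAsync(d, h, n * sizeof(float),
                            cudaMemcpyHostToDevice, streams[t]);
            scale_kernel<<<(n + 255) / 256, 256, 0, streams[t]>>>(d, n);
            // One transfer back from the device per iteration.
            cudaMemcpyAsync(h, d, n * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[t]);
            if (cudaStreamSynchronize(streams[t]) != cudaSuccess) ++failed;
        }
    }

    // Each element starts at 1 and is doubled once per iteration.
    float expected = (float)(1 << iterations);
    int mismatches = 0;
    for (size_t i = 0; i < total; ++i)
        if (h_buf[i] != expected) ++mismatches;

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return failed ? -1 : mismatches;
}
```

Calling `run_demo()` from `main` and building with something like `nvcc -Xcompiler -fopenmp` should exercise the pattern. Without the OpenMP flag the pragma is ignored and the tasks simply run one after another on the host, still using separate streams.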
Thanks
Luis Gonçalves
There is already