I am using CUDA for a simulation. Since each node has 16 AMD CPU cores and one K20 card (ORNL Titan), I want to make the best use of both the CPU and the GPU. So I divide my problem between the CPU and the GPU, then use OpenMP to create 16 threads: one thread drives CUDA (memcpy, kernel launches, ...) and the other 15 do the CPU computation.
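The structure I'm using looks roughly like this (a minimal sketch, not my actual code; `gpu_kernel`, `one_iteration`, and the sizes are placeholder names):

```cuda
#include <omp.h>
#include <cuda_runtime.h>

__global__ void gpu_kernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;           // stand-in for the real simulation step
}

void one_iteration(float *h_data, float *d_data, int n_gpu, int n_cpu) {
    #pragma omp parallel num_threads(16)
    {
        if (omp_get_thread_num() == 0) {
            // thread 0: drive the GPU part (copies + kernel launch)
            cudaMemcpy(d_data, h_data, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
            gpu_kernel<<<(n_gpu + 255) / 256, 256>>>(d_data, n_gpu);
            cudaMemcpy(h_data, d_data, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
        } else {
            // threads 1..15: each works on its share of the CPU part
            int t = omp_get_thread_num() - 1;
            int chunk = n_cpu / 15;
            for (int i = t * chunk; i < (t + 1) * chunk; ++i)
                h_data[n_gpu + i] *= 2.0f;  // stand-in for the CPU computation
        }
    }
}
```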
But in the end I found that it takes longer than running GPU-only.
I measured the time spent on the GPU-related work: the memcpy takes longer when the OpenMP threads are running (from 484 s up to 752 s, over 600 iterations). Why is that?
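For the measurement, I time each copy and accumulate over the iterations, roughly like this (a sketch using CUDA events; `d_data`, `h_data`, and `bytes` are placeholder names, not my real variables):

```cuda
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);   // per-iteration time, summed over 600 iterations

cudaEventDestroy(start);
cudaEventDestroy(stop);
```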
My decomposition between CPU and GPU (about 1:10) is balanced; the two parts take roughly the same time.
To verify this, I kept N cores busy in another process and ran the simulation GPU-only; this also affects the simulation time. Roughly, the more cores I keep busy (up to 15, leaving 1 for CUDA), the longer the simulation takes.
Could someone help me? Thanks a lot!