Now that my compilation issues are solved, I'm seeing some very unexpected results.
I use the Thrust library to handle all of my CUDA work.
The code takes an average of 90 seconds to run on the GPU (assuming I compiled everything correctly), but only an average of 6 seconds using OpenMP.
I understand that there is some overhead in copying data to the GPU. However, my code prints status updates as it runs, and they scroll by quickly under OpenMP but much more slowly under CUDA.
I was very careful to ensure that the data is copied into `device_vector`s ONCE, at initialization. After that, 99% of the work goes through `thrust::transform`, `thrust::reduce`, and `thrust::reduce_by_key` (which all run nicely and quickly under OpenMP).
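For reference, the copy-once pattern I'm describing looks roughly like this (a minimal sketch with illustrative names and a placeholder functor, not my actual code; compiled with nvcc):

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main()
{
    thrust::host_vector<float> h_data(1 << 20, 1.0f);

    // One-time host-to-device copy at initialization.
    thrust::device_vector<float> d_data = h_data;
    thrust::device_vector<float> d_out(d_data.size());

    // All subsequent work operates on device_vectors directly,
    // so no further host<->device transfers should occur here.
    thrust::transform(d_data.begin(), d_data.end(), d_out.begin(),
                      thrust::negate<float>());
    float sum = thrust::reduce(d_out.begin(), d_out.end(), 0.0f);

    (void)sum;
    return 0;
}
```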
So, it's time to diagnose the problem.
- Is it possible that I compiled or linked something incorrectly, and that this would cause slow execution under the "default" CUDA configuration?
- What debugging tool would you recommend for discovering which parts of my code are running slowly? (Something like gprof, but CUDA-aware?)
Any and all suggestions are welcome.