OpenMP faster than GPU?

Hi,

After getting my compilation issues solved, I'm seeing some very unexpected results.

I use the Thrust library to handle all of my CUDA work.

The code takes an average of 90 seconds to run using the GPU (assuming that I compiled everything correctly).
The code takes an average of 6 seconds to run using OpenMP.

I understand that there is some overhead in copying data to the GPU. However, my code prints status updates as it runs, and they fly up the screen under OpenMP but scroll much more slowly under CUDA.

I was very careful to ensure that the data is copied into device_vectors ONCE at initialization. After that, 99% of the work goes through thrust::transform, thrust::reduce, and thrust::reduce_by_key (all of which run quickly and nicely using OpenMP).
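For context, the overall pattern looks roughly like this (a minimal sketch with made-up data, not my actual code):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <vector>
    #include <iostream>

    int main() {
        std::vector<float> host_data(1 << 20, 1.0f);   // hypothetical input

        // One-time host-to-device copy at initialization.
        thrust::device_vector<float> d_data(host_data.begin(), host_data.end());
        thrust::device_vector<float> d_out(d_data.size());

        // All subsequent work stays on the device.
        thrust::transform(d_data.begin(), d_data.end(), d_out.begin(),
                          thrust::negate<float>());
        float sum = thrust::reduce(d_out.begin(), d_out.end(), 0.0f,
                                   thrust::plus<float>());

        std::cout << "sum = " << sum << std::endl;
        return 0;
    }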

So, time to diagnose the problems.

  1. Is there a possibility that I didn't compile things correctly, or am not linking to a library correctly, which would cause slow execution under the "default" CUDA configuration? (See the sketch after this list for one way I plan to check.)
  2. What would be a recommended debugging tool to discover which parts of my code are running slowly? (Something like gprof, but CUDA-aware?)
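On point 1, one thing I plan to rule out is that the binary was accidentally built against Thrust's OpenMP or serial backend instead of CUDA. Something like this should print the compiled-in backend (assuming a Thrust version that exposes its configuration macros, which I believe recent ones do):

    #include <thrust/device_vector.h>  // pulls in Thrust's configuration macros
    #include <cstdio>

    int main() {
        // THRUST_DEVICE_SYSTEM reports which backend the code was compiled against.
    #if THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA
        std::printf("Thrust device backend: CUDA\n");
    #elif THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_OMP
        std::printf("Thrust device backend: OpenMP\n");
    #else
        std::printf("Thrust device backend: other (serial/TBB)\n");
    #endif
        return 0;
    }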

Any and all suggestions are welcome.

Thanks!

Hi noah977,

You can use a profiler to measure how long each part of your program takes. There are two: a command-line profiler and a visual one (nvvp, the NVIDIA Visual Profiler).
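If you just want rough timings without the profiler, you can also bracket a section of your program with CUDA events. A minimal sketch (the section being timed is a placeholder):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        // ... the Thrust calls or kernel launches you want to time go here ...
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   // wait for the GPU to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("section took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }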

BTW, there are several factors that can reduce the performance of your program:

  1. Too much global memory access in the kernel, especially when most of the accesses cannot be coalesced (see the sketch after this list);
  2. Automatic variables that must be spilled into local memory;
  3. Shared memory bank conflicts;
  4. Lots of conditional branches in the kernel, especially when they diverge within a warp;
  5. Too many thread-synchronization calls;
  6. And so on.
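To illustrate point 1, here is a hypothetical pair of kernels reading the same data. The first makes coalesced accesses (adjacent threads read adjacent addresses); the second makes strided accesses, which is typically much slower:

    // Coalesced: thread i reads element i, so a warp touches one contiguous segment.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: adjacent threads read addresses `stride` elements apart,
    // so each warp's loads are scattered across many memory segments.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (int)(((long long)i * stride) % n);  // hypothetical access pattern
        if (i < n) out[i] = in[j];
    }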

You can use nvvp to locate the most time-consuming part, optimize it, then move on to the next.

Finally, and most importantly: is your problem actually suited to GPU acceleration? A GPU cannot make a fish swim faster.

Best regards!

Thanks for the suggestions.

As I wrote above, I'm using the Thrust library for all of the CUDA work. It is supposed to take care of optimizing memory, threads, blocks, etc. internally, so consequently I have no idea how those are being handled.

My problem is definitely suited to the GPU and to parallelism: it consists of many reductions and accumulations.
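For what it's worth, the keyed reductions follow a pattern roughly like this (a minimal sketch with made-up data, not my actual code):

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>

    int main() {
        // Keys must be pre-sorted so equal keys are adjacent (a reduce_by_key requirement).
        int   h_keys[] = {0, 0, 1, 1, 1, 2};
        float h_vals[] = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};

        thrust::device_vector<int>   keys(h_keys, h_keys + 6);
        thrust::device_vector<float> vals(h_vals, h_vals + 6);
        thrust::device_vector<int>   out_keys(6);
        thrust::device_vector<float> out_sums(6);

        // Sums values that share a key: keys {0,1,2} -> sums {3, 12, 6}.
        auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                          out_keys.begin(), out_sums.begin());
        (void)ends;
        return 0;
    }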