OpenMP faster than GPU?

Hi,

After getting my compilation issues solved, I'm seeing some very unexpected results.

I use the Thrust library to handle all of my CUDA work.

The code takes an average of 90 seconds to run using the GPU (assuming that I compiled everything correctly).
The code takes an average of 6 seconds to run using OpenMP.

I understand that there is some overhead in copying data to the GPU. However, my code prints status updates as it runs, and they fly up the screen under OpenMP but scroll much more slowly under CUDA.

I was very careful to ensure that the data is copied into device_vectors ONCE at initialization. After that, 99% of the work goes through thrust::transform, thrust::reduce, and thrust::reduce_by_key (all of which run quickly and nicely using OpenMP).
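For context, the overall pattern looks roughly like this (a minimal sketch with made-up data, not my actual code):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <vector>
    #include <iostream>

    int main() {
        std::vector<float> host_data(1 << 20, 1.0f);   // hypothetical input

        // One-time host-to-device copy at initialization.
        thrust::device_vector<float> d_data(host_data.begin(), host_data.end());
        thrust::device_vector<float> d_out(d_data.size());

        // All subsequent work stays on the device.
        thrust::transform(d_data.begin(), d_data.end(), d_out.begin(),
                          thrust::negate<float>());
        float sum = thrust::reduce(d_out.begin(), d_out.end(), 0.0f,
                                   thrust::plus<float>());

        std::cout << "sum = " << sum << std::endl;
        return 0;
    }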

So, time to diagnose the problems.

  1. Is there a possibility that I didn't compile things correctly, or am not linking to a library correctly, which would cause slow execution under the "default" CUDA configuration? (See the sketch after this list for one way I plan to check.)
  2. What would be a recommended debugging tool to discover which parts of my code are running slowly? (Something like gprof, but CUDA-aware?)
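On point 1, one thing I plan to rule out is that the binary was accidentally built against Thrust's OpenMP or serial backend instead of CUDA. Something like this should print the compiled-in backend (assuming a Thrust version that exposes its configuration macros, which I believe recent ones do):

    #include <thrust/device_vector.h>  // pulls in Thrust's configuration macros
    #include <cstdio>

    int main() {
        // THRUST_DEVICE_SYSTEM reports which backend the code was compiled against.
    #if THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA
        std::printf("Thrust device backend: CUDA\n");
    #elif THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_OMP
        std::printf("Thrust device backend: OpenMP\n");
    #else
        std::printf("Thrust device backend: other (serial/TBB)\n");
    #endif
        return 0;
    }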

Any and all suggestions are welcome.

Thanks!

Hi noah977,

You can use a profiler to measure how long each part of your program takes. There are two: a command-line profiler and a visual one (nvvp, the NVIDIA Visual Profiler).
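If you just want rough timings without the profiler, you can also bracket a section of your program with CUDA events. A minimal sketch (the section being timed is a placeholder):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        // ... the Thrust calls or kernel launches you want to time go here ...
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   // wait for the GPU to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("section took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }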

BTW, there are several factors that can reduce the performance of your program:

  1. Too much global memory access in the kernel, especially when most of the accesses cannot be coalesced (see the sketch after this list);
  2. Automatic variables that must be spilled into local memory;
  3. Shared memory bank conflicts;
  4. Lots of conditional branches in the kernel, especially when they diverge within a warp;
  5. Too many thread-synchronization calls;
  6. And so on.
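To illustrate point 1, here is a hypothetical pair of kernels reading the same data. The first makes coalesced accesses (adjacent threads read adjacent addresses); the second makes strided accesses, which is typically much slower:

    // Coalesced: thread i reads element i, so a warp touches one contiguous segment.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: adjacent threads read addresses `stride` elements apart,
    // so each warp's loads are scattered across many memory segments.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (int)(((long long)i * stride) % n);  // hypothetical access pattern
        if (i < n) out[i] = in[j];
    }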

You can use nvvp to locate the most time-consuming part, optimize it, then move on to the next.

Finally, and most importantly: is your problem actually suited to GPU acceleration? A GPU cannot make a fish swim faster.

Best regards!

Thanks for the suggestions.

As I wrote above, I'm using the Thrust library for all of the CUDA work. It is supposed to take care of optimizing memory, threads, blocks, etc. internally, so consequently I have no idea how those are being handled.

My problem is definitely suited to the GPU and to parallelism: it consists of many reductions and accumulations.
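For what it's worth, the keyed reductions follow a pattern roughly like this (a minimal sketch with made-up data, not my actual code):

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>

    int main() {
        // Keys must be pre-sorted so equal keys are adjacent (a reduce_by_key requirement).
        int   h_keys[] = {0, 0, 1, 1, 1, 2};
        float h_vals[] = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};

        thrust::device_vector<int>   keys(h_keys, h_keys + 6);
        thrust::device_vector<float> vals(h_vals, h_vals + 6);
        thrust::device_vector<int>   out_keys(6);
        thrust::device_vector<float> out_sums(6);

        // Sums values that share a key: keys {0,1,2} -> sums {3, 12, 6}.
        auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                          out_keys.begin(), out_sums.begin());
        (void)ends;
        return 0;
    }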