Difference in Performance

We are working on optimizing some algorithms using CUDA.

To benchmark the GPU speed against the CPU speed,

  1. Initially I had two separate files in the VC++ project solution: a .cpp file containing the sequential algorithm (which runs on the CPU) and a .cu file containing the parallel algorithm (which runs on the GPU).

  2. Later I merged the sequential and parallel algorithms into a single .cu file (a minimal sketch of the merged setup follows below).
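The merged .cu file is essentially of this shape (a minimal sketch with a trivial add workload, not our actual algorithm; the timer calls are left out):

// minimal sketch: the same workload implemented once for the CPU
// and once for the GPU, in one .cu file
#include <cuda_runtime.h>

#define N (1 << 20)

void sequentialAdd(float *a)            // sequential version, runs on the CPU
{
    for (int i = 0; i < N; i++)
        a[i] += 1.0f;
}

__global__ void parallelAdd(float *a)   // parallel version, runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        a[i] += 1.0f;
}

int main()
{
    static float h_a[N];
    sequentialAdd(h_a);                 // wrap with a CPU timer

    float *d_a;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    parallelAdd<<<N / 256, 256>>>(d_a); // wrap with cudaEvent timers
    cudaThreadSynchronize();
    cudaFree(d_a);
    return 0;
}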

When I profiled the above two programs, I got two different results for GPU speedup. When I checked why this was so, I found that the difference is in the CPU timings, whereas the GPU performs consistently! Can anyone explain why this is happening? I am also working to find out more on this…

–Raghavendra Ganji

It is probably due to different compiler optimization options being passed when the .cpp is compiled by Visual Studio versus when the host code generated by nvcc is compiled.

Probably what MisterAnderson42 said.

Which one was faster?

Guys,

It happens that the “Debug” configuration of VC++ uses “/O2” as the optimization flag!

However, NVCC uses NO optimization when it compiles host code (/Od) – this results in exaggerated GPU speedups!
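For illustration, the difference boils down to something like this (file names made up; these are not the exact VS-generated command lines):

cl /O2 sobel_cpu.cpp               (VC++ compiling the .cpp with optimization)
nvcc -c sobel.cu -Xcompiler /Od    (the build rule compiling host code without it)
nvcc -c sobel.cu -Xcompiler /O2    (what you need for a fair comparison)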

For example, the Sobel filter CPU code without optimization (NVCC compilation) completed in around 430 ms. With optimization (VC++ in the Debug configuration), it completes in around 150 ms. This is a huge factor!

This could be because of the nature of the Sobel algorithm – which is nothing but two nested FOR loops (sketched below)! We even tried a CPU code that just executes a simple FOR loop! Even for this code, the 3x difference exists between NVCC and the VC++ default compilation!
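For reference, the CPU code is essentially of this shape (a rough sketch, not our exact implementation):

// rough sketch: Sobel on an 8-bit grayscale image – two nested FOR loops
#include <cstdlib>   // abs()

void sobelCPU(const unsigned char *img, unsigned char *out,
              int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int gx = -    img[(y-1)*width + x-1] +     img[(y-1)*width + x+1]
                     -2 * img[ y   *width + x-1] + 2 * img[ y   *width + x+1]
                     -    img[(y+1)*width + x-1] +     img[(y+1)*width + x+1];
            int gy = -    img[(y-1)*width + x-1] - 2 * img[(y-1)*width + x]
                     -    img[(y-1)*width + x+1] +     img[(y+1)*width + x-1]
                     +2 * img[(y+1)*width + x]   +     img[(y+1)*width + x+1];
            int mag = abs(gx) + abs(gy);
            out[y*width + x] = (unsigned char)(mag > 255 ? 255 : mag);
        }
    }
}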

This could probably be because of efficient loop unrolling by VC++!

So,

Before you profile your CPU code and report speedups, it is absolutely necessary that you compare against heavily compiler-optimized CPU code!

I did NOT realize this for so long! I have to go and re-do all the speedups that I calculated for the financial algorithms! :-(

Sigh…

NVIDIA guys,

Can you fix this so that NVCC uses the same optimization that VC++ uses for the corresponding build configuration (i.e. Debug/Release etc.)?

Best Regards,

Sarnath

How could they possibly fix this? nvcc is a command line app, no different from gcc on Linux. gcc doesn’t magically read your mind and use optimization options when you need them: you specify them on the command line, just like you can do with nvcc and -Xcompiler.
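For example (illustrative command lines):

gcc -O2 foo.c                  (optimization requested explicitly)
nvcc -Xcompiler /O2 foo.cu     (the nvcc equivalent: forward /O2 to the host compiler)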

NVIDIA sets /O2 and other optimization settings in the SDK template project for release builds (I just checked CUDA SDK 2.0b2), so what more do you want from them?

Projects that use CMake don’t have a problem either, because CMake properly sets up the compile flags for generated cpp files. I’ve never used the VS build rule, so I don’t know how it handles the situation.

@MrAnderson42,

Yeah, what you say sounds good! My suggestion indeed looks stupid!

But NVCC could interpret the build environment and use the same optimization flag that VC++ (or whatever host compiler) uses! That would make everything very consistent!

Because NVCC is the one that spawns the host compiler – so I would think it is NVCC’s responsibility to make sure that there is no difference between a CPP file compilation and the host part of a CU file compilation in the same project!

If you want to change something – you always can!

By the way, the /O2 flag is used by VC++ in a “Debug” build! And I am not sure what optimization flag VC++ uses for “Release” code! If that is different, the SDK samples are also not consistent!

Also, Raghu had observed that NVCC passes any optimization level greater than 2 (given through the -O option) to the VC++ compiler as /O2 only! Not sure why this is so. I need to go and do a more thorough check on this.

Best Regards,
Sarnath

The concept of a “build environment” in VS is just a way of selecting different command line options, even for cl. cl cannot magically read the build environment set by VS except through the command line options passed to it by VS, so how could nvcc? Setting the custom build rules to pass the correct optimization settings to nvcc is the only way, and it is indeed what NVIDIA already does in their template project!

Where do you see this? In the custom build rule for debug in the template project, I have:

-Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd

No /O2, so I have no idea what you are talking about.

-Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd

Open a normal CPP project in Visual Studio (Debug configuration), compile it, and look at the command line! You can see /O2 in the command line!

However, the template project seems to be using /Od, which is not right.

Yeah, I agree with this! But there must be a way of insulating this from the programmer! Probably some script that reads “$(ConfigurationName)” and spawns NVCC with appropriate optimization options – something like the sketch below!
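The custom build command could look along these lines, where NVCC_HOST_FLAGS would be a hypothetical per-configuration user macro (the $(InputPath)/$(IntDir)/$(InputName) macros are real VS macros; the macro name is made up):

nvcc.exe -c "$(InputPath)" -o "$(IntDir)\$(InputName).obj" -Xcompiler $(NVCC_HOST_FLAGS)

with NVCC_HOST_FLAGS set to /O2,/MT for “Release” and /Od,/Zi,/RTC1,/MTd for “Debug”.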

IMHO, if you’re about to compare different implementations of an algorithm, you must control every parameter throughout the build process by hand, or at least check any default settings that are set or passed through.
This is the only way you can be sure you’re comparing the difference between implementations (e.g. CPU vs GPU) and not some option like compiler optimization.
That excludes bugs in either one of the implementations of course, which can be a pain; a thorough and complete test procedure should help avoid those.

Thanks for your comments Steve!

I guess I was a bit naive :-)

I wish this thread helped a few guys out there!

Thank you guys,

Best Regards,
Sarnath

A few questions stand out… If a CPU compiler can optimize to an extent where performance increases nearly 3x, what about the CUDA compiler?

The only optimization option that I find in NVCC (--optimize) is for the HOST code.

What about CUDA code? Is it possible that a compiler optimization can boost up CUDA performance as well?

Does anyone have any idea about CUDA 2.0? If I migrate from CUDA 1.1, am I guaranteed some speedup? I think I have asked this question before, but I am not able to find that thread myself!

Appreciate your inputs!

Best Regards,
Sarnath

First – I have not tested it. But without optimizations (partly to ease debugging), many compilers do stupid things like always writing a variable out to memory and reading it back into a register again – I have never seen nvcc do this (no surprise; I guess that would probably cost more like a factor of 100 in performance).

In addition, x86 (and to a slightly lesser extent x86_64) has very few registers, and instructions with very different performance characteristics (even when they perform almost the same operation).

GPU code partly does not have these issues, and partly they are things nvcc is not responsible for (e.g. register “management” happens in the ptx->cubin step).

I am sure that nvcc can still be improved (as I reported in some other thread, it sometimes splits up float4 reads, resulting in uncoalesced access), but the control logic in GPUs is so simple (hardware designers, please do not feel insulted :) ) that performance depends almost entirely on your code – the minor changes a compiler can make hardly ever have a chance to make a big difference.

The only thing I’d recommend to look out for is if you do lots of things like integer multiplications/divisions with constants that could be rewritten as shifts.
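For example (a minimal sketch – for literal constants the compiler may well do this by itself):

// strength reduction by hand inside device code
__device__ unsigned int shiftDemo(unsigned int idx, unsigned int tid)
{
    unsigned int offset  = idx * 16;   // multiply by a power-of-two constant...
    unsigned int offset2 = idx << 4;   // ...equals a left shift
    unsigned int warp    = tid / 32;   // unsigned division by a power of two...
    unsigned int warp2   = tid >> 5;   // ...equals a right shift
    return offset + offset2 + warp + warp2;   // keep the values live
}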

Interesting… But I have a vague suspicion that “common subexpression” elimination is not being done for CUDA. I am just wondering if the 2-stage compilation is limiting CUDA from optimizing efficiently… Just my wild guess. If someone from NVIDIA could talk about this, it would be great!
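What I mean by that, as a contrived example (values hypothetical):

// with CSE, (a + b) would be computed only once
__device__ void cseDemo(float a, float b, float c, float d,
                        float *u, float *v)
{
    *u = (a + b) * c;      // (a + b) here...
    *v = (a + b) * d;      // ...and again here: a common subexpression
    // hand-optimized form, if the compiler does not do it for you:
    // float s = a + b;  *u = s * c;  *v = s * d;
}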

Hmm… That’s sad…

Yeah, I do. But they can’t be represented by shifts… they are all floating point.

I hear people talking about MAD (multiply-and-add) – do you have any idea how to use this in a CUDA program?
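From what I can make out of the programming guide (a sketch – someone please correct me if I am wrong), you don’t write MAD explicitly; the compiler contracts a multiply followed by an add on its own:

__global__ void madExample(float *out, const float *x, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a * x[i] + b;   // typically contracted into a single MAD
    // To PREVENT the contraction (keeping IEEE rounding of the
    // intermediate product), the guide lists these intrinsics:
    // out[i] = __fadd_rn(__fmul_rn(a, x[i]), b);
}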

2 points!

Today I did a small experiment! I profiled a FOR loop using /O2 and /Od.

This is the for loop:

for (i = 0; i < NUM_ELEMENTS; i++)
    a[i] = a[i] + i + 1;

The /O2 option speeds this code up by 5 times!

The reasons for the dismal performance of /Od were:

  1. The index variable “i” is stored to and retrieved from the stack again and again!

  2. No memory locations are cached in registers – as if they had been declared volatile (see the illustration after this list).

  3. Unnecessary loads (even though the content is already available in a register, the compiler chooses to load it again).
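You can get a feel for (2) even under /O2 by declaring the variable volatile (a rough illustration using the same a[] and NUM_ELEMENTS as above, not the actual /Od output):

// volatile forces the same store/reload pattern that /Od produces
volatile int i;
for (i = 0; i < NUM_ELEMENTS; i++)
    a[i] = a[i] + i + 1;   // i is re-read from the stack every iteration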

/O2 generates highly efficient code with fewer instructions as well!

So, it was not related to FOR loop unrolling as I was thinking before!

VC++ may or may NOT set /O2 as a default for your project – this depends on the type of project that you are creating and, of course, your current configuration!

If you start an empty project in VC++, /O2 is enabled for the “Debug” configuration.

If you start a console app, /Od is used for the “Debug” configuration.

Be aware!