Difference in Performance

We are working on optimizing some algorithms using CUDA.

To benchmark the GPU speed against the CPU speed,

  1. Initially I had two separate files in the VC++ project solution: a .cpp file containing the sequential algorithm (which runs on the CPU) and a .cu file containing the parallel algorithm (which runs on the GPU).

  2. Later I merged the sequential and parallel algorithms into a single .cu file (a minimal sketch of the merged setup follows below).
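The merged .cu file is essentially of this shape (a minimal sketch with a trivial add workload, not our actual algorithm; the timer calls are left out):

// minimal sketch: the same workload implemented once for the CPU
// and once for the GPU, in one .cu file
#include <cuda_runtime.h>

#define N (1 << 20)

void sequentialAdd(float *a)            // sequential version, runs on the CPU
{
    for (int i = 0; i < N; i++)
        a[i] += 1.0f;
}

__global__ void parallelAdd(float *a)   // parallel version, runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        a[i] += 1.0f;
}

int main()
{
    static float h_a[N];
    sequentialAdd(h_a);                 // wrap with a CPU timer

    float *d_a;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    parallelAdd<<<N / 256, 256>>>(d_a); // wrap with cudaEvent timers
    cudaThreadSynchronize();
    cudaFree(d_a);
    return 0;
}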

When I profiled the above two programs, I got two different results for GPU speedup. When I checked why this was so, I found that the difference is in the CPU timings, whereas the GPU performs consistently! Can anyone explain why this is happening? I am also working to find out more on this…

–Raghavendra Ganji

It is probably due to different compiler optimization options being passed when the .cpp is compiled by Visual Studio versus when the host code generated by nvcc is compiled.

Probably what MisterAnderson42 said.

Which one was faster?

Guys,

It happens that the “Debug” configuration of VC++ uses “/O2” as the optimization flag!

However, NVCC uses NO optimization when it compiles host code (/Od) – this results in exaggerated GPU speedups!
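For illustration, the difference boils down to something like this (file names made up; these are not the exact VS-generated command lines):

cl /O2 sobel_cpu.cpp               (VC++ compiling the .cpp with optimization)
nvcc -c sobel.cu -Xcompiler /Od    (the build rule compiling host code without it)
nvcc -c sobel.cu -Xcompiler /O2    (what you need for a fair comparison)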

For example, the Sobel filter CPU code without optimization (NVCC compilation) completed in around 430 ms. With optimization (VC++ in the Debug configuration), it completes in around 150 ms. This is a huge factor!

This could be because of the nature of the Sobel algorithm – which is nothing but two nested FOR loops (sketched below)! We even tried a CPU code that just executes a simple FOR loop! Even for this code, the 3x difference exists between NVCC and the VC++ default compilation!
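For reference, the CPU code is essentially of this shape (a rough sketch, not our exact implementation):

// rough sketch: Sobel on an 8-bit grayscale image – two nested FOR loops
#include <cstdlib>   // abs()

void sobelCPU(const unsigned char *img, unsigned char *out,
              int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int gx = -    img[(y-1)*width + x-1] +     img[(y-1)*width + x+1]
                     -2 * img[ y   *width + x-1] + 2 * img[ y   *width + x+1]
                     -    img[(y+1)*width + x-1] +     img[(y+1)*width + x+1];
            int gy = -    img[(y-1)*width + x-1] - 2 * img[(y-1)*width + x]
                     -    img[(y-1)*width + x+1] +     img[(y+1)*width + x-1]
                     +2 * img[(y+1)*width + x]   +     img[(y+1)*width + x+1];
            int mag = abs(gx) + abs(gy);
            out[y*width + x] = (unsigned char)(mag > 255 ? 255 : mag);
        }
    }
}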

This could probably be because of efficient loop unrolling by VC++!

So,

Before you profile your CPU code and report speedups, it is absolutely necessary that you compare against heavily compiler-optimized CPU code!

I did NOT realize this for so long! I have to go and re-do all the speedups that I calculated for the financial algorithms! :-(

Sigh…

NVIDIA guys,

Can you fix this so that NVCC uses the same optimization that VC++ uses for the corresponding build configuration (i.e. Debug/Release etc.)?

Best Regards,

Sarnath

How could they possibly fix this? nvcc is a command line app, no different from gcc on Linux. gcc doesn’t magically read your mind and use optimization options when you need them: you specify them on the command line, just like you can do with nvcc and -Xcompiler.
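For example (illustrative command lines):

gcc -O2 foo.c                  (optimization requested explicitly)
nvcc -Xcompiler /O2 foo.cu     (the nvcc equivalent: forward /O2 to the host compiler)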

NVIDIA sets /O2 and other optimization settings in the SDK template project for release builds (I just checked CUDA SDK 2.0b2), so what more do you want from them?

Projects that use CMake don’t have a problem either, because CMake properly sets up the compile flags for generated cpp files. I’ve never used the VS build rule, so I don’t know how it handles the situation.

@MrAnderson42,

Yeah, what you say sounds good! My suggestion indeed looks stupid!

But NVCC could interpret the build environment and use the same optimization flag that VC++ (or whatever host compiler) uses! That would make everything very consistent!

Because NVCC is the one that spawns the host compiler – so I would think it is NVCC’s responsibility to make sure that there is no difference between a CPP file compilation and the host part of a CU file compilation in the same project!

If you want to change something – you always can!

By the way, the /O2 flag is used by VC++ in a “Debug” build! And I am not sure what optimization flag VC++ uses for “Release” code! If that is different, the SDK samples are also not consistent!

Also, Raghu had observed that NVCC passes any optimization level greater than 2 (given through the -O option) to the VC++ compiler as /O2 only! Not sure why this is so. I need to go and do a more thorough check on this.

Best Regards,
Sarnath

The concept of a “build environment” in VS is just a way of selecting different command line options, even for cl. cl cannot magically read the build environment set by VS except through the command line options passed to it by VS, so how could nvcc? Setting the custom build rules to pass the correct optimization settings to nvcc is the only way, and it is indeed what NVIDIA already does in their template project!

Where do you see this? In the custom build rule for debug in the template project, I have:

-Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd

No /O2, so I have no idea what you are talking about.

-Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd

Open a normal CPP project in Visual Studio (Debug configuration), compile it, and look at the command line! You can see /O2 in the command line!

However, the template project seems to be using /Od, which is not right.

Yeah, I agree with this! But there must be a way of insulating this from the programmer! Probably some script that reads “$(ConfigurationName)” and spawns NVCC with appropriate optimization options – something like the sketch below!
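The custom build command could look along these lines, where NVCC_HOST_FLAGS would be a hypothetical per-configuration user macro (the $(InputPath)/$(IntDir)/$(InputName) macros are real VS macros; the macro name is made up):

nvcc.exe -c "$(InputPath)" -o "$(IntDir)\$(InputName).obj" -Xcompiler $(NVCC_HOST_FLAGS)

with NVCC_HOST_FLAGS set to /O2,/MT for “Release” and /Od,/Zi,/RTC1,/MTd for “Debug”.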

IMHO, if you’re about to compare different implementations of an algorithm, you must control every parameter throughout the build process by hand, or at least check any default settings that are set or passed through.
This is the only way you can be sure you’re comparing the difference between implementations (e.g. CPU vs GPU) and not some option like compiler optimization.
That excludes bugs in either one of the implementations of course, which can be a pain; a thorough and complete test procedure should help avoid those.

Thanks for your comments Steve!

I guess I was a bit naive :-)

I wish this thread helped a few guys out there!

Thank you guys,

Best Regards,
Sarnath

A few questions stand out… If a CPU compiler can optimize to an extent where performance increases nearly 3x, what about the CUDA compiler?

The only optimization option that I find in NVCC (--optimize) is for the HOST code.

What about CUDA code? Is it possible that a compiler optimization can boost up CUDA performance as well?

Does anyone have any idea about CUDA 2.0? If I migrate from CUDA 1.1, am I guaranteed some speedup? I think I have asked this question before, but I am not able to find that thread myself!

Appreciate your inputs!

Best Regards,
Sarnath

First – I have not tested it. But without optimizations (partly to ease debugging), many compilers do stupid things like always writing a variable out to memory and reading it back into a register again – I have never seen nvcc do this (no surprise; I guess that would probably cost more like a factor of 100 in performance).

In addition, x86 (and to a slightly lesser extent x86_64) has very few registers, and instructions with very different performance characteristics (even when they perform almost the same operation).

GPU code partly does not have these issues, and partly they are things nvcc is not responsible for (e.g. register “management” happens in the ptx->cubin step).

I am sure that nvcc can still be improved (as I reported in some other thread, it sometimes splits up float4 reads, resulting in uncoalesced access), but the control logic in GPUs is so simple (hardware designers, please do not feel insulted :) ) that performance depends almost entirely on your code – the minor changes a compiler can make hardly ever have a chance to make a big difference.

The only thing I’d recommend to look out for is if you do lots of things like integer multiplications/divisions with constants that could be rewritten as shifts.
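For example (a minimal sketch – for literal constants the compiler may well do this by itself):

// strength reduction by hand inside device code
__device__ unsigned int shiftDemo(unsigned int idx, unsigned int tid)
{
    unsigned int offset  = idx * 16;   // multiply by a power-of-two constant...
    unsigned int offset2 = idx << 4;   // ...equals a left shift
    unsigned int warp    = tid / 32;   // unsigned division by a power of two...
    unsigned int warp2   = tid >> 5;   // ...equals a right shift
    return offset + offset2 + warp + warp2;   // keep the values live
}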

Interesting… But I have a vague suspicion that “common subexpression” elimination is not being done for CUDA. I am just wondering if the 2-stage compilation is limiting CUDA from optimizing efficiently… Just my wild guess. If someone from NVIDIA could talk about this, it would be great!
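What I mean by that, as a contrived example (values hypothetical):

// with CSE, (a + b) would be computed only once
__device__ void cseDemo(float a, float b, float c, float d,
                        float *u, float *v)
{
    *u = (a + b) * c;      // (a + b) here...
    *v = (a + b) * d;      // ...and again here: a common subexpression
    // hand-optimized form, if the compiler does not do it for you:
    // float s = a + b;  *u = s * c;  *v = s * d;
}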

Hmm… That’s sad…

Yeah, I do. But they can’t be represented by shifts… they are all floating point.

I hear people talking about MAD (multiply-and-add) – do you have any idea how to use this in a CUDA program?
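From what I can make out of the programming guide (a sketch – someone please correct me if I am wrong), you don’t write MAD explicitly; the compiler contracts a multiply followed by an add on its own:

__global__ void madExample(float *out, const float *x, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a * x[i] + b;   // typically contracted into a single MAD
    // To PREVENT the contraction (keeping IEEE rounding of the
    // intermediate product), the guide lists these intrinsics:
    // out[i] = __fadd_rn(__fmul_rn(a, x[i]), b);
}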

2 points!

Today I did a small experiment! I profiled a FOR loop using /O2 and /Od.

This is the for loop:

for (i = 0; i < NUM_ELEMENTS; i++)
    a[i] = a[i] + i + 1;

The /O2 option speeds this code up by 5 times!

The reasons for the dismal performance of /Od were:

  1. The index variable “i” is stored to and retrieved from the stack again and again!

  2. No memory locations are cached in registers – as if they had been declared volatile (see the illustration after this list).

  3. Unnecessary loads (even though the content is already available in a register, the compiler chooses to load it again).
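You can get a feel for (2) even under /O2 by declaring the variable volatile (a rough illustration using the same a[] and NUM_ELEMENTS as above, not the actual /Od output):

// volatile forces the same store/reload pattern that /Od produces
volatile int i;
for (i = 0; i < NUM_ELEMENTS; i++)
    a[i] = a[i] + i + 1;   // i is re-read from the stack every iteration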

/O2 generates highly efficient code with fewer instructions as well!

So, it was not related to FOR loop unrolling as I was thinking before!

VC++ may or may NOT set /O2 as a default for your project – this depends on the type of project that you are creating and, of course, your current configuration!

If you start an empty project in VC++, /O2 is enabled for the “Debug” configuration.

If you start a console app, /Od is used for the “Debug” configuration.

Be aware!