I wonder if anyone might help me to understand my speed-ups.
I have a simple Monte-Carlo type simulation for portfolio valuation. It’s quite similar to the Monte-Carlo option pricer in the SDK. I have two CUDA implementations, one in double and one in single precision. The double precision CUDA implementation achieves 70-80x speed-up over my equivalent sequential C++ implementation. That’s pretty good and about what I’d expect.
However, the single precision CUDA implementation achieves as much as a 1300x speed-up over the sequential impl’… That’s really, really fast! But I can’t quite figure out how it’s possible…
A few details follow:
I pre-generate my randoms and store them in global memory. That might sound dumb, but my own investigation has shown that in this case I get better performance that way. Besides, my accesses are perfectly coalesced, so that should hide most of the latency.
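For what it’s worth, here is a minimal sketch of the layout I mean (mcKernel, the names and the trivial “payoff” are made up for illustration, not my actual code): the randoms for step s of all paths are stored contiguously, so neighbouring threads always read neighbouring addresses.

__global__ void mcKernel(const float *randoms, float *payoffs,
                         int numPaths, int numSteps)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int path = tid; path < numPaths; path += stride) {
        float value = 0.0f;
        for (int step = 0; step < numSteps; ++step) {
            // Element (step * numPaths + path): for a fixed step, consecutive
            // threads read consecutive addresses, so the load is coalesced.
            float z = randoms[step * numPaths + path];
            value += z;   // stand-in for the real path update / payoff
        }
        payoffs[path] = value;
    }
}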
Using the --ptxas-options=-v compile flag I note that the single precision impl’ uses just 22 registers and the double precision impl’ uses exactly twice that. Could it be that the (single precision) kernel is executing entirely in the register file? Hence the awesome performance…
I’m running on a single Tesla C1060. In all cases only a single application is using the GPU.
I’ve been trying to figure this one out for a while so if anybody has any ideas how this kind of performance is possible I’d be very interested to hear…
PS. Yes! The single precision impl’ IS actually doing the work, I’ve checked that…
I wouldn’t be surprised if your speedups were 10 times smaller than what you’ve quoted. As it is, I just don’t believe it. If it is real, then you either have a really badly written sequential algorithm or a really slow CPU.
I’d carefully check that your timing is being done correctly. What exactly are you timing?
The speed-ups I quote are based on total execution time. I also break down the times further and in all cases, sequential and CUDA, 99% is spent in the Monte-Carlo routine.
@Tigga, I haven’t looked at the sequential impl’ in detail for a long time, but it’s such a simple app I’d be surprised if there’s a problem there. I’ll take another look when I’m done writing this post. My CPUs are Xeons clocked at 2.5GHz, so they should perform OK.
I’m assuming no-one thinks much of my register space theory then?
Do you compile the CPU code with SSE2? This can make a world of difference (even in sequential apps, at least if you’re running on some Intel processors).
1000x performance occurs only when you forget to call cudaThreadSynchronize() (or some other blocking cuda* call) after your kernel launch…
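In other words, something like this (a sketch using CUDA events; mcKernel and the launch configuration are placeholders) – a kernel launch returns immediately, so if you stop a host timer before synchronizing you only measure the launch overhead:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
mcKernel<<<grid, block>>>(d_randoms, d_payoffs, numPaths, numSteps);  // asynchronous launch
cudaEventRecord(stop, 0);

// Without this (or another blocking cuda* call) the host-side timing is meaningless.
cudaThreadSynchronize();

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds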
The only time I saw more than 1000x was for our binomial European option pricing running on 4 Teslas, compared to /O2-optimized CPU code (otherwise un-optimized for cache and registers) running on a single core of an i7.
I’ve had 1550x compared to Matlab code written by someone else (that code was non-optimal in some places because of memory issues at the time it was written).
There was a presentation at Supercomputing a few years ago showing something like an 800,000x speedup of a simple matrix multiplication routine using hand-tuned assembly compared to an equivalent version written for clarity in Java, both running on the same processor. The point was to demonstrate that there are at least two factors that influence speedup:
The performance limitations of the hardware being used.
How close each implementation actually gets to the limitations imposed by hardware.
If the maximum performance of a GPU is 1 TFLOP/s and the maximum performance of a CPU is 100 GFLOP/s, then any speedup greater than 10x is at least partly due to an inefficiency in your CPU implementation. This says nothing about how easy it is to write an application that gets close to the theoretical peak performance of a particular processor, which in general is much harder to quantify.
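As a rough back-of-envelope for the hardware in this thread (approximate peak figures, so treat the numbers as ballpark): a Tesla C1060 peaks at roughly 933 GFLOP/s single precision and ~78 GFLOP/s double precision, while one core of a 2.5 GHz Xeon with SSE is on the order of 2.5 GHz × 8 single-precision flops/cycle ≈ 20 GFLOP/s. That bounds a fair single-precision, single-core comparison at somewhere around 933 / 20 ≈ 45-50x; a measured 1300x would imply the sequential code is reaching only a few percent of what the CPU can do, or that the timing is off.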
I have been getting my hands dirty writing tuned Intel code… That’s when I realized the importance of register blocking, cache blocking, keeping functional units occupied, avoiding core stalls, etc…
Interestingly, if you look at CUDA, register blocking and cache blocking are a NATURAL CONSEQUENCE of the programming model – which helps to utilize the underlying hardware to the maximum. However, the same is NOT the case on Intel.
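As a concrete illustration (a generic shared-memory tiling sketch, not code from this thread): the per-thread accumulator is the register blocking, and staging tiles through __shared__ memory is the cache blocking – both fall out of writing the kernel the obvious CUDA way.

#define TILE 16

// C = A * B for square n x n matrices; n is assumed to be a multiple of TILE.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // "cache blocking": tile staged in shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // "register blocking": running sum lives in a register

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}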
Intel does NOT release a single piece of software for free… not even the profiling tools… Grr… For example, Intel provides hardware counters to measure L1 cache misses and hits, core stalling, instruction decode efficiency, the efficiency of various stages of the instruction life-cycle… However, reading these requires OS-level privileges (a driver), and all the Intel tools are commercial… they don’t even provide a basic library/driver to help tune your code. How sad!
And they are going gaga about this multi-core thing… Intel should work on creating developer awareness of how to get the most out of their own single core. I raised this point at an Intel conference yesterday… I will certainly write to them and in their forums as well… Intel is undermining its own position with its commercial attitude.
The theoretical peak GFLOP/s difference between your CPU and the GPU is not a factor of 1300.
It is likely that
A. your CPU code is not utilizing the hardware it’s running on, and/or
B. your timings are wrong; as tmurray pointed out, forgetting cudaThreadSynchronize() is a classic.
This reminds me of when I was asked to port genetics code that took 24 hours on the CPU over to the GPU. After months of restructuring the CPU code, it turned out that it by itself could be optimized to run 6000x faster. This was by far a greater gain than if we had tried to put a poor implementation on the GPU…
It is also very easy to “cheat” with computer arithmetic. For instance, the C standard libraries typically don’t offer single-precision transcendentals, but map them to double-precision functions, which are much slower but also much more accurate.
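For example (a made-up fragment, not the poster’s code): if the CPU reference is supposed to be single precision, the C99 float variants have to be called explicitly, otherwise the work silently happens in double:

#include <math.h>

// expf is the C99 single-precision version of exp; calling plain exp here
// would promote the float arguments to double and do the slower
// double-precision computation.
float gbm_step(float s, float drift, float vol_sqrt_dt, float z)
{
    return s * expf(drift + vol_sqrt_dt * z);   // hypothetical path update
}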
Also, single-precision denormals get flushed to zero on the GPU, but require hundreds of extra cycles on the CPU to ensure IEEE-754 compliance… Even using SP division and square root on the GPU can be considered cheating, as they are not IEEE-compliant on compute capability < 2.0 GPUs. If you want the CPU to behave like the GPU in this respect, you can enable flush-to-zero and denormals-are-zero on the CPU side:
// Compile with SSE3 enabled (e.g. -msse3 with gcc)
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
...
// Treat denormal results and inputs as zero, like the GPU does
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
But I keep thinking that enabling these flags essentially equates to saying: “I prefer my program to be fast and produce wrong answers rather than slow and (maybe) correct.” ;)