I wonder if anyone might help me to understand my speed-ups.
I have a simple Monte-Carlo type simulation for portfolio valuation. It’s quite similar to the Monte-Carlo option pricer in the SDK. I have two CUDA implementations, one in double and one in single precision. The double precision CUDA implementation achieves 70-80x speed-up over my equivalent sequential C++ implementation. That’s pretty good and about what I’d expect.
However, the single precision CUDA implementation achieves as much as a 1300x speed-up over the sequential impl’… That’s really, really fast! But I can’t quite figure out how it’s possible…
A few details follow:
I pre-generate my randoms and store them in global memory. That might sound dumb, but my own investigation has shown that in this case I get better performance that way. Besides, my accesses are perfectly coalesced, so that should hide most of the latency.
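For what it’s worth, here is a minimal sketch of the layout I mean (mcKernel, the names and the trivial “payoff” are made up for illustration, not my actual code): the randoms for step s of all paths are stored contiguously, so neighbouring threads always read neighbouring addresses.

__global__ void mcKernel(const float *randoms, float *payoffs,
                         int numPaths, int numSteps)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int path = tid; path < numPaths; path += stride) {
        float value = 0.0f;
        for (int step = 0; step < numSteps; ++step) {
            // Element (step * numPaths + path): for a fixed step, consecutive
            // threads read consecutive addresses, so the load is coalesced.
            float z = randoms[step * numPaths + path];
            value += z;   // stand-in for the real path update / payoff
        }
        payoffs[path] = value;
    }
}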
Using the --ptxas-options=-v compile flag I note that the single precision impl’ uses just 22 registers and the double precision impl’ uses exactly twice that. Could it be that the (single precision) kernel is executing entirely in the register file? Hence the awesome performance…
I’m running on a single Tesla C1060. In all cases only a single application is using the GPU.
I’ve been trying to figure this one out for a while so if anybody has any ideas how this kind of performance is possible I’d be very interested to hear…
PS. Yes! The single precision impl’ IS actually doing the work, I’ve checked that…
I wouldn’t be surprised if your speedups were 10 times smaller than what you’ve quoted. As it is, I just don’t believe it. If it is real, then you either have a really badly written sequential algorithm or a really slow CPU.
I’d carefully check that your timing is being done correctly. What exactly are you timing?
The speed-ups I quote are based on total execution time. I also break down the times further and in all cases, sequential and CUDA, 99% is spent in the Monte-Carlo routine.
@Tigga, I haven’t looked at the sequential impl’ in detail for a long time, but it’s such a simple app I’d be surprised if there’s a problem there. I’ll take another look when I’m done writing this post. My CPUs are Xeons clocked at 2.5GHz, so they should perform OK.
I’m assuming no-one thinks much of my register space theory then?
Do you compile the CPU code with SSE2? This can make a world of difference (even in sequential apps, at least if you’re running on some Intel processors).
1000x performance occurs only when you forget to call cudaThreadSynchronize() (or some other blocking cuda* call) after your kernel launch…
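In other words, something like this (a sketch using CUDA events; mcKernel and the launch configuration are placeholders) – a kernel launch returns immediately, so if you stop a host timer before synchronizing you only measure the launch overhead:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
mcKernel<<<grid, block>>>(d_randoms, d_payoffs, numPaths, numSteps);  // asynchronous launch
cudaEventRecord(stop, 0);

// Without this (or another blocking cuda* call) the host-side timing is meaningless.
cudaThreadSynchronize();

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds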
The only time I saw more than 1000x was for our binomial European option pricing running on 4 Teslas, compared to /O2-optimized CPU code (otherwise un-optimized for cache and registers) running on a single core of an i7.
I’ve had 1550x compared to Matlab code written by someone else (that code was non-optimal in some places because of memory issues at the time it was written).
There was a presentation at Supercomputing a few years ago showing something like an 800,000x speedup of a simple matrix multiplication routine using hand-tuned assembly compared to an equivalent version written for clarity in Java, both running on the same processor. The point was to demonstrate that there are at least two factors that influence speedup:
The performance limitations of the hardware being used.
How close each implementation actually gets to the limitations imposed by hardware.
If the maximum performance of a GPU is 1 TFLOP/s and the maximum performance of a CPU is 100 GFLOP/s, then any speedup greater than 10x is at least partly due to an inefficiency in your CPU implementation. This says nothing about how easy it is to write an application that gets close to the theoretical peak performance of a particular processor, which in general is much harder to quantify.
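As a rough back-of-envelope for the hardware in this thread (approximate peak figures, so treat the numbers as ballpark): a Tesla C1060 peaks at roughly 933 GFLOP/s single precision and ~78 GFLOP/s double precision, while one core of a 2.5 GHz Xeon with SSE is on the order of 2.5 GHz × 8 single-precision flops/cycle ≈ 20 GFLOP/s. That bounds a fair single-precision, single-core comparison at somewhere around 933 / 20 ≈ 45-50x; a measured 1300x would imply the sequential code is reaching only a few percent of what the CPU can do, or that the timing is off.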
I have been getting my hands dirty writing tuned Intel code… That’s when I realized the importance of register blocking, cache blocking, keeping functional units occupied, avoiding core stalls, etc…
Interestingly, if you look at CUDA, register blocking and cache blocking are a NATURAL CONSEQUENCE of the programming model – which helps to utilize the underlying hardware to the maximum. However, the same is NOT the case on Intel.
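As a concrete illustration (a generic shared-memory tiling sketch, not code from this thread): the per-thread accumulator is the register blocking, and staging tiles through __shared__ memory is the cache blocking – both fall out of writing the kernel the obvious CUDA way.

#define TILE 16

// C = A * B for square n x n matrices; n is assumed to be a multiple of TILE.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // "cache blocking": tile staged in shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // "register blocking": running sum lives in a register

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}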
Intel does NOT release a single piece of software for free… not even the profiling tools… Grr… For example, Intel provides hardware counters to measure L1 cache misses and hits, core stalling, instruction decode efficiency, the efficiency of various stages of the instruction life-cycle… However, reading these requires OS-level privileges (a driver), and all the Intel tools are commercial… they don’t even provide a basic library/driver to help tune your code. How sad!
And they are going gaga about this multi-core thing… Intel should work on creating developer awareness of how to get the most out of their own single core. I raised this point at an Intel conference yesterday… I will certainly write to them and in their forums as well… Intel is undermining its own position with its commercial attitude.
The theoretical peak GFLOP/s difference between your CPU and the GPU is not a factor of 1300.
It is likely that
A. your CPU code is not utilizing the hardware it’s running on, and/or
B. your timings are wrong; as tmurray pointed out, forgetting cudaThreadSynchronize() is a classic.
This reminds me of when I was asked to port genetics code that took 24 hours on the CPU over to the GPU. After months of restructuring the CPU code, it turned out that it by itself could be optimized to run 6000x faster. This was by far a greater gain than if we had tried to put a poor implementation on the GPU…
It is also very easy to “cheat” with computer arithmetic. For instance, the C standard libraries typically don’t offer single-precision transcendentals, but map them to double-precision functions, which are much slower but also much more accurate.
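For example (a made-up fragment, not the poster’s code): if the CPU reference is supposed to be single precision, the C99 float variants have to be called explicitly, otherwise the work silently happens in double:

#include <math.h>

// expf is the C99 single-precision version of exp; calling plain exp here
// would promote the float arguments to double and do the slower
// double-precision computation.
float gbm_step(float s, float drift, float vol_sqrt_dt, float z)
{
    return s * expf(drift + vol_sqrt_dt * z);   // hypothetical path update
}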
Also, single-precision denormals get flushed to zero on the GPU, but require hundreds of extra cycles on the CPU to ensure IEEE-754 compliance… Even using SP division and square root on the GPU can be considered cheating, as they are not IEEE-compliant on compute capability < 2.0 GPUs. If you want the CPU to behave like the GPU in this respect, you can enable flush-to-zero and denormals-are-zero on the CPU side:
// Compile with SSE3 enabled (e.g. -msse3 with gcc)
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
...
// Treat denormal results and inputs as zero, like the GPU does
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
But I keep thinking that enabling these flags essentially equates to saying: “I prefer my program to be fast and produce wrong answers rather than slow and (maybe) correct.” ;)