Hi everybody,
I’ve got two NVIDIA GPUs:
- GeForce 8600M GT (4 multiprocessors, 0.75GHz clock rate)
- GeForce 8800 GT (14 multiprocessors, 1.5GHz clock rate)
With these specs I'd expect a kernel with enough thread blocks to run about 7 times faster on the 8800 GT than on the 8600M GT (twice the clock rate times 3.5 times as many multiprocessors).
My tests show that it "only" runs about 5 times faster.
I'm measuring the raw execution time of the kernel, excluding any memcpy() operations.
Any guesses why?
Thanks a lot!
Simon
Depending on the ratio of memory reads/writes to arithmetic instructions in your kernel, you might be partially memory-bandwidth bound. What's the ratio of memory bandwidths for the two GPUs? Your speedup will generally fall somewhere between this bandwidth ratio and the floating-point performance ratio you already calculated.
Well, according to Wikipedia, the memory bandwidth is 57.6 GB/s (256-bit bus width) for the 8800 GT and 22.4 GB/s (128-bit bus width) for the 8600M GT, which is a ratio of roughly 2.6.
The kernel does indeed have a lot of memory operations compared to arithmetic instructions.
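For anyone who wants to check the two bounds, here is a minimal sketch of the arithmetic in Python. It only uses the figures quoted in this thread; the dictionary layout and the `ratio` helper are made up for illustration, and the "bounds" are rough estimates that ignore caching, occupancy and similar effects.

```python
# Figures quoted in this thread (clock in GHz, bandwidth in GB/s).
gpus = {
    "8600M GT": {"multiprocessors": 4, "clock_ghz": 0.75, "bandwidth_gbs": 22.4},
    "8800 GT": {"multiprocessors": 14, "clock_ghz": 1.5, "bandwidth_gbs": 57.6},
}


def ratio(key):
    """Ratio of the 8800 GT figure to the 8600M GT figure (hypothetical helper)."""
    return gpus["8800 GT"][key] / gpus["8600M GT"][key]


# Purely compute-bound estimate: multiprocessor ratio times clock ratio.
compute_ratio = ratio("multiprocessors") * ratio("clock_ghz")
# Purely bandwidth-bound estimate: ratio of the memory bandwidths.
bandwidth_ratio = ratio("bandwidth_gbs")
# Speedup reported by the original poster.
measured_speedup = 5.0

print(f"compute-bound estimate:   {compute_ratio:.2f}x")
print(f"bandwidth-bound estimate: {bandwidth_ratio:.2f}x")
print("measured speedup within bounds:",
      bandwidth_ratio <= measured_speedup <= compute_ratio)
```

The measured factor of about 5 sits between the two estimates, which is what you would expect from a kernel that is partly limited by memory bandwidth.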
Thanks! That makes sense!
Also, if the number of blocks in your kernel launches is not much larger than 14, forget about that theoretical speedup: the 8800 GT needs enough blocks to keep all 14 of its multiprocessors busy.
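A rough way to see the effect of the block count is a toy "wave" model, sketched below in Python. It assumes that each multiprocessor runs exactly one block at a time, that all blocks take equally long, and that the kernel is purely compute bound. None of that is guaranteed on real hardware (a multiprocessor can usually hold several blocks at once), and the function names are hypothetical, so treat the numbers as an illustration only.

```python
import math


def waves(blocks, multiprocessors):
    """Rounds of execution needed if each multiprocessor runs one block at a time."""
    return math.ceil(blocks / multiprocessors)


def modeled_speedup(blocks, clock_ratio=2.0, small_mps=4, large_mps=14):
    """Toy speedup of the 8800 GT over the 8600M GT for a given block count.

    Per-block time scales with the clock ratio; total time scales with the
    number of waves each GPU needs.
    """
    return clock_ratio * waves(blocks, small_mps) / waves(blocks, large_mps)


for blocks in (4, 8, 16, 28, 1400):
    print(f"{blocks:5d} blocks -> modeled speedup {modeled_speedup(blocks):.1f}x")
```

In this model, 4 blocks give only the clock ratio of 2x because ten multiprocessors on the 8800 GT sit idle, 16 blocks give 4x because the last two blocks need a second wave, and the full 7x only shows up reliably once the block count is large compared to 14.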