From low-end GPUs to high-end GPUs: moving from a 9600GT to a Tesla T10 provides no improvement. Why?

Yes, I also do not understand what this scaling is good for. I don’t know if there is a difference between the Linux and Windows versions, but so far I’ve only looked at the tabulated absolute counter readings when profiling my own code.

If the code is 55% compute-bound on the 9600GT, it would be memory-bound on the Tesla. That might actually come close to the numbers you are seeing.
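To make that reasoning explicit (my own back-of-envelope sketch, nothing measured): a kernel whose arithmetic intensity (flops per byte) lies between the two cards’ peak-compute-to-bandwidth ratios will be compute-bound on the 9600GT but memory-bound on the Tesla. Something like the following estimate from the device properties shows the two ratios; the 8 cores × 2 flops/cycle per SM figure is an assumption for compute capability 1.x devices, and the memoryClockRate/memoryBusWidth fields require a reasonably recent CUDA runtime:

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Rough roofline-style estimate: peak GFLOP/s vs. peak GB/s per device.
// Assumes compute capability 1.x: 8 SP cores per SM, 2 flops/cycle (MAD).
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);

        double gflops = p.multiProcessorCount * 8 * 2 * (p.clockRate * 1e3) / 1e9;
        // memoryClockRate is in kHz; GDDR is double data rate; bus width is in bits.
        double gbps   = 2.0 * (p.memoryClockRate * 1e3) * (p.memoryBusWidth / 8) / 1e9;

        printf("%s: ~%.0f GFLOP/s, ~%.0f GB/s, ratio %.1f flops per byte\n",
               p.name, gflops, gbps, gflops / gbps);
    }
    return 0;
}
[/code]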

So accesses within a warp would be to consecutive elements, but aligned only if [font=“Courier New”]kernel_data.stride[0][/font] were a multiple of 16. This apparently is the case, as otherwise the reads would not be coalesced on the 9600GT. Is this reasoning correct?
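For completeness, this is the access pattern I have in mind (an illustrative kernel I wrote for this post, not the actual code from this thread):

[code]
// Illustrative kernel: each thread handles one element, so threads with
// consecutive threadIdx.x touch consecutive ints of one row.
// On compute capability 1.x the 16 loads of a half-warp coalesce into a
// single 64-byte transaction only if the row start is 64-byte aligned,
// i.e. the base pointer is aligned and stride is a multiple of 16 ints.
__global__ void process_rows(const int *data, int *out, int stride, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive within a half-warp
    int y = blockIdx.y;                              // one row per block in y
    if (x < width)
        out[y * stride + x] = data[y * stride + x] + 1;
}
[/code]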

Non-aligned accesses would probably explain some of the discrepancy, as obviously they need different treatment in read and write accesses (this might also explain the asymmetry in the performance counter definitions).

By the way, what type (size) is [font=“Courier New”]kernel_data.data[0][/font]?

Could you explain your reasoning behind this?

I’m using cudatemplate’s Cuda::DeviceMemoryPitched3D of type int (that is, 32 bits, as indicated in the CUDA programming guide). To my understanding (and from speed tests), pitched memory takes care of aligning the memory.
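For what it’s worth, here is a plain CUDA runtime sketch of what I assume the cudatemplate class does under the hood (my guess at the mechanism, not the library’s actual code): each row gets padded so that the returned pitch meets the device’s alignment requirement.

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // 3D volume of int: width x height x depth (element counts are arbitrary here).
    const size_t width = 100, height = 64, depth = 32;

    cudaPitchedPtr vol;
    cudaExtent extent = make_cudaExtent(width * sizeof(int), height, depth);
    cudaMalloc3D(&vol, extent);   // rows are padded; vol.pitch is in bytes

    printf("requested row size: %zu bytes, pitch: %zu bytes (stride of %zu ints)\n",
           width * sizeof(int), vol.pitch, vol.pitch / sizeof(int));

    cudaFree(vol.ptr);
    return 0;
}
[/code]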

Never mind. Rereading your initial post, I see that I misunderstood it: 55% is the time spent in the compute-bound kernel, not the amount of computation in that kernel.

That makes me wonder, however, whether the whole program is dominated by other delays such as PCIe transfer times. I find the “GPU Time Width Plot” with timestamps enabled quite useful for assessing that. Comparing those plots for both devices might give an idea of which part of the program falls short of the expected speedup.
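Outside the profiler, a simple event-based breakdown tells a similar story. A sketch along these lines (placeholder kernel and sizes, not your actual program) separates the H2D copy, the kernel, and the D2H copy, so you can see which part refuses to scale between the two cards:

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel; the point is the event-based time breakdown.
__global__ void my_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

int main()
{
    const int N = 1 << 22;                       // placeholder data volume
    int *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void**)&h_in,  N * sizeof(int));
    cudaMallocHost((void**)&h_out, N * sizeof(int));
    cudaMalloc((void**)&d_in,  N * sizeof(int));
    cudaMalloc((void**)&d_out, N * sizeof(int));

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    my_kernel<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaEventRecord(t2);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2d, kern, d2h;
    cudaEventElapsedTime(&h2d,  t0, t1);         // milliseconds
    cudaEventElapsedTime(&kern, t1, t2);
    cudaEventElapsedTime(&d2h,  t2, t3);
    printf("H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n", h2d, kern, d2h);
    return 0;
}
[/code]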

Thank you all for your participation in this thread. With this short reply I will try to summarise the “final answer” to my two original questions.

Regarding the weird profiler output between the two GPUs, this was explained by the following:

    The per-function profiler plots provide counts, not timings.

    The counts are scaled depending on the kind of memory access they correspond to, so counts for memory accesses and for instructions cannot be compared directly. To my eyes this is a (documented) bug in the profiler.

I finally noticed that the performance does increase, but only when I use a larger data volume. When doubling the data volume, the Tesla GPU provides a 2x performance boost over the 9600GT (and an 80x boost compared to a single CPU core).

So the lesson was: when comparing speed between GPUs, test with a “large data volume”.

Probably at low data volumes my code does not fully occupy the resources of the GPU, or some kind of latency dominates the computation time. The fact is that I mainly care about the large-volume case, and I was only testing on small volumes to iterate the tests faster. Now I know that this was a mistake.
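In hindsight, a simple sanity check would have shown this (a sketch with hypothetical launch numbers, not my actual code): compare the number of blocks launched against a rough estimate of how many blocks are needed to keep all SMs busy.

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    // Hypothetical launch configuration for a small test volume.
    const int threads_per_block = 256;
    const int n_elements = 64 * 64 * 16;                      // small volume
    const int blocks = (n_elements + threads_per_block - 1) / threads_per_block;

    // Very rough "enough blocks?" check; assumes ~4 resident blocks per SM.
    const int blocks_to_fill = p.multiProcessorCount * 4;
    printf("%s: launching %d blocks, ~%d needed to fill %d SMs\n",
           p.name, blocks, blocks_to_fill, p.multiProcessorCount);
    if (blocks < blocks_to_fill)
        printf("grid too small: part of the GPU will sit idle\n");
    return 0;
}
[/code]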

Again, thank you all for the support.