From low-end GPUs to high-end GPUs: moving from a 9600GT to a Tesla T10 provides no improvement. Why?

Yes, I also do not understand what this scaling is good for. I don’t know if there is a difference between the Linux and Windows versions, but so far I’ve only looked at the tabulated absolute counter readings when profiling my own code.

If the code is 55% compute-bound on the 9600GT, it would be memory-bound on the Tesla. That might actually come close to the numbers you are seeing.
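To make that reasoning explicit (my own back-of-envelope sketch, nothing measured): a kernel whose arithmetic intensity (flops per byte) lies between the two cards’ peak-compute-to-bandwidth ratios will be compute-bound on the 9600GT but memory-bound on the Tesla. Something like the following estimate from the device properties shows the two ratios; the 8 cores × 2 flops/cycle per SM figure is an assumption for compute capability 1.x devices, and the memoryClockRate/memoryBusWidth fields require a reasonably recent CUDA runtime:

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Rough roofline-style estimate: peak GFLOP/s vs. peak GB/s per device.
// Assumes compute capability 1.x: 8 SP cores per SM, 2 flops/cycle (MAD).
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);

        double gflops = p.multiProcessorCount * 8 * 2 * (p.clockRate * 1e3) / 1e9;
        // memoryClockRate is in kHz; GDDR is double data rate; bus width is in bits.
        double gbps   = 2.0 * (p.memoryClockRate * 1e3) * (p.memoryBusWidth / 8) / 1e9;

        printf("%s: ~%.0f GFLOP/s, ~%.0f GB/s, ratio %.1f flops per byte\n",
               p.name, gflops, gbps, gflops / gbps);
    }
    return 0;
}
[/code]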

So accesses within a warp would be to consecutive elements, but aligned only if [font=“Courier New”]kernel_data.stride[0][/font] were a multiple of 16. This apparently is the case, as otherwise the reads would not be coalesced on the 9600GT. Is this reasoning correct?
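For completeness, this is the access pattern I have in mind (an illustrative kernel I wrote for this post, not the actual code from this thread):

[code]
// Illustrative kernel: each thread handles one element, so threads with
// consecutive threadIdx.x touch consecutive ints of one row.
// On compute capability 1.x the 16 loads of a half-warp coalesce into a
// single 64-byte transaction only if the row start is 64-byte aligned,
// i.e. the base pointer is aligned and stride is a multiple of 16 ints.
__global__ void process_rows(const int *data, int *out, int stride, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive within a half-warp
    int y = blockIdx.y;                              // one row per block in y
    if (x < width)
        out[y * stride + x] = data[y * stride + x] + 1;
}
[/code]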

Non-aligned accesses would probably explain some of the discrepancy, as obviously they need different treatment in read and write accesses (this might also explain the asymmetry in the performance counter definitions).

By the way, what type (size) is [font=“Courier New”]kernel_data.data[0][/font]?

Could you explain your reasoning behind this?

I’m using cudatemplate’s Cuda::DeviceMemoryPitched3D of type int (that is, 32 bits, as indicated in the CUDA programming guide). To my understanding (and from speed tests), pitched memory takes care of aligning the memory.
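For what it’s worth, here is a plain CUDA runtime sketch of what I assume the cudatemplate class does under the hood (my guess at the mechanism, not the library’s actual code): each row gets padded so that the returned pitch meets the device’s alignment requirement.

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // 3D volume of int: width x height x depth (element counts are arbitrary here).
    const size_t width = 100, height = 64, depth = 32;

    cudaPitchedPtr vol;
    cudaExtent extent = make_cudaExtent(width * sizeof(int), height, depth);
    cudaMalloc3D(&vol, extent);   // rows are padded; vol.pitch is in bytes

    printf("requested row size: %zu bytes, pitch: %zu bytes (stride of %zu ints)\n",
           width * sizeof(int), vol.pitch, vol.pitch / sizeof(int));

    cudaFree(vol.ptr);
    return 0;
}
[/code]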

Never mind. Rereading your initial post, I see that I misunderstood it: 55% is the time spent in the compute-bound kernel, not the amount of computation in that kernel.

That makes me wonder, however, whether the whole program is dominated by other delays such as PCIe transfer times. I find the “GPU Time Width Plot” with timestamps enabled quite useful for assessing that. Comparing those plots for both devices might give an idea of which part of the program falls short of the expected speedup.
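Outside the profiler, a simple event-based breakdown tells a similar story. A sketch along these lines (placeholder kernel and sizes, not your actual program) separates the H2D copy, the kernel, and the D2H copy, so you can see which part refuses to scale between the two cards:

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel; the point is the event-based time breakdown.
__global__ void my_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

int main()
{
    const int N = 1 << 22;                       // placeholder data volume
    int *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void**)&h_in,  N * sizeof(int));
    cudaMallocHost((void**)&h_out, N * sizeof(int));
    cudaMalloc((void**)&d_in,  N * sizeof(int));
    cudaMalloc((void**)&d_out, N * sizeof(int));

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    my_kernel<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaEventRecord(t2);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2d, kern, d2h;
    cudaEventElapsedTime(&h2d,  t0, t1);         // milliseconds
    cudaEventElapsedTime(&kern, t1, t2);
    cudaEventElapsedTime(&d2h,  t2, t3);
    printf("H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n", h2d, kern, d2h);
    return 0;
}
[/code]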

Thank you all for your participation in this thread. With this short reply I will try to summarise the “final answer” to my two original questions.

Regarding the weird profiler output between the two GPUs, this was explained by the following:

    The per-function profiler plots provide counts, not timings.

    The counts are scaled depending on the kind of memory access they correspond to, so counts for memory accesses and for instructions cannot be compared directly. To my eyes this is a (documented) bug in the profiler.

I finally noticed that the performance does increase, but only when I use a larger data volume. When doubling the data volume, the Tesla GPU provides a 2x performance boost over the 9600GT (and an 80x boost compared to a single CPU core).

So the lesson was: when comparing speed between GPUs, test with a “large data volume”.

Probably at low data volumes my code does not fully occupy the resources of the GPU, or some kind of latency dominates the computation time. The fact is that I mainly care about the large-volume case, and I was only testing on small volumes to iterate the tests faster. Now I know that this was a mistake.
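In hindsight, a simple sanity check would have shown this (a sketch with hypothetical launch numbers, not my actual code): compare the number of blocks launched against a rough estimate of how many blocks are needed to keep all SMs busy.

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    // Hypothetical launch configuration for a small test volume.
    const int threads_per_block = 256;
    const int n_elements = 64 * 64 * 16;                      // small volume
    const int blocks = (n_elements + threads_per_block - 1) / threads_per_block;

    // Very rough "enough blocks?" check; assumes ~4 resident blocks per SM.
    const int blocks_to_fill = p.multiProcessorCount * 4;
    printf("%s: launching %d blocks, ~%d needed to fill %d SMs\n",
           p.name, blocks, blocks_to_fill, p.multiProcessorCount);
    if (blocks < blocks_to_fill)
        printf("grid too small: part of the GPU will sit idle\n");
    return 0;
}
[/code]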

Again, thank you all for the support.