Weird performance case

Hi,
I’m trying to improve an algorithm and I came across a weird behaviour that I don’t understand.
I have a matrix, and the algorithm I’m working on transforms a 2x2 piece of it into another 2x2 piece.
I implemented 2 versions. In the first one, every thread handles a single 2x2 piece, while in the second every thread handles 8 consecutive 2x2 pieces. In this second implementation I calculate the addresses of the first 2x2 piece and then simply increment the pointers for the following 7 pieces. So I expected this second implementation to be faster, or at least the same. Instead it is 10-20 times slower.
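In sketch form, the two versions look roughly like this (the names, the transform itself, and the layout are placeholders, not my real code; pitch is the matrix row length in elements):

```
// Placeholder transform applied to one 2x2 piece starting at p.
__device__ void transform2x2(float* p, int pitch)
{
    p[0]         += 1.0f;
    p[1]         += 1.0f;
    p[pitch]     += 1.0f;
    p[pitch + 1] += 1.0f;
}

// Version 1: every thread handles a single 2x2 piece.
__global__ void version1(float* mat, int pitch, int piecesPerRow)
{
    int piece = blockIdx.x * blockDim.x + threadIdx.x;
    int row = piece / piecesPerRow;
    int col = piece % piecesPerRow;
    transform2x2(mat + (2 * row) * pitch + 2 * col, pitch);
}

// Version 2: every thread handles 8 consecutive 2x2 pieces; the address of the
// first piece is computed once and the pointer is then incremented
// (this sketch assumes the 8 pieces stay within the same row of pieces).
__global__ void version2(float* mat, int pitch, int piecesPerRow)
{
    int firstPiece = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    int row = firstPiece / piecesPerRow;
    int col = firstPiece % piecesPerRow;
    float* p = mat + (2 * row) * pitch + 2 * col;
    for (int i = 0; i < 8; ++i) {
        transform2x2(p, pitch);
        p += 2;   // move to the next 2x2 piece to the right
    }
}
```
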
I don’t understand why it is so slow.
Can anybody give me some clues?

Paolo

Any of the GPU profilers (e.g. Nsight Compute, Nsight Systems) may shed light on it.

How are you measuring the performance? The performance difference may be the result of an invalid measurement, such as the first iteration being eliminated by dead code removal.
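One quick sanity check, independent of the test framework, is to time the kernels directly with CUDA events, with an untimed warm-up launch so that one-time startup costs don’t end up in the measurement. Something like this (the kernel and its launch configuration are placeholders, not your code):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for either of the two versions.
__global__ void myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // Untimed warm-up launch, so one-time costs are excluded.
    myKernel<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 20;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        myKernel<<<grid, block>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel time: %f ms\n", ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```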

Using the gtest framework, I made 2 tests, one for each implementation. Then I used Nsight to run the tests 20 times.
I also took timings in my real application and got similar results.

Thanks for the suggestion, but I don’t understand how Nsight can help me. I mean, I’m not an expert with these tools, but the only interesting information about this problem I got is the “CUDA GPU Kernel/Grid/Block Summary”.
Did you have something else in mind to look at?

Since you have provided no code whatsoever, the suggestion I made is very general. When I want to learn about a code from the profiler(s), I will generally start with Nsight Systems, and look at the following things in the timeline:

  1. The location of all cudaMemcpy operations.
  2. The location and duration of all kernel calls.
  3. Any conspicuous gaps in the profiler timeline for these activities.

Given that you are comparing two codes/versions, I would compare the timelines in both cases. Given that you have said one is 10-20 times slower, I would expect that it should be fairly easy/quick to spot what is different. Are the cudaMemcpy operations in roughly the same sequence and roughly the same duration? If so, are kernels in the same location and of the same duration? Are the gaps the same?

If the kernels are of different duration (for example), then you might switch to using Nsight Compute at that point. Let’s also say that the code structure at a high level is the same - there is a 1:1 correspondence between kernels. Then I would profile the kernel that got longer, in both the fast case and the slow case, set one as a baseline in Nsight Compute, and use that to see where Nsight Compute thinks the major differences are in its standard reports. From those major differences, I would then start to draw some hypotheses about what has changed.

And that would drive further efforts. I won’t be able to spell out a complete process for you, because it depends very much on the data you find along the way. But you may get some ideas e.g. from this blog series. I consider most of what I have described here to be “standard profiling” or “standard code analysis” using the profiler. With no code and no description other than what you have provided, I won’t be able to be any more specific.

You’re also welcome to ask profiler questions on the respective forums for the profilers. Good luck!

Thanks for the useful answer.
Even if it doesn’t resolve my specific problem, you’ve given me a lot of suggestions that I can also use in other cases.

It could be that in that version the global memory accesses are no longer coalesced.
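To illustrate the idea with a stripped-down 1D analogue (not your code): in the one-piece-per-thread version, adjacent threads in a warp touch adjacent addresses, so the accesses coalesce into a few memory transactions; in the 8-pieces-per-thread version, adjacent threads start 8 pieces apart, so each access of the warp is scattered across many more transactions.

```
// Coalesced pattern: thread i touches element i, so a warp's 32 accesses
// are contiguous in memory.
__global__ void onePerThread(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Strided pattern: thread i works on its own contiguous chunk of 8 elements,
// so on each loop iteration the addresses of neighbouring threads are
// 8 elements apart and the warp's accesses no longer coalesce well.
__global__ void eightPerThread(const float* in, float* out, int n)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    for (int k = 0; k < 8; ++k) {
        int i = first + k;
        if (i < n) out[i] = in[i] * 2.0f;
    }
}
```

One common way to keep the per-thread work while staying coalesced is to change the stride: have thread i process pieces i, i + totalThreads, i + 2*totalThreads, and so on (a grid-stride style loop), so that at each step the warp still touches consecutive addresses.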