Need explanation

I am seeing strange behavior in my CUDA code.
I have a matrix with N rows and M columns.
I’d like to sort each column independently.
So I launch a horizontal (1D) grid with one thread per column.
Each thread uses a comb sort algorithm.

My problem is that when I increase the dimension M, the time spent sorting the matrix changes in a strange way. Have a look at the attached figure. When M is a multiple of 8 or 16, the computation is twice as fast!

Thanks for your help.

It seems to have to do with alignment: not all memory access patterns are created equal. When the row length of your matrix is a multiple of 8, each successive row is aligned more cleanly in memory, giving threads faster read access, and probably write access too. Could that explain it?

What’s the structure of your arrays like?

Of the sample projects, at least convolutionSeparable tries to align things on 16-byte boundaries, IIRC.

Run your code through the visual profiler and check the number of coherent (coalesced) reads and writes versus the incoherent (uncoalesced) ones. I would guess that when your matrix width is a multiple of 8 or 16, you get coalesced reads, which boosts your memory performance. You should be able to pad the thread block or the matrix memory (see cudaMallocPitch) so that you always achieve coalescing.
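
To make the padding idea concrete, here is a minimal host-side sketch of the arithmetic. It assumes a row-major matrix of floats and an alignment of 16 elements; both are my assumptions, since the actual layout has not been posted. The names padded_width, flat_index and pad_matrix are hypothetical helpers, not CUDA API. On the device, cudaMallocPitch chooses a suitable pitch for you (it returns the pitch in bytes); this only shows what the padding does to the indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round the row width up to the next multiple of `align` elements.
// align = 16 is an assumption (16 floats = 64 bytes); the right value
// depends on the hardware.
std::size_t padded_width(std::size_t cols, std::size_t align = 16) {
    return (cols + align - 1) / align * align;
}

// Flat index of element (row, col) in a row-major matrix whose
// consecutive rows start `pitch` elements apart.
std::size_t flat_index(std::size_t row, std::size_t col, std::size_t pitch) {
    return row * pitch + col;
}

// Copy a tightly packed rows x cols matrix into a zero-filled padded
// buffer, so that every row starts at a multiple of the alignment.
std::vector<float> pad_matrix(const std::vector<float>& src,
                              std::size_t rows, std::size_t cols,
                              std::size_t pitch) {
    assert(src.size() == rows * cols);
    assert(pitch >= cols);
    std::vector<float> dst(rows * pitch, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[flat_index(r, c, pitch)] = src[r * cols + c];
        }
    }
    return dst;
}
```

With this layout, the thread handling column c reads element r at index r * pitch + c, so neighbouring threads touch neighbouring addresses and every row starts on an aligned boundary whatever M is. The kernel then has to index with the pitch instead of M and ignore the padding columns.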

Following your advice, these are the results I now obtain:

x-axis: number of points
y-axis: time
red: old version
blue: new version

As you can see, I have fixed the problem!
Thank you all.