I am seeing strange behavior in my CUDA code.
I have a matrix with N rows and M columns.
I’d like to sort each column independently.
So I use a one-dimensional (horizontal) grid that launches one thread per column.
Each thread runs a comb sort on its own column.
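To make the setup concrete, here is a minimal sketch of what such a kernel might look like. It is not my exact code: the names `comb_sort_column`, `sort_columns_kernel` and `sort_columns` are hypothetical, and it assumes `float` data, row-major storage (element (row, col) at `data[row * M + col]`) and 128 threads per block.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Comb sort of one column, in place.
// Assumption: row-major storage, so consecutive rows of the same column
// are M floats apart in memory.
__device__ void comb_sort_column(float* data, int N, int M, int col)
{
    int gap = N;
    bool swapped = true;
    while (gap > 1 || swapped) {
        gap = (gap * 10) / 13;   // shrink factor of about 1.3 (assumes N is small enough not to overflow int)
        if (gap < 1) gap = 1;
        swapped = false;
        for (int i = 0; i + gap < N; ++i) {
            float a = data[i * M + col];
            float b = data[(i + gap) * M + col];
            if (a > b) {
                data[i * M + col] = b;
                data[(i + gap) * M + col] = a;
                swapped = true;
            }
        }
    }
}

// One thread per column; surplus threads in the last block do nothing.
__global__ void sort_columns_kernel(float* data, int N, int M)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < M) comb_sort_column(data, N, M, col);
}

// Hypothetical host wrapper: copies the matrix to the device, sorts every
// column, and copies the result back. Returns true on success.
bool sort_columns(std::vector<float>& h, int N, int M, int threads_per_block = 128)
{
    if (N <= 0 || M <= 0) return true;   // nothing to sort
    float* d = nullptr;
    std::size_t bytes = std::size_t(N) * M * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // 1D grid with enough blocks to cover all M columns
        int blocks = (M + threads_per_block - 1) / threads_per_block;
        sort_columns_kernel<<<blocks, threads_per_block>>>(d, N, M);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

With this sketch, calling `sort_columns(h, N, M)` on a host vector `h` of size N*M should leave each column sorted in ascending order.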
My problem is that when I increase M, the time spent sorting the matrix changes in a strange way. Have a look at the attached figure: when M is a multiple of 8 or 16, the computation is 2 times faster!
Thanks for your help.