Sorting performance peculiarity

I’ve been doing some testing on the feasibility of launching CUDA algorithms through JNI. I am primarily interested in how the JNI overhead and the transfer of large data sets between native methods and Java code affect overall performance, and whether the benefits diminish. In one of the tests I compared the performance of a parallel sort implementation by Alan Kaatz (Source) against a fast Java quicksort implementation. The GPU benchmark numbers were not consistent and exhibited sharp spikes of performance degradation. Here are the graphs:

The GPU is an NVIDIA 8600M GT and the CPU an Intel Core 2 Duo at 2.2 GHz. The performance spikes occurred between the following element counts:
30,000 → 40,000
60,000 → 70,000
130,000 → 140,000
260,000 → 270,000
520,000 → 530,000
1,040,000 → 1,050,000
2,090,000 → 2,100,000

The data set is simply an array of floats. My question is: could this have anything to do with the intricacies of the actual algorithm implementation, or with the hardware and the way the data set size maps onto the memory grid? I am still a beginner when it comes to CUDA, so any tips or insights would be very welcome. This is part of a university project, so I would like to be able to explain the behaviour :) Thank you in advance!

I have run finer-grained iterations and found that every drop in performance occurs when the number of elements is exactly (2^n)+1. At exactly 2^n elements the sort is also fastest relative to the CPU. I also noticed that the sort always pads the data set with negative infinities, so that the number of elements allocated on the device is always a power of two. This explains the drop in performance: at (2^n)+1 elements, the number of elements actually being sorted is 2^(n+1).
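The padding effect is easy to reproduce on the host side. The sketch below is illustrative only (the helper names `nextPow2` and `pad` are mine, not from Kaatz's sort), but it shows how a single extra element past 2^n doubles the amount of data the device actually sorts:

```java
// Sketch: padding a float array to the next power of two, as the GPU sort does.
public class PaddingDemo {
    // Smallest power of two >= n (assumes n >= 1).
    static int nextPow2(int n) {
        int p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    static float[] pad(float[] data) {
        int padded = nextPow2(data.length);
        float[] out = new float[padded];
        System.arraycopy(data, 0, out, 0, data.length);
        // Filler is negative infinity, so it sorts to the front
        // and can be discarded after the device finishes.
        java.util.Arrays.fill(out, data.length, padded, Float.NEGATIVE_INFINITY);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pad(new float[1 << 15]).length);       // 32768 stays 32768
        System.out.println(pad(new float[(1 << 15) + 1]).length); // 32769 becomes 65536
    }
}
```

Note that 2^15 = 32,768 falls inside the first spike range (30,000 → 40,000), 2^16 = 65,536 inside the second, and so on: each measured degradation range brackets a power of two.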

Would I be correct in saying that memory can only be allocated in blocks whose sizes are powers of two, and that kernels can only operate on blocks in their entirety? Thanks!

No, that is not correct in the general case. It is more likely a specific algorithmic choice in the sort you are using that requires a power-of-two data set. CUDA can certainly work on data sets of any size, with linear performance scaling once there are enough concurrent blocks to fully occupy all multiprocessors (for smaller data sets the scaling is stair-stepped). You can see an example of linear scaling of a kernel in the paper here:
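To illustrate why no power-of-two size is needed: the standard CUDA idiom launches ceil(n / blockDim) blocks and guards each thread with a bounds check, so the last partial block simply skips its out-of-range threads. Here is a CPU sketch of that indexing in Java (names are illustrative; a real kernel would use `blockIdx.x * blockDim.x + threadIdx.x`):

```java
// Sketch: the ceil-division launch + bounds-check pattern a CUDA kernel uses
// to cover an arbitrary element count without any padding.
public class GridDemo {
    // Returns how many elements get processed; correct code touches each exactly once.
    static int touchAll(int n, int blockDim) {
        int gridDim = (n + blockDim - 1) / blockDim; // ceil(n / blockDim)
        int touched = 0;
        for (int block = 0; block < gridDim; block++) {
            for (int thread = 0; thread < blockDim; thread++) {
                int i = block * blockDim + thread;   // global index
                if (i < n) touched++;                // the per-thread bounds check
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        System.out.println(touchAll(32769, 256)); // prints 32769: no padding required
    }
}
```

A bitonic-style sort cannot use this trick directly because its compare-exchange network is defined for power-of-two sizes, which is presumably why the implementation pads instead.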