I’ve been doing some testing on the feasibility of launching CUDA algorithms through JNI. I am primarily interested in how the JNI overhead and the handling of large data sets between native methods and Java code affect overall performance, and whether the benefits diminish. In one of the tests I compared a parallel sort implementation by Alan Kaatz (Source) against a fast Java quicksort implementation. The resulting benchmark numbers for the GPU were not consistent and showed sharp spikes of performance degradation. Here are the graphs:
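For context, the JNI bridge looks roughly like this (a minimal sketch; the GpuSort class and library names are illustrative placeholders, not my exact code):

    // Java side of the JNI bridge to the CUDA sort.
    public class GpuSort {
        static {
            // Loads the native library that wraps the CUDA kernel launch.
            System.loadLibrary("gpusort");
        }

        // Implemented in C/CUDA; copies the array to the device,
        // runs the parallel sort, and copies the result back.
        public static native void sort(float[] data);
    }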
The GPU is an NVIDIA GeForce 8600M GT and the CPU is an Intel Core 2 Duo at 2.2 GHz. The performance spikes occurred within the following ranges of element counts:
30,000 → 40,000
60,000 → 70,000
130,000 → 140,000
260,000 → 270,000
520,000 → 530,000
1,040,000 → 1,050,000
2,090,000 → 2,100,000
etc.
The data set is simply an array of floats, and the measurement method is sketched below. My question is: could this behaviour be caused by the intricacies of the actual algorithm implementation, or by the hardware and the way the data set size maps onto the grid of thread blocks and memory? I am still a beginner when it comes to CUDA, so any tips or insights would be very welcome. This is part of a university project, so I would like to be able to explain the behaviour :) Thank you in advance!
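The timing loop is along these lines (a simplified sketch; warm-up runs and averaging are omitted, and java.util.Arrays.sort stands in for the actual quicksort implementation I used):

    // Benchmark one data-set size n; assumes the GpuSort class sketched above.
    int n = 1000000;
    float[] data = new float[n];
    java.util.Random rnd = new java.util.Random(42);
    for (int i = 0; i < n; i++) data[i] = rnd.nextFloat();

    // Separate copies so both sorts see the same unsorted input.
    float[] gpuData = data.clone();
    float[] cpuData = data.clone();

    long t0 = System.nanoTime();
    GpuSort.sort(gpuData);          // includes JNI overhead, host<->device copies, kernel
    long gpuNanos = System.nanoTime() - t0;

    long t1 = System.nanoTime();
    java.util.Arrays.sort(cpuData); // CPU reference sort
    long cpuNanos = System.nanoTime() - t1;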