We are doing research at The Chinese University of Hong Kong on accelerating data mining algorithms using GPUs.
In this project, for the PValue algorithm, the data are partitioned into blocks, and then a long for loop (65535 iterations) over each block is split into small parts that are parallelized across threads.
The GPU we are using is an NVIDIA K20m with 5 GB of memory, and the kernel is launched with 32 blocks.
We expected the best performance to occur somewhere between 1 and 1024 threads because of several atomic add and synchronization operations, so we ran the PValue algorithm with the number of threads varying from 1 to 1024 and plotted the results in the following graph (y-axis: processing time, x-axis: number of threads).
We know the zig-zag shape is caused by the warp size: CUDA performs best when the number of threads is a multiple of the warp size (32).
The problem is that we cannot explain the sudden increase between 641 and 799 threads. Is it because, starting from 641 threads, CUDA runs out of local memory and starts using global memory? And if so, why does the processing time drop again after 799?
Do you know why? Any hint is appreciated. Thanks a lot.
The source code and the processing times are provided at the following links for your reference.