How do thread numbers affect processing time?

Hi all,

We are doing a research in The Chinese University of Hong Kong to accelerate data mining algorithms by using GPU.

In the project, for PValue algorithm, data are parallelized into blocks, and then a long for loop (65535) of every blocks are separated into small parts and parallelized into threads.

The GPU that we are using is NVIDIA K20M with 5G memory and in 32 block size.

We suppose the best performance happens somehow between 1 to 1024 threads because of several atomic add and synchronization operations. Hence, we run the Pvalue algorithm in different number of threads from 1 to 1024. And plotted into the following graph (y-axis: processing time, x-axis: number of threads).
http://appsrv.cse.cuhk.edu.hk/~pschu/cuda/pvalue_threads.png

We know the zig-zag shape is caused by warp size, as CUDA works best when thread numbers are in the multiplications of warp size.

The problem is, we cannot explain the sudden increase during thread number 641 to 799. Is it because starting from 641, CUDA ran out of local memory and starting to use global memory? But why the processing time dropped after 799?

Do you know why?? Any hint is appreciated. Thanks a lot.

Source code and the processing time are provided in the following links for your reference.
http://appsrv.cse.cuhk.edu.hk/~pschu/cuda/GpuAcceleratedPValueProcessor.cu
http://appsrv.cse.cuhk.edu.hk/~pschu/cuda/pvalue_processing_time.xlsx

But why the processing time dropped after 799?

it may be because number of thread blocks on each SM dropped (say from 3 to 2) so contention for data cache is decreased

I looked at your code and the chart, but I didn’t look at the code in too much detail so I’m sorry if I’m raising questions that should be obvious.

In general, the performance is getting worse as you add more threads. This would seem to be due to contentions caused by atomic operations - more threads would mean more contentions. However the performance also improves as you add more threads. This might be due to the fact that certain operations can be parallelized instead of having to run sequentially.

The plateau between 641 and 799 could occur in at least two cases. (1) there is a specific problem that occurs at 641 and the problem goes away after 799. (2) A problem occurs at 641 - maybe contention caused by atomics - and something good happens after 799 - maybe a thread multiplier effect that means the system doesn’t have to wait as long for a synchronization to occur.

To use a simple toy example, if I have 100 things to do and I am parallelizing them, I should see an improvement at 50 threads because each thread only has to do two things. At 49 threads, some of the threads will have to do three things and the system will wait longer for the synchronization. You’d then see another improvement at 100 threads.

So some questions.
(1) Have you tried extending the graph beyond 1024 to see if you get similar drop-offs to the one at 800? Maybe there is another dropoff later. Or using a problem size different from 65535?

(2) Have you tried adding or removing atomic operations to see if you can change the appearance of the graph? Is it possible to associate the spike at 641 and/or drop at 799 with a specific atomic or synchronization operation?

(3) If you put red vertical lines on your chart - at the points where 64K is an integer multiple of the number of threads, then how do the red lines correlate with the spikes in the blue chart line? Where they don’t, it must probably be something else such as the local memory vs. global memory issue you mentioned.

Thanks,
Erik