Lower performance with larger grid sizes


I ran some tests of my application and noticed that, for some grid sizes, performance decreases as the grid size increases. Why does GPU performance not grow continuously with the size of the problem?

Grid size          Gflop/s        sm_efficiency_instance (avg)
64 x 64 x 64       61.51927742
128 x 64 x 64      112.3474286
128 x 128 x 64     146.1377471    91.51%
256 x 128 x 64     141.9235423    84.16%
256 x 256 x 64     164.2279957    91.51%
512 x 256 x 64     175.8453341
512 x 512 x 64     175.1639587
1024 x 512 x 64    175.4220063

These tests were executed on a GeForce GTX TITAN; a similar trend shows up on a Tesla K20X and a Tesla K80.
I also tried varying only one dimension at a time, but that produced the same results.

This is impossible to diagnose without knowing a lot more about the code in question, or ideally having access to the full code and running it through the profiler.

Here is just one hypothetical scenario: the partitioning of compute activity into blocks and threads, coupled with the data access pattern of each thread, coupled with the details of multiple hardware scheduling mechanisms, creates particular streams of memory activity presented to the GPU memory controllers. The efficiency of memory access by those controllers is a function of the particular stream (e.g. access size, address pattern, order of reads and writes). If any of the parameters outlined above change (e.g. grid size or number of threads per block), the efficiency of memory access may be affected.

Similar effects may affect the efficiency of other hardware resources. The interactions between the numerous decoders, schedulers, buffers, and re-order mechanisms in the hardware are generally difficult to predict. This phenomenon is not unique to GPUs; it affects every sufficiently complex computing device and system. In some instances it may be possible to develop a mental model of what is happening with the help of a profiler, but in many instances the interactions are so complex that this is not possible. One way to deal with this in practical terms is to add auto-tuning to the software: a configuration stage tries many different combinations of software configuration parameters and keeps the best ones found. This approach was first pioneered by BLAS and FFT libraries on CPUs some twenty years ago.
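A minimal sketch of the auto-tuning idea, in Python for brevity. The parameter names (`block_size`, `unroll`) and the benchmark function are hypothetical stand-ins; in a real application `benchmark` would launch the kernel with the given configuration and return its measured execution time:

```python
import itertools

def autotune(benchmark, param_space):
    """Exhaustively try every parameter combination and keep the fastest.

    `benchmark` runs the workload with the given configuration and
    returns its execution time; smaller is better.
    """
    best_config, best_time = None, float("inf")
    names = list(param_space)
    for values in itertools.product(*param_space.values()):
        config = dict(zip(names, values))
        t = benchmark(config)
        if t < best_time:
            best_config, best_time = config, t
    return best_config, best_time

# Hypothetical search space: threads per block and an unroll factor.
space = {"block_size": [64, 128, 256, 512], "unroll": [1, 2, 4]}
# Toy stand-in for a real kernel timing (fastest at the largest product).
fake_benchmark = lambda cfg: 1.0 / (cfg["block_size"] * cfg["unroll"])

best, t = autotune(fake_benchmark, space)
print(best)  # {'block_size': 512, 'unroll': 4}
```

Real auto-tuners usually prune the search space (not every combination is legal, e.g. occupancy limits) and persist the winning configuration per GPU model.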

I would also guess that it has something to do with data access patterns created by the work breakdown structure associated with grid (size/dimensions).

Most ordinary loading effects should be out of the picture at a grid size of (64,64,64) (that’s 256K blocks!). I’m kinda surprised there is any efficiency gain (Gflop/s throughput increase) above that, unless you have something terribly broken in your threadblock structure, like threadblocks of 32 threads or fewer.
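The arithmetic behind "256K blocks" can be sketched as follows; the 16-blocks-per-SM figure is an assumption for illustration (the real residency limit depends on the kernel's occupancy):

```python
import math

# A (64, 64, 64) grid already contains a quarter-million blocks.
grid = (64, 64, 64)
num_blocks = grid[0] * grid[1] * grid[2]
print(num_blocks)   # 262144, i.e. 256K blocks

# Assumed figures: 14 SMs on a GTX TITAN, hypothetically 16 resident
# blocks per SM at this kernel's occupancy.
sms, blocks_per_sm = 14, 16
concurrent = sms * blocks_per_sm
waves = math.ceil(num_blocks / concurrent)
print(waves)        # 1171 full "waves" of blocks through the machine
```

With over a thousand waves of blocks, the machine is saturated from the start, which is why further gains from growing the grid are surprising.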

Not knowing anything about the application and the measurement methodology used, the differences in timing could also be (partially) due to a flawed timing framework. Memory-intensive workloads in particular tend to show larger variations from run to run on both CPUs and GPUs, and their timing typically needs to be stabilized by using a “minimum of ten invocations” methodology (as used by the STREAM benchmark) to get meaningful performance comparisons.
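A minimal sketch of that methodology, assuming a callable that launches and synchronizes the workload being measured:

```python
import time

def time_min(func, runs=10):
    """Report the minimum time over several invocations, as in the
    STREAM benchmark, so run-to-run noise does not distort comparisons."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

# Usage: compare two configurations by their best-of-ten times.
t = time_min(lambda: sum(range(100_000)))
print(f"best of 10: {t:.6f} s")
```

Taking the minimum (rather than the mean) assumes noise only ever adds time, which holds for most timing jitter sources.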

As Norbert said, it’s very hard to determine the root cause, but it’s fun to make speculative guesses :)

One effect that used to produce this kind of staggered performance across grid dimensions was/is partition camping (think of shared-memory bank conflicts, but for large global-memory partitions). I’m not sure how relevant partition camping is these days, as NVIDIA supposedly implemented a workaround in Fermi and later architectures.

Anytime one deals with a memory structure consisting of multiple banks / channels / partitions one runs the risk of encountering pathological cases that cause conflicts and result in significantly reduced performance. The solution typically used by processors is to scramble address bits (i.e. produce an address hash, usually a simple one with a few XORs) to avoid conflicts for commonly encountered access patterns. So the likelihood of hitting a low-performing access pattern is much reduced but not zero.
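A toy illustration of that address-bit scrambling, with made-up partition counts and bit positions (real hardware hash functions are undocumented): the partition index is taken from an XOR of two address bit-fields rather than the low bits alone, so a pathological stride no longer maps every access to one partition.

```python
NUM_PARTITIONS = 8  # assumed; a power of two for simple masking

def partition_plain(addr):
    # Naive 256-byte interleave: partition = low address bits only.
    return (addr >> 8) & (NUM_PARTITIONS - 1)

def partition_hashed(addr):
    # Simple XOR hash: fold higher address bits into the partition index.
    low = (addr >> 8) & (NUM_PARTITIONS - 1)
    high = (addr >> 13) & (NUM_PARTITIONS - 1)
    return low ^ high

# A 2 KiB stride (8 partitions * 256 B) is pathological without hashing:
addrs = [i * 2048 for i in range(16)]
print([partition_plain(a) for a in addrs])        # every access hits partition 0
print(len({partition_hashed(a) for a in addrs}))  # 4 -> spread across partitions
```

The hash does not eliminate conflicts; it just makes the common power-of-two strides land on different partitions, which is why a low-performing pattern is "much reduced but not zero".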

njuffa: I applied an auto-tuning method here with a set of parameters. I repeated these tests multiple times to obtain valid execution times, and I also warmed up the GPU before collecting the data. The number of flops is measured with the NVIDIA profiler.

Thank you all for the valuable comments. Thanks to you I decided to run another set of tests. Now I’m pretty sure it must result from the scheduling mechanism. I tested grids of size 96x…x…, 104x…x…, 112x…x…, …, 256x…x… When I divided the number of threads by the number of SMXs (14 in the GTX TITAN), I got a local performance peak every time the remainder was close to 1 (when the remainder is 0, I replace it with 1). Here are my results:

Gflop/s    remainder = (#threads % 14 == 0) ? 1 : #threads % 14
170.9      1.00  --> local peak
170.5      1.00  --> local peak

It’s the so-called “tail effect”. You may run 10-100x more threads to mitigate it. Otherwise, you may schedule work in two or more streams - then the tail of one job executes simultaneously with the head of the next job.
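A back-of-the-envelope sketch of why more threads shrink the tail, assuming equal-length blocks and a hypothetical residency of 16 blocks per SM: with all blocks the same length, total time is one unit per "wave" of concurrently resident blocks, and the partially filled final wave costs one wave out of the total.

```python
import math

def tail_fraction(num_blocks, concurrent_blocks):
    """Fraction of total wall time spent in the final, partially filled
    'wave' of blocks (all blocks assumed to take equal time)."""
    waves = math.ceil(num_blocks / concurrent_blocks)
    return 1 / waves  # the tail is one wave out of `waves`

# Assumed: 14 SMs * 16 resident blocks = 224 concurrent blocks.
concurrent = 14 * 16
print(tail_fraction(300, concurrent))     # 0.5 -> the tail wave is half the runtime
print(tail_fraction(30000, concurrent))   # ~0.007 -> 100x more work, tail ~0.7%
```

This is why overlapping jobs in multiple streams helps as well: the otherwise idle SMs in one job's tail wave can start executing the next job's blocks.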