I’m working on increasing kernel concurrency (testing on an M1000M, compute capability 5.0). The first problem I ran into was that any synchronization call, interspersed memcpy, or short-duration kernel caused awful fragmentation of kernel launches. From what I’ve read, I believe this is due to WDDM’s batch processing of launches. To get around the problem I serialized all 128 independent streams onto a single CPU thread. While blocking all CPU threads is not ideal, the concurrency improvement on the GPU reduced total run-time to roughly 1/5.
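For context, the launch pattern I ended up with looks roughly like this (a simplified sketch; `myKernel`, the buffer size, and the launch configuration are placeholders, not my actual code):

```cpp
// Simplified sketch of the single-host-thread launch pattern.
// myKernel, the buffer size, and the launch configuration are placeholders.
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    // ... real work here ...
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main()
{
    const int NUM_STREAMS = 128;
    cudaStream_t streams[NUM_STREAMS];
    float *buffers[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
        cudaMalloc(&buffers[i], 1024 * sizeof(float));
    }

    // All launches are issued from this one CPU thread, one stream each,
    // with no synchronization or memcpy interleaved between them.
    for (int i = 0; i < NUM_STREAMS; ++i)
        myKernel<<<4, 32, 0, streams[i]>>>(buffers[i]);

    // A single synchronization only after everything has been queued.
    cudaDeviceSynchronize();

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```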
Next I noticed that because my kernels are IO-bound, decreasing the number of threads per block does not impact single-kernel performance much, but it does seem to further improve concurrency. However, I’m running into an unexpected limitation somewhere: I can’t seem to get beyond four blocks per SM. Even when I drop to a single warp per block (which uses 128 registers × 32 threads = 4096 of the 65536 registers per SM), I see a maximum of 16 blocks running concurrently, divided over the four SMs.
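To sanity-check the theoretical per-SM limit, I can query the occupancy API; roughly like this (a sketch, with `myKernel` standing in for the real kernel):

```cpp
// Sketch: query the theoretical blocks-per-SM limit for a 32-thread block.
// myKernel stands in for the real kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main()
{
    int maxBlocksPerSM = 0;
    // 32 threads per block, no dynamic shared memory
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel, 32, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Theoretical limit: %d blocks/SM across %d SMs\n",
           maxBlocksPerSM, prop.multiProcessorCount);
    return 0;
}
```

This reports the per-SM ceiling for a 32-thread block, which I can compare against the 4 blocks/SM I actually observe.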
So I’m not limited by registers, threads, or blocks per SM. Since I’m serializing all kernel launches to a single thread, the number of connections shouldn’t be an issue either. It is suspicious that the register limit works out to exactly 16 blocks per SM (65536 / 4096), the same as the 16 blocks I observe across the whole GPU, but registers should only impose a per-SM limit, not a GPU-wide one.
Edit (slightly off topic)
Alternatively, if I can’t increase the number of concurrent kernels and increasing threads doesn’t help, it might be possible to increase throughput by using more registers. Visual Profiler reports that most memory transactions are to local memory. Almost all arrays are statically indexed, and I’m only using a small percentage of the available registers, but I can’t figure out a way to force the compiler to use more. I tried __launch_bounds__ and setting -maxrregcount, to no avail. I suspect the very long loops make the code hard for the compiler to analyze.
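For reference, what I tried looks roughly like this (a sketch; the specific values are only examples of what I experimented with, not my actual code):

```cpp
// Sketch of the two register-related knobs I tried; the values shown are
// only examples of what I experimented with.

// Per-kernel hint: at most 32 threads per block, at least 1 resident block
// per SM, which should allow the compiler to use up to the hardware cap of
// 255 registers per thread on compute 5.0.
__global__ void __launch_bounds__(32, 1) myKernel(const float *in, float *out)
{
    float local[16];                 // statically indexed, should live in registers
    #pragma unroll
    for (int i = 0; i < 16; ++i)
        local[i] = in[threadIdx.x + 32 * i];

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 16; ++i)
        sum += local[i];

    out[threadIdx.x] = sum;
}

// Compiler-wide knob, passed to nvcc:
//   nvcc -maxrregcount=128 ...
// Note that -maxrregcount only caps register usage; as far as I know there is
// no flag that forces the compiler to allocate *more* registers.
```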