Max blocks per SM less than expected

I’m working on increasing kernel concurrency (testing on an M1000M, compute capability 5.0). The first problem I ran into was that any synchronization call, interspersed memcpy, or short-duration kernel caused awful fragmentation of the kernel launches. From what I’ve read, I believe this is due to WDDM’s batch processing of commands. To get around the problem I serialized all 128 independent streams onto a single CPU thread. While blocking all CPU threads is not ideal, the concurrency improvement on the GPU cut the total run-time to something like 1/5.
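Roughly, the launch pattern I ended up with looks like this (a simplified sketch, not my actual code: the kernel, stream count, and sizes are placeholders). One host thread issues every launch across the streams and synchronizes only once at the end, so there are no per-stream host syncs to break up WDDM’s batches:

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void work(float *data, int n) {          // stand-in for the real IO-bound kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int kStreams = 128, kN = 1 << 16;          // placeholder sizes
    std::vector<cudaStream_t> streams(kStreams);
    float *buf;
    cudaMalloc(&buf, kStreams * kN * sizeof(float));
    for (auto &s : streams) cudaStreamCreate(&s);

    // All launches come from this single host thread; no per-stream syncs,
    // so the driver can batch the submissions without fragmenting them.
    for (int s = 0; s < kStreams; ++s)
        work<<<(kN + 31) / 32, 32, 0, streams[s]>>>(buf + s * kN, kN);

    cudaDeviceSynchronize();                          // one sync at the end
    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(buf);
    return 0;
}
```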

Next I noticed that, because my kernels are I/O-bound, decreasing the number of threads per block does not impact single-kernel performance much, but it seems to further improve concurrency. However, I’m running into an unexpected limitation somewhere: I can’t seem to get beyond four blocks per SM. Even when I drop to a single warp per block (which uses 128*32 of the 65536 registers), I see a maximum of 16 blocks running concurrently, divided over four SMs.
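To double-check that the theoretical limit really is higher than four, the occupancy API can report what the hardware should allow for a given block size (sketch below; `work` is a placeholder kernel, and the number it prints is the theoretical per-SM maximum, not what actually runs concurrently):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *data, int n) {           // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int maxBlocksPerSM = 0;
    // 32 threads per block, no dynamic shared memory
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, work, 32, 0);
    printf("theoretical max resident blocks per SM at 32 threads/block: %d\n",
           maxBlocksPerSM);
    return 0;
}
```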

So I’m not limited by registers, threads, or blocks per SM. I’m serializing all kernels onto a single thread, so the number of connections shouldn’t be an issue either. It is suspicious that 16 is exactly the per-SM block limit implied by register usage (65536 / (128*32) = 16), but that should only be an SM-level limitation, not a GPU-wide one.
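For completeness, this is roughly how I’m sanity-checking the per-SM resource limits the runtime reports (standard runtime API queries, nothing specific to my kernels):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("registers per SM:      %d\n", prop.regsPerMultiprocessor);
    printf("registers per block:   %d\n", prop.regsPerBlock);
    printf("max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```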


Edit (slightly off topic)

Alternatively, if I can’t increase the number of concurrent kernels and increasing threads doesn’t help, it might be possible to increase throughput by using more registers. Visual Profiler reports that most memory transactions are local. Almost all arrays are statically indexed, and I’m only using a small percentage of the available registers, but I can’t figure out a way to force the compiler to use more. I tried `__launch_bounds__` and setting `-maxrregcount`, to no avail. I suspect very long loops make the code hard for the compiler to analyze.
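This is what I mean by trying to steer register allocation (sketch; the kernel body is a placeholder). As far as I understand, both knobs only constrain the allocation, neither forces the compiler to use more registers:

```cpp
// 1) Per-kernel hint: at most 32 threads/block, aim for >= 8 resident
//    blocks/SM. The compiler limits registers/thread so that many blocks fit.
__global__ void __launch_bounds__(32, 8) work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;        // placeholder for the real long loop
}

// 2) Whole-compilation cap via the compiler flag, e.g.:
//      nvcc -maxrregcount=128 kernel.cu
//    (add -Xptxas -v to print the register count ptxas actually used)
```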

This may be slightly off topic, but I think the number of actually usable registers is way, way smaller. You only get something like 255 registers max per array being processed on the most recent architectures; on earlier architectures it was more like 65 or 127. “Threads” is a marketing term: there are no threads, just a large array of 4-byte values that are processed simultaneously. Even worse, the latest models process two or more such arrays at once, which effectively merges them into one huge array, with each array getting its own sub-command. So, given an array size of 32 and two arrays processed at once, you would have only about 1k registers, and they are extremely slow; they are split into many parts to hide the latency, so in the end you would have something like four groups of extremely large registers, with about 255 actually usable per processed array (255 arrays, 256 bytes each, one filled with zeros). This is not exact; I have just made some theoretical considerations that resemble the actual state of things.

I’m not sure I follow. Do you have a link to documentation? Most arrays I’m using are relatively small (the largest is 16 float4s). The issue is that the loop is long, so the compiler loads from global memory, stores temporarily to local memory for intermediate processing, then loads again from local. At least, I think that’s what it’s doing. The load and store units are the primary bottleneck. The compiler seems determined to use 128 registers almost no matter what I tell it.
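To illustrate what I mean by “statically indexed” versus spilling to local memory, a toy sketch (not my actual kernel): a per-thread array can only live in registers if the compiler can resolve every index at compile time, typically via full unrolling; dynamic indexing usually forces it into local memory. Compiling with `nvcc -Xptxas -v` shows the registers used and any spill bytes.

```cpp
__global__ void toy(float4 *out, const float4 *in, int n) {
    float4 buf[16];                      // small per-thread array

    #pragma unroll                       // fully unrolled, constant indices:
    for (int i = 0; i < 16; ++i)         // compiler can keep buf in registers
        buf[i] = in[blockIdx.x * 16 + i];

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    #pragma unroll
    for (int i = 0; i < 16; ++i) {       // still static indexing
        acc.x += buf[i].x; acc.y += buf[i].y;
        acc.z += buf[i].z; acc.w += buf[i].w;
    }
    // If buf were indexed with a runtime value the compiler could not unroll,
    // it would usually be demoted to local memory instead.
    if (threadIdx.x == 0) out[blockIdx.x] = acc;
}
```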

As far as I can understand, whatever array size you use, it is processed as an array of 32 values of 4 bytes each (which is what the marketing term “warp size” refers to), or as a few of them. Each register is not 4 bytes but 32 times larger. As I said, this is off topic; I’m exploring the architecture myself and running into a lot of marketing lies. Anyway, according to the documentation you have 255 actually usable registers per thread (per actual thread). It could be that when the compiler finds a dependency, the number of registers is halved because the instructions are split into two sub-instructions that deal with two sets of registers, effectively doubling the actual register size and leaving half as many registers for a single thread packed into the two-command format. I could theorize that it would be better to leave the second command as a no-op in that case, but the compiler logic may be twisted; it is also possible that a memory write is defined over half of the physically allocated registers, one array write per cycle, while a command is divided into sub-commands addressing the two register sets. Consider this off topic, as I’m not sure myself.

http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#nvidia-maxwell-compute-architecture

Or latency is the issue: some commands (actually most GPU commands) can take a long time to execute, and while you have a few thousand registers, 4 or 8 register sets specified for each of them would not cover the latency, whereas sets twice as large might be enough from the compiler’s point of view, leaving you with only 128 registers. In the end, the more memory areas you have, the smaller each single memory area is. However, this is off topic; don’t take it as an answer.

I moved all the kernel calls out of CPU threads and into GPU thread blocks. When I decrease the number of threads per block from 128 to 32, there is a run-time reduction of 10–20%. Since I can no longer see each individual block in Visual Profiler, I can only infer that more than four blocks are running simultaneously per SM. So it’s not conclusive, but it seems the limitation is somewhere in WDDM rather than in the GPU itself.
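Roughly, the restructuring looks like this (simplified sketch; the per-item work and the sizes are placeholders): instead of 128 small launches, one launch where each block owns one of the formerly independent work items.

```cpp
#include <cuda_runtime.h>

__device__ void processItem(float *item, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        item[i] *= 2.0f;                         // stand-in for the real per-item work
}

__global__ void workAll(float *data, int itemSize) {
    // Each block handles one of the formerly independent "kernel calls".
    processItem(data + blockIdx.x * itemSize, itemSize);
}

int main() {
    const int kItems = 128, kItemSize = 1 << 12;  // placeholder sizes
    float *buf;
    cudaMalloc(&buf, kItems * kItemSize * sizeof(float));
    workAll<<<kItems, 32>>>(buf, kItemSize);      // 128 blocks, 32 threads each
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```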