What would happen with my program on Kepler with 1024 threads, 8 blocks per multiprocessor, and a maximum of 32 registers per thread? I cannot spawn more than 8 blocks per multiprocessor. However, Kepler has 64K registers, so will some registers go unused? Or will CUDA automatically switch to 64 registers per thread, since the registers would be unused anyway? Will using only 1024 threads on Kepler lead to performance degradation?
If your thread blocks have 1024 threads, you won’t be able to run eight of them concurrently per multiprocessor due to limits on the maximum number of threads per multiprocessor. To maximize the chance of achieving high occupancy, I would recommend shooting for a thread block size around 256 threads. Note that while higher occupancy tends to lead to higher performance, the correlation is not very strong and there are published examples of achieving good performance with low occupancy.
There is no need to use all the registers to achieve good performance (this is no different than on a CPU, where one might not use all of the FPU registers for a given piece of code, for example). For example, if your code is a streaming kernel performing relatively simple operations it may need only 20 registers per thread, but with enough threads running concurrently it will saturate memory bandwidth and thus run as fast as possible. For other codes, using as many registers as possible (e.g. via “register blocking”) will result in the highest performance; again, this is no different from the situation on CPUs.
Of course, I have 8 blocks with 128 threads each, so 1024 in total. That is a good configuration for Fermi with 32 CUDA cores per multiprocessor, while Kepler has 192 cores per multiprocessor. So how will it do with only 1024 threads and 32 registers per thread? I have an existing program with this configuration.
Sorry, I misunderstood your configuration data. It’s impossible to predict the performance for a new GPU with different internal organization and performance characteristics based only on a few configuration parameters. Your app as-is may be (completely or partially) limited by GMEM throughput, shared memory throughput, SFU throughput, double-precision throughput, integer multiply throughput, etc., all impacting scaling from Fermi to Kepler in different ways.
In terms of occupancy, I find that for our GPUs in general in many cases an occupancy between 0.33 and 0.5 is all that is needed to keep the machine full enough to achieve peak performance, as all latencies are sufficiently covered. On the other hand, in some instances I have found a performance difference between an occupancy of 0.88 and 1.0 (i.e. 7 thread blocks versus 8 thread blocks). As I stated in my previous post, occupancy is only weakly correlated with performance, and performance cannot be predicted based on occupancy alone. With 1024 threads per multiprocessor, I personally would not worry much about the occupancy aspect.
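To put numbers on the occupancy figures discussed here, the following is a rough host-side C++ sketch of the arithmetic. The struct and function names (`SmLimits`, `residentBlocks`, `occupancy`) are made up for illustration. The limits are the published per-multiprocessor values for compute capability 2.x (Fermi) and 3.0 (Kepler). The sketch deliberately ignores shared memory and register allocation granularity, so the occupancy calculator remains the better tool for real kernels.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits (hypothetical helper type for this sketch).
struct SmLimits {
    int maxBlocks;   // resident thread blocks per SM
    int maxThreads;  // resident threads per SM
    int registers;   // 32-bit registers per SM
};

// Published limits for compute capability 2.x and 3.0.
constexpr SmLimits fermi()  { return {8, 1536, 32 * 1024}; }
constexpr SmLimits kepler() { return {16, 2048, 64 * 1024}; }

// Resident blocks per SM as limited by block slots, thread slots and registers.
// Simplification: shared memory and allocation granularity are NOT modeled.
inline int residentBlocks(const SmLimits& sm, int threadsPerBlock, int regsPerThread) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);
    const int byThreads = sm.maxThreads / threadsPerBlock;
    const int byRegs = sm.registers / (threadsPerBlock * regsPerThread);
    return std::min({sm.maxBlocks, byThreads, byRegs});
}

// Occupancy = resident threads / maximum resident threads per SM.
inline double occupancy(const SmLimits& sm, int blocks, int threadsPerBlock) {
    return static_cast<double>(blocks * threadsPerBlock) / sm.maxThreads;
}
```

Under these assumptions, 8 blocks of 128 threads give `occupancy(fermi(), 8, 128)` of about 0.67 and `occupancy(kepler(), 8, 128)` of 0.5, while `residentBlocks(kepler(), 1024, 32)` shows that only two 1024-thread blocks fit on a Kepler multiprocessor.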
Thanks, I thought this was one reason for some relatively poor benchmarks. Shared memory usage prevents more than 8 blocks, and the blocks are small, so there are not many threads per SM. If I understand correctly, the scheduler now mixes instructions from 4 different warps, so there are fewer ticks between instructions of the same warp, and more warps are needed to cover memory latency. That makes me worry that my program, with 1024 threads (32 warps) per SM, is not tuned for Kepler, where 64 warps would be better.
Also, how does maxrregcount work? If I set a maximum of 32 registers and run only a few threads per SM, would those threads automatically get more registers? I suppose not.
The compiler flag -maxrregcount=[n] instructs the compiler (on a per compilation unit basis) to limit the code it generates to the use of at most “n” registers. This means that each thread would use at most “n” registers. Note that due to the granularity of per-thread-block register allocation in the hardware the number of actual registers reserved may be higher than the number of registers actually used. The occupancy calculator included with CUDA incorporates these granularities for the various architectures and should give an accurate estimate of occupancy achieved based on register and shared memory use.
In general I recommend relying on the compiler defaults as to how many registers should be used for a given piece of code. If one wants to limit register usage to fewer registers than the compiler default to achieve higher occupancy, I would suggest using __launch_bounds__, which can be applied with function-level granularity and is thus a more flexible mechanism (which can also be adjusted based on target architecture). Often, trying to squeeze down register usage per thread by more than 2-3 registers versus the target picked by the compiler will cause register spilling significant enough that it leads to lower app performance despite higher occupancy.
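As a rough illustration of how launch bounds relate to a per-thread register budget, here is a small host-side C++ sketch. In CUDA source the qualifier is written `__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)` on the kernel definition. The helper name `registerBudget` is hypothetical, and the compiler's actual calculation also accounts for allocation granularity and the architecture's per-thread register cap, both of which are ignored here.

```cpp
#include <cassert>

// Approximate per-thread register budget that still lets `minBlocksPerSM`
// blocks of `maxThreadsPerBlock` threads be resident on one multiprocessor.
// Simplification: ignores allocation granularity and the per-thread cap.
inline int registerBudget(int registersPerSM, int maxThreadsPerBlock, int minBlocksPerSM) {
    assert(maxThreadsPerBlock > 0 && minBlocksPerSM > 0);
    return registersPerSM / (maxThreadsPerBlock * minBlocksPerSM);
}
```

With 128-thread blocks, `registerBudget(32 * 1024, 128, 8)` (Fermi, 8 blocks) and `registerBudget(64 * 1024, 128, 16)` (Kepler, 16 blocks) both come out at 32 registers per thread. If only 8 blocks can be resident on Kepler, the same arithmetic allows roughly twice as many registers per thread before occupancy drops, although nothing raises a kernel's register usage automatically.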
Doesn’t Kepler support 16 concurrent thread blocks per MP?
Yes, 16 blocks. However, my old CUDA program does not know that and spawns only 8 blocks per SM with 128 threads each, so only 1024 threads instead of 2048. By the way, is 128 threads per block a good number for Kepler? Can the schedulers take warps from different blocks? With a single 128-thread block you cannot get 6 warps for the schedulers.
I don’t understand the problem. Recompile the program for compute capability 3.0, then 16 blocks will be launched and full occupancy will be achieved.
Yes, I need to recompile and retune the program. However, the program is already released and I do not have a Kepler card on hand. Also, with bigger blocks I would need more shared memory and would have to change the cache config too.
After tweaking I got a 30% speedup over the GTX 580, while without the tweaks Kepler was 5% slower.
Are you saying that your grid has only 1024 threads? Because that’s not even enough to assign threads to every core, let alone hide latency. Aim for at least 10x as many threads, 50,000 is a good number that should saturate devices for at least a couple generations to come.
No, I had 1024 threads per SM.
Further upstream you mentioned that you are running thread blocks of 128 threads each. What is the total number of such thread blocks launched per kernel launch, i.e. what is the total size of the grid you are launching? If the grid is comprised of too few thread blocks, there may not be enough total threads in flight to cover all latencies.
On Fermi, when code is memory bound one would want to target a grid size of about 20x the number of thread blocks that can run simultaneously on the machine to achieve the best possible performance. So if the GPU has 14 SMs, each capable of running 8 thread blocks concurrently, ideally one would want to run a grid with 8 x 14 x 20 = 2240 thread blocks or more. Grids with only a couple of hundred thread blocks could lose about 10% of peak performance for such codes. The more computationally intensive the kernel, the less “oversubscription” is required for peak performance.
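The rule of thumb above is simple enough to write down as a small host-side C++ sketch. The function name `targetGridBlocks` is made up for illustration, and the 20x factor is the Fermi-era guideline from this post, not a guaranteed optimum for other architectures.

```cpp
#include <cassert>

// Suggested minimum grid size (in thread blocks) for a memory-bound kernel:
// blocks that can be resident on the whole GPU, times an oversubscription factor.
inline long long targetGridBlocks(int numSMs, int blocksPerSM, int oversubscription) {
    assert(numSMs > 0 && blocksPerSM > 0 && oversubscription > 0);
    return static_cast<long long>(numSMs) * blocksPerSM * oversubscription;
}
```

For the example in the text, `targetGridBlocks(14, 8, 20)` gives 2240 thread blocks. In a real program the SM count could be read from the `multiProcessorCount` field of `cudaDeviceProp` rather than hard-coded.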
I don’t know whether the same rule of thumb holds for Kepler; I simply have not had enough exposure to Kepler yet. Pre-Fermi cards could achieve full performance on memory bound codes with a smaller number of thread blocks in a grid; I often simply used a fixed number of 240 thread blocks per grid for simple streaming kernels on sm_13.
There are any number of performance limiting factors that play out differently between Fermi and Kepler, since they are different architectures. For example, for memory-bound codes, it may be useful to compare the raw memory bandwidth provided by GTX580 and GTX680:
Sorry, I mistyped. I run 8*128 threads on each SM on Fermi. After I changed to full occupancy on Kepler I got a speedup; otherwise my code was 10% slower.
How do you control the number of threads or blocks per SM?
With the occupancy calculator. I check shared memory usage and register usage and see the maximum number of blocks I could run.