I’m currently comparing the performance of different versions of my code (one using Dynamic Parallelism, one using Warp Shuffle) to find the most efficient one. I am working on Windows 7 with Visual Studio 2010, and my Nsight version is 18.104.22.16809 (the latest one). My strategy is to find the most balanced usage of registers and shared memory.
From the experiments I did in Nsight, I noticed that each kernel launch initially gets 256 bytes of shared memory per block, plus 32 bytes per thread during execution. For the registers, I didn’t find any such initial overhead. Therefore, I derived the two formulas below to calculate the best register usage for different block sizes, and I use __launch_bounds__ to constrain the block size and the minimum number of resident blocks per SM.
Registers: BlockSize * NumberOfBlocks * RegistersPerThread <= 64K (65,536 registers per SM)
Shared memory: 256 * NumberOfBlocks + 32 * BlockSize * NumberOfBlocks <= 48K (49,152 bytes per SM)
With the two formulas above I can derive a theoretical result; however, it differs from what Nsight shows. For example, with a block size of 448 (14 warps), the formulas give a maximum of 48 registers per thread with 3 resident blocks. However, the compiler only allocates 40 registers per thread. Another strange thing: if I set the maximum register count in my project’s property settings (which maps to the -maxrregcount compiler option) instead of using __launch_bounds__, each thread does get 48 registers, but the number of resident blocks drops to 2.
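For reference, this is how I apply the bounds in the kernel declaration. A minimal sketch only: the kernel name, parameter, and body are placeholders, not my actual code.

```cuda
// 448 = max threads per block (14 warps),
// 3   = minimum number of blocks I want resident per SM.
__global__ void
__launch_bounds__(448, 3)
myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;  // trivial body just to make the sketch complete
}
```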
Given the observations above, it seems there is some fixed per-launch register overhead, just like with the shared memory. However, I can't find any documentation to confirm this conjecture. Has anybody run into the same problem before? Or does anybody have any idea about it?