Hello, I have a kernel of about 300 lines, with a lot of syncs and nested loops.
Since I’m register-bound, I’m trying to reduce the register count without losing performance. The kernel takes an image and a convolutional neural network as input, and it works as follows:
I have 13x13 threads
Each thread copies 4 pixels of the 28x28 image to smem
Each thread also copies a part of the neural network to smem
Syncthreads
Each thread then computes the convolution of those 4 pixels and applies max pooling, keeping only the highest value and saving it to smem
Syncthreads
Now, only one thread per block executes the operations to get the output layer with these loops:
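Schematically they look something like this (placeholder names and sizes, not the actual code; the full kernel is linked further down in the thread):

```cuda
// Schematic only -- identifiers and sizes are placeholders.
#define OUT    10    // output neurons
#define HIDDEN 169   // e.g. 13x13 pooled activations held in smem

__device__ int output_layer(const float *weights, const float *pooled)
{
    float sums[OUT];
    #pragma unroll
    for (int o = 0; o < OUT; o++) {
        sums[o] = 0.0f;
        for (int h = 0; h < HIDDEN; h++)
            sums[o] += weights[o * HIDDEN + h] * pooled[h];
    }
    int best = 0;                       // keep the index of the largest sum
    for (int o = 1; o < OUT; o++)
        if (sums[o] > sums[best]) best = o;
    return best;
}
```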
Now, the whole kernel takes 40 registers. If I remove the #pragma unroll the register count goes down to 32 (but the execution time increases by 5%). If I remove the two loops, the register count goes down to 18.
I don’t understand why, in a 300-line kernel with a dozen ifs and loops, these 4 lines take more than half of the kernel’s registers (22/40). I also tried to look at the PTX, but it’s very hard to follow the compiler optimizations. I tried to refactor this code in a few ways, but I always end up with 40 registers.
Do you have any tips on how I can investigate this further?
You will need to look at the disassembly of the actual machine code (SASS). PTX is only an intermediate format that is translated into SASS by an optimizing compiler, ptxas, which also performs register allocation. Use cuobjdump --dump-sass to extract the machine code. Sufficient back-annotation of the machine code with bits of source code will take several hours; it is a tedious process (this applies to all processor architectures using modern compilers).
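For example (the file name is just a placeholder; cuobjdump accepts an executable, object file, or cubin that contains machine code for an actual GPU architecture):

```
cuobjdump --dump-sass my_app > my_app.sass
```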
The textual description of the code does not allow a third party to conduct the analysis you desire. You would have to post minimal complete code that reproduces the issue. But at 300 lines you will find few takers to analyze the machine code, as that requires (partial) back-annotation, a tedious task.
40 registers does not strike me as particularly many. Keep in mind that GPUs are 32-bit machines that support 64-bit addressing, so any array access will require a 64-bit pointer occupying two registers.
In order to enhance the latency tolerance of code, the compiler will often move load instructions up, lengthening the live range of the data and likely requiring additional temporary registers. Loops often lead to the creation of additional induction variables as part of code optimization, and these likewise need additional registers.
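As a toy illustration (unrelated to your kernel; names are made up), in a simple grid-stride loop like the one below the compiler may schedule the two loads well ahead of their uses to hide memory latency, lengthening the live ranges of a and b, and the 64-bit pointers in and out already occupy two registers each:

```cuda
__global__ void scale(const float *in, float *out, int n, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    // The compiler may issue both loads well before their uses to hide
    // latency, keeping 'a' and 'b' live longer; 'in' and 'out' are 64-bit
    // pointers, each occupying two 32-bit registers on their own.
    for (; i + stride < n; i += 2 * stride) {
        float a = in[i];            // loads may be scheduled far ahead of use
        float b = in[i + stride];
        out[i]          = alpha * a;
        out[i + stride] = alpha * b;
    }
    if (i < n)                      // leftover element, if any
        out[i] = alpha * in[i];
}
```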
Based on extensive personal experience: this is rarely possible if one attempts to squeeze the compiler’s automagic selection down by more than two registers. Generally, the CUDA compilers make good trade-offs in trying to maximize performance.
I’ve tried that, and while the number of registers changed, I get an “invalid parameter” CUDA error when I launch the kernel at runtime. I’ll have a look into it.
Based on extensive personal experience: this is rarely possible if one attempts to squeeze the compiler’s automagic selection down by more than two registers. Generally, the CUDA compilers make good trade-offs in trying to maximize performance.
I thought so; what I meant was that I wonder if there is a way to write these loops differently to let the compiler optimize them better. I’m just a bit weirded out that this particular loop uses so many more registers than the others.
If you want to look at the complete kernel code, here it is: https://github.com/EmmanueleVilla/ga_cnn/blob/master/network/fitness_calculator_gpu.cu
Thanks for the info
I tried a different approach for those loops, and in fact I got not only fewer registers but also a very big speedup by removing the inner one and changing my logic. Thanks for your input!
In terms of general advice, I would avoid peppering code with instances of #pragma unroll. It can easily help on one GPU architecture with one version of the toolchain, and be counterproductive on other architectures or CUDA versions, making the code hard to maintain.
In general the loop unrolling heuristics of the CUDA compiler work well, although I have noticed a tendency in recent versions (since 11.x) toward what I consider over-aggressive unrolling with at best marginal benefits.
It is often advantageous to completely unroll loops with small trip counts known at compile time to enable further optimizations. This may lead to scalarization of the small arrays involved; e.g., here the compiler might create new variables sums_0, ..., sums_9 that can then be assigned to registers, driving up register usage. I have not checked whether that happened here. I don’t know whether you actually need multiple partial sums. Would a single (scalar) sum work?
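A toy example of that effect (not your code; the trip count of 4 is arbitrary):

```cuda
// Toy example: with a compile-time trip count the compiler can fully unroll
// the loop, so acc[i] is only ever indexed with constants and acc[] can be
// promoted to four individual scalars, each living in its own register.
__device__ float dot4(const float *a, const float *b)
{
    float acc[4];
    #pragma unroll
    for (int i = 0; i < 4; i++)
        acc[i] = a[i] * b[i];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```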
In terms of general advice, I would avoid peppering code with instances of #pragma unroll. It can easily help on one GPU architecture with one version of the toolchain, and be counterproductive on other architectures or CUDA versions, making the code hard to maintain.
Ah, good to know… I will try to remove it and use a manual stride to check if things go better.
Would a single (scalar) sum work?
Unfortunately I need the partial sums, because I need to know at which index of the array the result is highest.
I put this sum in a for loop because I thought there would be too much memory contention if I did this operation in parallel… now I tried taking the first 10 threads of the block and using the nth thread to calculate the nth sum, and while I now have 3-4 cycles per memory store instead of only one, the total execution time halved (and the register count went from 40 to 32).
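Roughly like this (simplified, placeholder names and sizes, and shown as a standalone kernel just for illustration):

```cuda
#define OUT    10    // output neurons
#define HIDDEN 169   // e.g. 13x13 pooled activations

__global__ void output_layer_parallel(const float *weights,
                                      const float *pooled, int *label)
{
    __shared__ float sums[OUT];
    int t = threadIdx.y * blockDim.x + threadIdx.x;

    // The first 10 threads each accumulate one output sum in parallel,
    // so no thread ever holds all 10 partial sums in registers at once.
    if (t < OUT) {
        float s = 0.0f;
        for (int h = 0; h < HIDDEN; h++)
            s += weights[t * HIDDEN + h] * pooled[h];
        sums[t] = s;
    }
    __syncthreads();

    // A single thread then keeps only the index of the largest sum.
    if (t == 0) {
        int best = 0;
        for (int o = 1; o < OUT; o++)
            if (sums[o] > sums[best]) best = o;
        label[blockIdx.x] = best;    // one result per block
    }
}
```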
Now I know that I cannot rule out any solution beforehand :D