Hello! I'm currently profiling one of my kernels. I noticed that the amount of data written to global memory is much larger than expected: the kernel's output should be just under 10 MB.
Upon further profiling I noticed there were a lot of writes to local memory:
There are around 20K expected global stores, but the profiler reports more than 71,000K local stores! According to the profiler, these seem to be stalling the warps considerably, so they are unable to hide the latency.
The problem is located in the following function (only the beginning is shown). As you can see on the right, the variables buffer1 and buffer2 seem to be stored in local memory for some reason. The function is called with layer_size=32, and I'm expecting each thread to have up to 255 registers available, which is more than the 64 registers both buffers need. Any idea why the compiler is storing the buffers in local memory?
Also, running the code with -Xptxas -v shows the following output:
```
ptxas info : Compiling entry function '<mykernel>' for 'sm_86'
ptxas info : Function properties for <mykernel>
    256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 128 registers, 528 bytes cmem
```
This confuses me: there is no register spilling, which I was expecting, and the register count suggests there is still room for 64 more. However, the stack frame is 256 bytes, which may suggest that's where the variables are being stored (64 values × 4 bytes each). If that's the case, why? They are just local arrays.
All thread-local data is placed in local memory by default. As an optimization, the compiler can often pull some of it into registers. This is problematic when the local data is an array: there is no such thing as indexing into the register file. A local array can only be scalarized, a necessary precondition to being placed into registers, when
(1) it is sufficiently small (for the compiler’s implementation-specific definition of “sufficiently small”)
(2) all indexes used to access data in the array can be computed at compile time
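As a minimal sketch (hypothetical kernels, illustrative only), compare an array indexed with a runtime value against one whose every index becomes a compile-time constant after unrolling:

```cuda
// Hypothetical example: buf cannot be scalarized because k is only
// known at run time, so the array stays in local memory.
__global__ void dynamic_index(const float *in, float *out, int k)
{
    float buf[8];
    for (int i = 0; i < 8; ++i)
        buf[i] = in[threadIdx.x * 8 + i];
    out[threadIdx.x] = buf[k];   // runtime index defeats scalarization
}

// Here every access uses a compile-time index once the loops are fully
// unrolled, so the compiler may replace buf with eight scalars and
// keep them in registers.
__global__ void static_index(const float *in, float *out)
{
    float buf[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        buf[i] = in[threadIdx.x * 8 + i];
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        s += buf[i];
    out[threadIdx.x] = s;
}
```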
Think of scalarization as the compiler replacing each a[n] with a scalar variable a_n. The compiler will attempt to allocate these scalars to registers if enough are available. If that is not possible, it may (temporarily) push some of them back into local memory, a process called spilling.
With strenuous squinting (please post code as marked-up text, not as images) I can make out that the kernel in question uses two local arrays of 64 floats each. That probably exceeds the size limit imposed by the compiler. The accesses themselves look scalarizable in principle once the loop is fully unrolled. You can try #pragma unroll 64 if the compiler does not do that by itself. Look at the generated SASS (= machine code) with cuobjdump --dump-sass.
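For reference, a typical way to get these diagnostics (file and object names are placeholders):

```shell
# Compile-time resource report: registers, stack frame, spills
nvcc -arch=sm_86 -Xptxas -v -c mykernel.cu

# Dump the machine code; STL/LDL are local-memory store/load
# instructions, a sign that an array was not scalarized
cuobjdump --dump-sass mykernel.o | grep -E 'STL|LDL'
```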
Side remark: o is a terrible choice for a variable name, because at the resolution of my screen this is hardly distinguishable from 0.
As an optimization, the compiler can often pull some of it into registers
I see. I thought registers were the default for any variable, but now I see the default is local memory.
You can try #pragma unroll 64 if the compiler does not do that by itself
Even though the default size for the arrays is 64, I'm currently using size 32. However, you are right: fully unrolling the loop has the desired effect of making the compiler optimize the arrays into registers.
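For anyone landing here later, the change was along these lines (a sketch: the loop body is hypothetical, only buffer1/buffer2 and layer_size=32 come from the original post):

```cuda
__global__ void mykernel(const float *in, float *out /* ... */)
{
    float buffer1[32];
    float buffer2[32];
    // Fully unroll so every buffer index is a compile-time constant,
    // allowing the compiler to promote both arrays to registers.
    #pragma unroll 32
    for (int i = 0; i < 32; ++i) {
        buffer1[i] = in[threadIdx.x * 32 + i];   // hypothetical body
        buffer2[i] = buffer1[i] * buffer1[i];
    }
    // ... rest of the kernel ...
}
```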
After profiling, I can see the amount of data written back to memory is now less than 200 MB, which makes sense to me. Moreover, there are no local-memory load/store instructions in this section of the code anymore.
As a side note, the code now runs slower, so the compiler was initially right not to unroll the loops and to use local memory (I haven't profiled everything yet; I'm thinking it may be due to instruction fetch misses, though it could be other reasons too).
please post code as marked-up text, not as images
Apologies, I thought in this case it was okay since I was using profiler captures.