I wrote a kernel for the A100 and scheduled it to use fewer than 128 registers per thread (because I allocate 100KB of smem per SM and use 25KB per threadblock, each with 4 warps). However, the compiler uses a stack frame instead of registers to store my data. Here is the ptxas output:
ptxas info : Compiling entry function 'gemm_reg_mma_m16n8k16_pts_256' for 'sm_80'
ptxas info : Function properties for gemm_reg_mma_m16n8k16_pts_256
256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 60 registers, 24800 bytes smem, 404 bytes cmem[0]
Can you find out whether the reason is your limit on the number of registers (even though 68 are still free) or, if not that, the way you access the data?
Are any of the indices not compile-time constants, or different between threads? E.g., do they depend on threadIdx or a for-loop iteration variable? In the latter case, do you unroll the loops? Or do you take pointers to array elements?
Sometimes the reason is also parameter passing to device functions; in that case I would use __forceinline__.
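A minimal sketch of what I mean (the function and array names are made up; whether the stack frame actually appears depends on your exact code and compiler version):

```cpp
// Passing a pointer to a per-thread array into a non-inlined device function
// usually forces the array onto the stack frame, because the callee needs a
// real address to dereference.
__device__ __noinline__ float sum_noinline(const float* v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += v[i];              // v must be addressable -> local memory
    return s;
}

// With __forceinline__ the call is flattened into the caller, so ptxas can
// keep the array elements in registers (as long as all indices end up as
// compile-time constants after unrolling).
__device__ __forceinline__ float sum_inline(const float* v, int n)
{
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < n; ++i)
        s += v[i];
    return s;
}

__global__ void demo(const float* in, float* out)
{
    float acc[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        acc[i] = in[threadIdx.x * 8 + i];

    // Tends to show up as "bytes stack frame" in the ptxas report:
    // out[threadIdx.x] = sum_noinline(acc, 8);

    // Usually keeps acc[] entirely in registers:
    out[threadIdx.x] = sum_inline(acc, 8);
}
```

In the ptxas report the non-inlined variant would typically show a non-zero stack frame for demo(), while the inlined variant usually does not.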
I don’t understand the meaning of “Or do you take pointers to array elements?”. Could you please give me an example? I don’t understand “parameter passing to device functions” either.
When I commented out all the lines related to out_feats, the stack frame usage dropped to 9 bytes.
If pal_idx_1/2 or own_idx_1/2 depend on threadIdx, you cannot use registers. The register indices have to be known at compile time and have to be the same for all threads; they are directly encoded into the instruction. Also, there are 2 (?) register banks, and ptxas has to take care that they are not accessed at the same time (it could also be that concurrent access is possible, but slower). A small illustration follows the list of alternatives below.
Either rewrite the algorithm so the indices no longer depend on threadIdx,
or use __shfl_sync to load by index (using thread indices 0…31),
or use shared memory (even if you are already limited),
or use global or local memory and rely on the L1 cache (which is difficult to do in a coalesced way if all threads have different indices),
or try to rework the algorithm to use the tensor cores for re-sorting the data.
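To make the register-indexing constraint concrete, here is a minimal sketch (the array and index names are hypothetical):

```cpp
__global__ void index_demo(const float* in, float* out)
{
    float vals[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        vals[i] = in[threadIdx.x * 4 + i];  // i is a constant after unrolling -> registers

    // Compile-time-constant index: vals[2] maps to one specific register,
    // which can be encoded directly into the instruction.
    float a = vals[2];

    // threadIdx-dependent index: there is no single register the instruction
    // could encode, so ptxas typically places vals[] in local memory.
    int dyn = threadIdx.x & 3;
    float b = vals[dyn];

    out[threadIdx.x] = a + b;
}
```

The two loads look identical in C++, but only the first one can be encoded as a fixed register operand; the second one has to go through addressable memory.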
I don’t understand the point of shfl.sync() yet. I used to believe that it can work without local memory as long as the variable used in the function is a fixed one, so I moved the data into a fixed variable src.
In my opinion, src is a fixed register; it just holds different values at different times.
Do you mean that the index must be fixed, not just src?
You do not have to use __shfl_sync. It is a tool to get data by dynamic indices:
You have 32 src values, one for each of 32 threads of a warp.
With __shfl_sync each thread can dynamically select one of the values (by choosing which thread to read from).
This is possible without local memory.
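A minimal sketch (the way the source lane is computed here is just an example):

```cpp
__global__ void shfl_demo(const float* in, const int* lane_of, float* out)
{
    // Each lane keeps exactly one value in a register.
    float src = in[threadIdx.x];

    // Runtime-computed source lane; it may differ per thread.
    int src_lane = lane_of[threadIdx.x] & 31;

    // Every thread receives the 'src' register of its chosen lane.
    // The exchange happens inside the warp, no local memory involved.
    float picked = __shfl_sync(0xffffffffu, src, src_lane);

    out[threadIdx.x] = picked;
}
```

The value each thread receives is the src register of the lane it selected, and the selection index can be any runtime-computed number from 0 to 31.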
I don’t get it. What confuses me most is that I don’t know why, if pad_idx_1/2 and own_idx_1/2 depend on threadIdx, the compiler chooses to use local memory.
I mentioned shfl.sync() just because I guessed that maybe the dependency conflicts with the use of this function.
Do you mean that shfl.sync() will use local memory anyway?
Regarding “The register indices have to be known at compile-time and have to be the same for all threads. They are directly encoded into the instruction”: I’m sorry, I still don’t understand which function has this requirement.
The compiler has to be able to convert an array into individual variables.
I.e. out_feats[idx] into half2 out_feats0, out_feats1, out_feats2, ...
If this is not possible, it has to use local memory.
The variable has to be the same for each thread.
If you use calculated indices for idx and the compiler cannot reduce it to a constant number, it cannot point to a specific variable.
Whereas with shfl.sync() you can load data from any of the 32 threads (your own thread or the 31 others), and you can use a dynamically calculated number to specify from which thread.
I get it. So it is hard to use registers to store the mma results, as I would need to know the absolute index of the register I need each time, which is hard, especially in sparse conditions.
I still have one question about shfl.sync(). If many threads require data from the same thread, will it cost much more time than when each thread requires data from a different thread?
If your input matrix is sparse, you can know beforehand or re-sort rows/columns to use only non-zero elements.
If your output matrix is sparse, I would just save everything or use an algorithm with shared memory to detect empty elements.
If your output matrix is not fully needed and you know the elements beforehand, you can try to create a hard-coded algorithm, but it will be difficult (without shared memory), as each thread has different matrix elements after mma.
No, shfl.sync() will always have the same speed, regardless of whether you use it for broadcasting, each value is read exactly once, or something in between.
I want to write the code in this way because I still want to use registers to store the values of the points’ different channels. I need to store 128 points with 16 channels per warp, so each thread stores 4 points with 16 channels. Then, before each mma operation, I can use shfl.sync to get the value I need from a particular thread.
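Roughly what I have in mind (just a sketch; all names and the slot layout are my own assumptions):

```cpp
#include <cuda_fp16.h>

// Each of the 32 lanes owns 4 points x 16 channels = 32 half2 values in
// registers. To fetch point 'point_id', channel pair 'ch_pair' from the lane
// that owns it, I loop over the slot index with compile-time constants (so
// my_feats[] can stay in registers after inlining) and let shfl.sync handle
// the dynamic part, i.e. which lane to read from.
__device__ __forceinline__ half2 fetch_channel(const half2 my_feats[32],
                                               int point_id, int ch_pair)
{
    int owner_lane = point_id >> 2;                // 4 points per lane
    int want_slot  = (point_id & 3) * 8 + ch_pair; // this slot layout is my assumption

    half2 result = __float2half2_rn(0.0f);
    #pragma unroll
    for (int s = 0; s < 32; ++s) {                 // s is a constant in each iteration
        half2 v = __shfl_sync(0xffffffffu, my_feats[s], owner_lane);
        if (s == want_slot)
            result = v;                            // keep only the slot I need
    }
    return result;
}
```

This costs one shuffle per slot, so in practice I would try to fetch a whole fragment at once instead of one half2 at a time.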