Question about: kernel optimization, ptxas register usage, branch divergence, warm-up kernel runs


Details of the hardware I am working on:
(1) A100 Tensor Core GPU (A100 Tensor Core whitepaper)
(2) static shared memory: 48 KB per block
(3) maximum concurrent threads per SM = 2048
(4) maximum threads per block = 1024
(5) maximum registers per SM = 256 KB = 65536 (32-bit) registers
(6) maximum registers a thread can use = 255

I am in the process of optimizing a CUDA kernel. This is my first attempt at kernel optimization and I have a few questions. I would appreciate your help with them.

Section 1: When I compile my code with the option "-Xptxas -v", the output is the following:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z23find_block_pivot_kernelILi32ELi32EEviiPfiPiiS0_S0_' for 'sm_80'
ptxas info    : Function properties for _Z23find_block_pivot_kernelILi32ELi32EEviiPfiPiiS0_S0_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 4480 bytes smem, 408 bytes cmem[0]

My block size is dim3(32, 32), i.e., 1024 threads per block.

For 100% occupancy, a thread block of 1024 threads can use at most 32 registers per thread (65536 registers / 2048 threads). If "-Xptxas -v" reports all registers used per thread, then 32 registers per thread gives 100% occupancy. But if some additional registers are used implicitly per thread and not reported by "-Xptxas -v", then my occupancy will fall to 50%.

Question: Does "-Xptxas -v" report all registers used per thread per kernel function, or are there additional default registers used implicitly per thread that "-Xptxas -v" does not report? To determine the correct occupancy, I need accurate register usage.
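(For a cross-check, independent of what the report includes: the occupancy API accounts for registers, shared memory, and allocation granularity together. A minimal sketch, using the kernel from the ptxas output above; error checking omitted:)

int numBlocks = 0;
// Ask the runtime how many 1024-thread blocks of this kernel fit on one SM.
// 2 blocks x 1024 threads = 2048 threads = 100% occupancy on A100.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks,
    find_block_pivot_kernel<32, 32>,   // kernel from the ptxas output above
    1024,                              // threads per block, dim3(32, 32)
    0);                                // dynamic shared memory per block
printf("active blocks per SM: %d\n", numBlocks);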

Section 2: Register usage

Goal of the pseudo code: find the index of the maximum absolute value in each column of the sub-matrix assigned to a block (the block-local absolute maximum, not the global one). The search range within a column runs from row k to row 256, where int k varies from 0 to 30. The sub-matrix assigned to the block is 256 rows by 33 columns, stored row-wise (33 columns instead of 32 to avoid bank conflicts when accessing columns).

Original code structure and flow:

__global__ void find_block_pivot_kernel( /* some parameters */ ) {
    // some code

    float max = 0; int pivot_row_index, index = 0;                    //line_1
    __shared__ float As[256 * 33];   // 256 rows, pitch 33 (33 instead of 32 to avoid bank conflicts)
    const int ldas = 32;             // pitch is ldas + 1 = 33
    const int i = threadIdx.x, j = threadIdx.y;

    // copy data from device memory to shared memory
    // some code

    for (int k = 0; k < 31; k++) {
      if (threadIdx.y == k) { // a single warp works on finding the max in column k
        max = fabsf(As[k * (ldas + 1) + k]);   // start from the diagonal element (row k, col k)
        index = k * (ldas + 1) + k;                                   //line_2
        #pragma unroll
        for (int it = 0; it < 8; it++) {
          // each thread visits 8 elements of the 256-element column, 32 rows apart
          if (threadIdx.x + it * 32 < mm && threadIdx.x + it * 32 >= k) {   // mm: number of valid rows (kernel parameter)
            float value = fabsf(As[threadIdx.x * (ldas + 1) + k + it * 32 * (ldas + 1)]);
            if (value > max) {
              max = value;
              index = i * 33 + k + it * 32 * 33;                      //line_3
            }
          }
        }
        int max_vertex = warpAllReduceFindPivotRow(max, index);       //line_4
        // some code executed by the other warps
      }
      // copy data from shared memory to device memory
    }
}

ptxas info    : Used 38 registers, 4352 bytes smem, 408 bytes cmem[0]
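(warpAllReduceFindPivotRow is not shown in the post. For readability, a hypothetical implementation of such a warp-wide argmax reduction might look like the following sketch; this is an assumption, not the poster's actual code:)

// Hypothetical stand-in for the undisclosed warpAllReduceFindPivotRow:
// butterfly reduction that keeps the index belonging to the largest value.
__device__ int warpAllReduceFindPivotRow(float max, int index) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other_max   = __shfl_xor_sync(0xffffffff, max,   offset);
        int   other_index = __shfl_xor_sync(0xffffffff, index, offset);
        if (other_max > max) { max = other_max; index = other_index; }
    }
    return index;   // every lane now holds the winning index
}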

If I replace the thread-local variable 'int index' with the shared memory array '__shared__ int index[32]', register usage drops from 38 registers per thread in the above case to 32 registers per thread.

// replace 'int index' with '__shared__ int index[32]' in the following lines
    __shared__ int index[32];              //line_1
    for (int k = 0; k < 31; k++) {
      if (threadIdx.y == k) {
        max = fabsf(As[k * (ldas + 1) + k]);
        index[i] = k * (ldas + 1) + k;     //line_2
        #pragma unroll
        for (int it = 0; it < 8; it++) {
          // each thread visits 8 elements of the 256-element column, 32 rows apart
          if (threadIdx.x + it * 32 < mm && threadIdx.x + it * 32 >= k) {
            float value = fabsf(As[threadIdx.x * (ldas + 1) + k + it * 32 * (ldas + 1)]);
            if (value > max) {
              max = value;
              index[i] = i * 33 + k + it * 32 * 33;    //line_3
            }
          }
        }
        int max_vertex = warpAllReduceFindPivotRow(max, index[i]);    //line_4
      }
    }

ptxas info    : Used 32 registers, 4480 bytes smem, 408 bytes cmem[0]

Question: I am not using async copies (which bypass registers when copying to shared memory). Hence, I cannot figure out why using the shared memory variable 'int index[32]' instead of the thread-local variable 'int index' reduces register usage. Can you please explain why fewer registers get used, and how I can investigate this?

Scenario: Earlier I used the thread-local variable index, but it led to over-usage of registers (38 per thread), which reduces occupancy to 50%. So I instead used the shared memory variable '__shared__ int index[32]', which reduces register usage to 32 per thread and should theoretically give 100% occupancy. But shared memory is slower than registers, so using shared memory variables should also slow down performance.

Question: So how do I decide between more thread-local variables (registers, fastest) and shared memory (slower than registers)? Roughly what minimum occupancy is acceptable, below which it becomes worth trading thread-local variables for shared memory variables?

Section 3: Divergent warp

Section 3_1 :
Scenario: a single warp, single-precision data

if (value > max) {
  max = value;
  index = i * 33 + k + it * 32 * 33;
}
// threads of the warp that do not enter the if statement wait here for the rest to complete the if body

Thought process: In the worst case, suppose the if condition is true for the even threadIdx.x lanes of a warp and false for the odd lanes. The code section is then divergent, since threads follow two different paths. But in my view, even though it is divergent, this should still run at full speed and not hurt performance, because the threads that did not enter the if body have no else body to execute. Hence, the code above should be as fast as non-divergent code in which all threads of the warp execute the if body.

Question: Is my understanding correct, or am I missing something?

Question: If this divergence within the warp does hurt performance, I still see no way to get rid of the if statement other than using a conditional ternary operator. Is there any way to remove the branch and make the code non-divergent?

Section 3_2:

Also, I am confused about the relative performance of an 'if statement' versus a 'conditional ternary operator'.

if (value > max) {
  max = value;
  index = i * 33 + k + it * 32 * 33;
}

OR 

(value > max) ? (max = value, index = i * 33 + k + it * 32 * 33): (max = max, index = index);

Question: How do I decide which one to prefer for performance? At first glance, the conditional ternary operator does not appear to have a branch. Does that make it a non-divergent statement, or does the ternary operator also get compiled to a branch (divergent code)? And how do I investigate this?

Section 4: correct use of loop unrolling

#pragma unroll
for (int k = 0; k < 50; k++) {
  if (threadIdx.y == 0) {    // line_5
    As[i] = global_A[i];     // line_6
  }
}

I have seen some CUDA code do loop unrolling in the above fashion. But to unroll a loop, shouldn't all the statements in the loop body be independent of each other? Here, clearly, line_6 is dependent on line_5: if line_5 evaluates false, line_6 will not execute. So, in this case, there should be no loop unrolling even with '#pragma unroll'.

Question: Is my understanding correct, or am I missing something? Please correct me if my thought process is wrong.
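(For reference, a conceptual sketch of what full unrolling produces here: the compiler replicates the entire body, the if included; a conditional inside the body does not prevent unrolling, since only the trip count needs to be a compile-time constant.)

// Conceptual result of #pragma unroll on the loop above (k is not even
// used in the body, so all 50 copies are identical):
if (threadIdx.y == 0) { As[i] = global_A[i]; }   // k = 0
if (threadIdx.y == 0) { As[i] = global_A[i]; }   // k = 1
// ... 48 more copies ...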

Section 5: warm up

I have to benchmark my code for performance, and I have read that a warm-up is necessary. I have three kernels, so in my 'int main' I need to run all three kernels in a loop to benchmark them.

Do I do the warm-up only once, before running my kernels in the benchmarking loop?

int main() {

  // warm up all three kernels once
  launch_kernel_1<<<grid, block>>>();
  launch_kernel_2<<<grid, block>>>();
  launch_kernel_3<<<grid, block>>>();

  // launch kernels for benchmarking
  for (int i = 0; i < k; i++) {
    launch_kernel_1<<<grid, block>>>();
    launch_kernel_2<<<grid, block>>>();
    launch_kernel_3<<<grid, block>>>();
  }
}

OR

Or do I run the warm-up kernels inside the benchmarking loop, once per benchmark iteration?

int main() {

  for (int i = 0; i < k; i++) {
    // warm up all three kernels
    launch_kernel_1<<<grid, block>>>();
    launch_kernel_2<<<grid, block>>>();
    launch_kernel_3<<<grid, block>>>();

    // launch kernels for benchmarking
    launch_kernel_1<<<grid, block>>>();
    launch_kernel_2<<<grid, block>>>();
    launch_kernel_3<<<grid, block>>>();
  }
}

It reports all of them in the case where you are not compiling with -rdc=true. When you compile with -rdc=true, the kernel report should be correct, but the individual device functions will only report their own register usage. The only other factor to consider is allocation granularity, but for a number like 32, there should be no allocation granularity issue. Allocation granularity means that if the report from nvcc indicates 34 registers needed, and the allocation granularity is 4, then the actual registers used might be 36 (rounded up to a whole multiple of 4).
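For example (a sketch; the file name is a placeholder):

nvcc -arch=sm_80 -Xptxas -v kernel.cu             # whole-program compilation: per-kernel register totals
nvcc -arch=sm_80 -rdc=true -Xptxas -v kernel.cu   # separate compilation: device functions report only their own usage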

In the general case, this requires SASS analysis. There are many questions on these forums that discuss analyzing the SASS code.
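A typical starting point (the file names are placeholders):

cuobjdump -sass ./a.out        # disassemble all kernels in the executable
cuobjdump -sass kernel.o       # or in an object file built with nvcc -c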

The final determinant is performance of your code. You should benchmark each case and test/measure performance, if you want to be sure. Prior to that, you should have some confidence that the compiler will try to make choices about register usage with performance as a priority.
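For experiments like the one in your Section 2, you can also test the two occupancy points without restructuring the code, by capping registers with __launch_bounds__. A sketch (not from the original post), using the block shape from the question:

// Cap registers so that two 1024-thread blocks fit per SM
// (65536 / 2048 = 32 registers per thread on A100). The compiler may
// spill to local memory to meet the cap; that spill cost is exactly
// what the benchmark then measures.
template <int M, int N>
__global__ void __launch_bounds__(1024, 2)   // max threads/block, min blocks/SM
find_block_pivot_kernel(/* same parameters as in the question */)
{
    // ... kernel body unchanged ...
}

The coarser alternative is the nvcc flag -maxrregcount=32, which applies to every kernel in the compilation unit.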

Yes, an if without an else clause, especially for the short test and body indicated here, is probably a best case. The compiler uses a GPU hardware feature called predication to allow conditional behavior without explicit jumps as you might see in CPU code. This helps in a number of ways. Some general advice I would offer at this point is that this is starting to devolve into low-level optimizations which might be premature. Before spending a lot of time thinking about alternatives for such a simple piece of code, you might want to confirm that the thing you are focusing on is truly a performance bottleneck for your code. That usually would involve using a profiler.
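To see the branchless form in source terms, the if can be rewritten as two selects on the same predicate (a sketch; whether the compiler actually emits branchless code is confirmed by inspecting the SASS):

// Semantically identical to the if: each assignment selects between the
// new and the old value based on one comparison, which maps naturally to
// predicated or select instructions.
bool gt = value > max;
max   = gt ? value : max;
index = gt ? i * 33 + k + it * 32 * 33 : index;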


Again, this would probably require SASS analysis to confirm, but in general the compiler does not use branching to achieve such conditional behavior, for cases like this. Stressing over use of a ternary operator vs. if…else is probably not a good use of your time. The compiler is likely to compile either form to similar machine code.

I generally write my code to run a warm-up kernel only once, regardless of how many times I use it. Others might have different views. For example, if the memory footprint of your kernel changes drastically from run to run, but you have a sequence of kernel calls that use the same footprint, then running the warm-up once per memory footprint may yield useful information for careful benchmarking.
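A minimal skeleton of the once-only warm-up pattern, timed with CUDA events (a sketch; kernel names and launch configurations are placeholders from the question, and error checking is omitted):

const int k = 100;                    // number of timed iterations (placeholder)

// One warm-up pass pays the one-time costs (module load, caches, TLBs).
launch_kernel_1<<<grid, block>>>();
launch_kernel_2<<<grid, block>>>();
launch_kernel_3<<<grid, block>>>();
cudaDeviceSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < k; i++) {
    launch_kernel_1<<<grid, block>>>();
    launch_kernel_2<<<grid, block>>>();
    launch_kernel_3<<<grid, block>>>();
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average per iteration: %f ms\n", ms / k);
cudaEventDestroy(start);
cudaEventDestroy(stop);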

It is highly likely, but not guaranteed, that a ternary operator compiles to branchless code. The compiler first transforms the source code into an abstract internal representation. At that level, a simple if-then-else may not be distinguishable from a ternary operator. Both may map to a "select" type idiom.

Striving for branchless code, the compiler can now implement the select idiom in a variety of ways considering the machine-specific operations available to it.

(1) Map to simple arithmetic and logic instructions. Simple examples would be use of min and max instructions, AND-ing or OR-ing with a mask. This is not all that common with GPUs but very common on CPU architectures such as RISC-V.

(2) Use of instruction predication. This is quite common on GPUs, but from looking at much generated code it seems to have been de-emphasized in recent years for reasons I do not know. A similar technique, based on condition codes instead of predicates is commonly used on ARM CPUs.

(3) Use of select-type instructions. The compiler emits code for both sides of a if-then-else or ternary operator, then selects the appropriate result at the end. This is commonly observed in machine code for the GPU, increasingly so in recent years.

(4) Use of conditional move instructions. Very commonly used with x86-64 CPUs. GPUs used to have at least one instruction of this type, but as I recall it was removed several architecture generations ago, presumably because there are now more general select instructions.

(5) Use of a branch. This may be warranted if one side of a two-way decision is much more commonly taken than the other (see also the [[likely]] and [[unlikely]] attributes in C++), or if there is no efficient branchless construct available (this latter scenario is not uncommon with current RISC-V processors).
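As a concrete instance of (1) applied to the code from Section 3 (a sketch): the max-tracking half can be expressed directly through a hardware max instruction; only the index update still needs a select.

max = fmaxf(max, value);   // compiles to a single float max instruction, branchless by construction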

In general I would recommend writing source code in a natural way and letting the compiler worry about the best way to map to the specific machine instructions available, which may change over time, as GPUs are not subject to binary compatibility requirements. This is no different from writing code for the host, especially so since compilers for most host system processor architectures use the same LLVM framework also used by the CUDA compiler.

The exception to this rule would be a critical optimization issue (e.g. in an innermost loop) paired with an expert GPU programmer capable of analyzing the details of the generated SASS (machine code).

1 Like