To avoid warp divergence

When using tensor cores, there are always some padding conditions to deal with. I need to load data from smem in the normal case and set the register value to 0 in the padding case. Is there any way to avoid warp divergence? The condition may differ between threads.

Either have a suitable address (suitable: no bank conflicts, except when several threads read the same address) with the value 0 stored in shared memory and read a 0 from there,

or use a select operation (typically via the ternary ?: operator in C/C++) after reading any suitable address with any value (the select sets the output either to the read value or to 0),

or use a min operation, bitcasting or reinterpret_casting to an unsigned int or unsigned short vector, after reading any suitable address (min with 0 sets the value to 0, min with 0xFF… keeps the value),

or use an AND operation similarly.

There will not be much divergence nor many additional instructions (and those will be arithmetic and fast), as long as you can efficiently determine or calculate the condition that decides whether padding is needed. E.g. try to avoid memory operations just for knowing the padding. A small sketch of the select and mask variants follows below.
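
For illustration, a minimal sketch of the select and AND-mask variants, assuming a shared array smem and a per-thread index addr (negative meaning padding); all names here are placeholders, not from your code:

// Select variant: the compiler typically turns this into a predicated load or a SEL instruction.
__device__ unsigned int load_or_zero(const unsigned int* smem, int addr)
{
    bool in_range = (addr >= 0);              // cheap arithmetic condition, no extra memory access
    return in_range ? smem[addr] : 0u;
}

// AND-mask variant: always read a valid address, then zero the value with a mask.
__device__ unsigned int load_and_mask(const unsigned int* smem, int addr)
{
    unsigned int mask = 0u - (unsigned int)(addr >= 0);   // 0xFFFFFFFF for real data, 0x0 for padding
    return smem[max(addr, 0)] & mask;                     // max() keeps the address readable for padding lanes
}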

I don’t understand the ternary method well. Does that mean:

val = (addr >= 0) ? smem[addr] : 0;

Yes, exactly.

It is typically translated as

set the variable to zero,
set the predicate from the condition,
load memory into the register (activated only for lanes whose predicate is true).

So it costs two arithmetic operations, which normally are cheap when shared memory is involved: you often have 2 integer arithmetic operations per cycle per SM, but only 1 shared memory access per cycle per SM. Even more so for global memory.

Further small optimizations are possible:
If you load in a loop, you could peel the loop (unroll the first iteration) and exploit that padding can only happen in the first iteration; a sketch follows below.
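
A small sketch of that peeling, assuming (hypothetically) that only the first iteration can touch the padding region; N_ITER, base, smem and process() are placeholders:

int addr = base + threadIdx.x;                 // may be negative only in the first iteration
process((addr >= 0) ? smem[addr] : 0u);        // peeled first iteration pays for the check

for (int i = 1; i < N_ITER; i++) {             // remaining iterations: no condition needed
    addr += 32;
    process(smem[addr]);
}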

But those arithmetic operations are seldom the bottleneck, especially if they are integer operations and your actual computations run on the tensor cores or in FP32.

Sometimes it can be optimized further:

val = smem[addr];
val2 += (addr >= 0) ? val : 0;

load shared memory into the register,
set the condition in a predicate register,
do the addition (only if the predicate is true).

You only have one additional instruction

That only works if you either pad smem at the beginning with additional values or place some other shared memory arrays there, which can be safely read even if addr is negative.

(Perhaps you have to reformulate the code with an if instead of the ternary operator for the compiler to produce that optimized translation.)
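
A minimal sketch of such a layout, with PAD and TILE as hypothetical compile-time constants; the logical tile starts PAD elements into the physical array, so indices down to -PAD remain readable:

constexpr int PAD  = 32;                         // assumed maximum padding width
constexpr int TILE = 96;                         // assumed number of real values
__shared__ unsigned int smem_raw[PAD + TILE];    // PAD extra elements in front act as a safe landing zone
unsigned int* smem = smem_raw + PAD;             // smem[-PAD .. -1] is still valid memory (contents do not matter)

// ... fill smem[0 .. TILE-1] ...

unsigned int val2 = 0;
unsigned int val  = smem[addr];                  // safe even for addr in [-PAD, -1]
val2 += (addr >= 0) ? val : 0;                   // padding values never enter the sum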

Thanks!

Also I have another question about the mechanism of threads in a single warp. If I load data as uint4, what will happen for all 32 threads? Will they be activated together and load 512 B of data in one cycle?

(Not related, but see the updated trick at the end of the answer above, which I edited.)

Reading uint4 from shared memory means that the addresses have to be aligned to 16-byte boundaries.

Further, it takes 4 times as long (4 cycles), I think even in the case where most of the lanes do not participate.

You can get 4-way bank conflicts. When reading with 32 lanes at 16-byte alignment, the bank conflicts are unavoidable anyway, but it is nice that we can choose which threads have them.

The overhead (number of transactions and number of instructions) is slightly lower, which can give a few percent indirect speed-up.
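
For illustration, a minimal sketch of such a vectorized read (array name and sizes are assumptions); each lane reads 16 aligned bytes, so the whole warp moves 512 B per load instruction, spread over 4 shared memory cycles:

__shared__ __align__(16) unsigned int smem[32 * 4];       // 512 B tile, 16-byte aligned

const uint4* smem_vec = reinterpret_cast<const uint4*>(smem);
uint4 v = smem_vec[threadIdx.x % 32];                     // lane i reads bytes [16*i, 16*i + 15]
// v.x, v.y, v.z, v.w now hold 4 consecutive unsigned ints per lane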

By the way:

If you e.g. load 96 values with 32 threads altogether, but in four iterations (so actually 128 slots), and want to do padding with 0 at the beginning and at the end, then better combine the load operations of the first and the fourth iteration.

And split the loaded values onto two variables afterwards. This needs slightly more instructions for computing the addresses and setting the variables, but you save one access to shared memory, which is typically more valuable for optimizing performance.
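
A rough sketch of that combination, assuming (hypothetically) 96 real values padded to 128 logical slots with 16 zeros in front and 16 at the back; the first and fourth iteration together contain only 32 real values, so one shared memory read covers both:

__shared__ unsigned int smem[96];                // the 96 real values (logical slots 16..111)

bool lower = (threadIdx.x < 16);
// lanes 0..15 fetch the real half of iteration 4 (smem[80..95]),
// lanes 16..31 fetch the real half of iteration 1 (smem[0..15]):
unsigned int combined = smem[lower ? (threadIdx.x + 80) : (threadIdx.x - 16)];

unsigned int it1 = lower ? 0u : combined;        // iteration 1: lanes 0..15 are leading padding
unsigned int it4 = lower ? combined : 0u;        // iteration 4: lanes 16..31 are trailing padding

unsigned int it2 = smem[threadIdx.x + 16];       // iteration 2: all real data
unsigned int it3 = smem[threadIdx.x + 48];       // iteration 3: all real data

That is 3 shared memory reads instead of 4, at the cost of a few extra integer and select instructions.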

Or alternatively do it the following way:

int offset = 6; // 0..31
int idx = threadIdx.x + (threadIdx.x < offset) * 32;   // lanes below offset wrap to the end of the data

for (int i = 0; i < 3; i++) {
   unsigned int v = smload[idx];   // smload: the shared memory array holding the real values
   // process v
   idx += 32;
}

0 is no longer the smallest idx (the smallest index read is offset); the lanes below offset continue at the end of the data instead.
You make better use of each load operation and fully utilize the computations: no empty data.
And it is one iteration less.