On an A100, I am trying to use the ldmatrix instruction to load data. I noticed the following sentence in the PTX ISA 8.7 documentation:
"For .target sm_75 or below, all threads must contain valid addresses. Otherwise, the behavior is undefined. For .num = .x1 and .num = .x2, addresses contained in lower threads can be copied to higher threads to achieve the expected behavior."
What does the GPU do with an invalid address on sm_80? Does it just pad with 0?
For example, I need to load data from smem[addr] when the computed addr is positive; otherwise the data should be padded with 0. Can I just pass addr as a parameter of ldmatrix to avoid the branch instruction?
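ldmatrix with .x1 or .x2 loads less data and therefore needs fewer addresses to load from: .x1 reads row addresses from threads 0-7 only, .x2 from threads 0-15, and .x4 from all 32 threads.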
There is no padding with 0.
On sm_80, the addresses that are not used may be invalid. The addresses that are used must be valid.
So there is no feature to exploit for invalid addresses.
CUDA has a select instruction, so there would not be an actual branch. You can store zero values in shared memory and redirect your address to that location if the index is negative.
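A minimal sketch of that idea, in case it helps. Everything here is illustrative rather than taken from your code: the kernel name load_with_zero_row, the 8x8 tile of halfs, the extra zero row at index ROWS, and the row_index array (negative means "pad with 0") are all assumptions. It uses ldmatrix .x1 through inline PTX, so it needs sm_75 or newer (for example nvcc -arch=sm_80).

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical layout: 8 rows of 8 halfs, followed by one all-zero row.
constexpr int ROWS = 8;
constexpr int COLS = 8;

__global__ void load_with_zero_row(const int* row_index, float2* out)
{
    // Each row address passed to ldmatrix must be 16-byte aligned.
    __shared__ __align__(16) __half tile[(ROWS + 1) * COLS];

    // Element (r, c) = r * 8 + c; the extra row at index ROWS is zero.
    for (int i = threadIdx.x; i < (ROWS + 1) * COLS; i += blockDim.x)
        tile[i] = __float2half(i < ROWS * COLS ? float(i) : 0.0f);
    __syncthreads();

    // For .x1 only lanes 0-7 supply row addresses. The other lanes reuse
    // the lower lanes' addresses, which also satisfies the sm_75 rule.
    int r = row_index[threadIdx.x % 8];

    // Redirect negative indices to the zero row. A ternary like this is
    // normally lowered to a select, not a branch (assumes r < ROWS).
    int safe_row = (r >= 0) ? r : ROWS;
    uint32_t addr = static_cast<uint32_t>(
        __cvta_generic_to_shared(&tile[safe_row * COLS]));

    uint32_t frag;
    asm volatile("ldmatrix.sync.aligned.m8n8.x1.shared.b16 {%0}, [%1];\n"
                 : "=r"(frag)
                 : "r"(addr));

    // Each lane receives two consecutive halfs of one row.
    __half2 h = *reinterpret_cast<__half2*>(&frag);
    out[threadIdx.x] = __half22float2(h);
}

int main()
{
    // Negative entries mark rows that should read as zeros.
    const int h_rows[8] = {0, -1, 2, 3, -1, 5, 6, 7};
    int* d_rows = nullptr;
    float2* d_out = nullptr;
    cudaMalloc(&d_rows, sizeof(h_rows));
    cudaMalloc(&d_out, 32 * sizeof(float2));
    cudaMemcpy(d_rows, h_rows, sizeof(h_rows), cudaMemcpyHostToDevice);

    load_with_zero_row<<<1, 32>>>(d_rows, d_out);

    float2 h_out[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int t = 0; t < 32; ++t)
        printf("lane %2d: %5.1f %5.1f\n", t, h_out[t].x, h_out[t].y);

    cudaFree(d_rows);
    cudaFree(d_out);
    return 0;
}
```

Going by the m8n8 fragment layout in the PTX documentation, lane i holds two elements of matrix row i / 4, so the lanes that correspond to the rows marked -1 should print zeros. The cost is one extra 16-byte row of shared memory, and every address ldmatrix actually reads stays valid.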
Thanks! I will try using the ternary operator.