Why is the default load size LDS.32 instead of LDS.128?

  1. Given that we mostly use arrays inside kernels, why is the default load only 32 bits wide? One well-known optimization technique is to explicitly hint the compiler toward 128-bit loads (for example by using float4). Why doesn't CUDA default to 128-bit memory loads, given the significant performance gain?

  2. Is there a way to “enforce” 128-bit loads, for example with a compiler flag?

Whether a 128-bit load can be used is first and foremost conditioned by the source code you write. Considering the question in the absence of any code, as if the compiler had a free choice between a 128-bit and a 32-bit load in any and all situations, is not sensible.

If you have written source code (e.g. using float4) that can be serviced by a 128-bit load, the compiler generally will (and should) use one.
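As an illustrative sketch (not from the original thread): a copy kernel written in terms of float4 gives the compiler both the width and the alignment information it needs, and typically compiles to 128-bit load/store instructions (LDG.E.128 / STG.E.128) in SASS:

```cuda
// Copies n float4 elements (4*n floats). Because float4 is a built-in
// 16-byte type with 16-byte alignment, the compiler can service each
// element with a single 128-bit load/store instead of four 32-bit ones.
__global__ void copy_vec4(const float4 *__restrict__ in,
                          float4 *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```

An equivalent kernel operating on plain float, with each thread handling four consecutive elements, would generally compile to four 32-bit loads instead, because the compiler cannot assume the base address is 16-byte aligned.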

The compiler will generally use a 128-bit load if it can determine that doing so is permissible. If it does not, that often means some aspect of your source code (e.g. use of structures without explicit alignment guarantees) prevents it from proving that a 128-bit load is guaranteed to be permissible.
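To illustrate the structure-alignment point (a sketch, with assumed type names): a plain struct of four floats is 16 bytes in size but only 4-byte aligned, so the compiler cannot prove a 128-bit load is safe; adding an explicit alignment guarantee with __align__(16) makes it eligible (though the compiler is still not obligated to vectorize):

```cuda
// 16 bytes, but alignof == 4: a 128-bit load cannot be proven legal.
struct PlainVec4 { float x, y, z, w; };

// 16 bytes, alignof == 16: a single 128-bit load becomes permissible.
struct __align__(16) AlignedVec4 { float x, y, z, w; };

__global__ void sum_vec4(const AlignedVec4 *__restrict__ in,
                         float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        AlignedVec4 v = in[i];  // eligible for one LDG.E.128
        out[i] = v.x + v.y + v.z + v.w;
    }
}
```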

The compiler can only use a 128-bit load when it can establish that the address handed to it will be properly naturally aligned. It is illegal in hardware to attempt a 128-bit load on a non-naturally-aligned address (i.e. an address that does not fall on a 16-byte boundary).
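For example (illustrative, not from the thread), the common trick of reinterpreting a float pointer as float4 is only legal when the resulting address lands on a 16-byte boundary:

```cuda
__global__ void copy_as_float4(const float *in, float *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        // OK: cudaMalloc returns pointers aligned to at least 256 bytes,
        // so in is 16-byte aligned and in4[i] is a legal 128-bit load.
        const float4 *in4 = reinterpret_cast<const float4 *>(in);
        float4 *out4 = reinterpret_cast<float4 *>(out);
        out4[i] = in4[i];

        // ILLEGAL: in + 1 is only 4-byte aligned; a 128-bit load from
        // it would trigger a misaligned-address error at run time.
        // const float4 *bad = reinterpret_cast<const float4 *>(in + 1);
    }
}
```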

There are a number of forum questions discussing specific cases.

There is no compiler flag to “force” 128-bit loads, whatever you may mean by that.
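While you cannot force them, you can verify whether the compiler emitted 128-bit loads by inspecting the generated SASS with cuobjdump, which ships with the CUDA toolkit (the file name kernel.cu and architecture sm_80 here are assumptions for illustration):

```shell
# Compile to a cubin for a specific architecture, then dump the SASS
# and filter for memory instructions; 128-bit accesses appear with a
# .128 suffix, e.g. LDG.E.128 / STG.E.128.
nvcc -arch=sm_80 -cubin -o kernel.cubin kernel.cu
cuobjdump -sass kernel.cubin | grep -E 'LDG|STG|LDS'
```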

