Why do I never get 128-bit reads from global memory?

Profiling my code reveals that I get a lot of 64-bit reads from global memory, but no 128-bit reads. I work with 3D data and use 32 threads along x and 16 threads along y. Since my data is floats and is stored in x, y, z order, I should get reads of 32 * 4 bytes, but apparently I don’t get any 128-bit reads. Can anyone explain why?

In general, 128-bit reads are used for built-in 128-bit types such as float4 and double2. However, there may be reasons for the compiler not to generate 128-bit loads, listed after the sketch below:
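For reference, here is a minimal sketch (my own example, not code from the question) of a copy kernel that typically does compile to 128-bit loads and stores, assuming the pointers come from cudaMalloc and are therefore 16-byte aligned; the generated SASS can be inspected with cuobjdump -sass:

```
// Minimal sketch: per-thread float4 accesses usually map to 128-bit
// load/store instructions, provided src and dst are 16-byte aligned
// (cudaMalloc returns sufficiently aligned pointers).
__global__ void copy128(const float4 *__restrict__ src,
                        float4 *__restrict__ dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = src[i];   // one 128-bit load per thread
        dst[i] = v;          // one 128-bit store per thread
    }
}
```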

(1) On older platforms, I think sm_1x in general (my memory is hazy), the memory throughput of 128-bit loads could actually be lower than that of 64-bit loads.

(2) If some components of a 128-bit data type are unused (e.g. the code only uses the x, y, z components of a float4), the compiler may determine that the “overfetch” caused by the 128-bit load actually lowers performance (see the first sketch after this list).

(3) Use of 128-bit loads leads to higher register pressure, and thus may conflict with aggressively low register bounds imposed via the -maxrregcount compiler flag or the __launch_bounds__() function attribute (see the second sketch after this list).

(4) Since data must be aligned according to its size (n-byte data on an n-byte boundary), the compiler may need to use narrower loads if that alignment cannot be guaranteed (see the third sketch after this list). From the early days of CUDA I have a vague recollection that 16-byte alignment could not be guaranteed on Win32 platforms with MSVC.
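To illustrate (2), here is a hypothetical kernel (my own construction) that loads a float4 but never touches the w component; in such a case the compiler is free to replace the 128-bit load with narrower loads if its heuristics judge that the overfetch is not worthwhile:

```
// Hypothetical example for (2): only x, y, z of each float4 are used,
// so the compiler may split the access into narrower loads to avoid
// fetching the unused w component.
__global__ void sum_xyz(const float4 *__restrict__ in,
                        float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];
        out[i] = v.x + v.y + v.z;   // v.w is never referenced
    }
}
```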
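To illustrate (3), the register ceiling can be imposed either per kernel or for the whole compilation unit; the numbers below are arbitrary examples, not a recommendation:

```
// Illustration for (3): a float4 held in registers occupies four
// 32-bit registers at once, so tight register limits can discourage
// the compiler from using 128-bit loads.

// Per-kernel limit: at most 256 threads per block, at least 4 blocks
// resident per SM (numbers made up for this sketch).
__global__ void __launch_bounds__(256, 4)
scale_kernel(const float4 *__restrict__ in, float4 *__restrict__ out,
             float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];
        out[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w * s);
    }
}

// Compilation-unit-wide limit via the compiler flag, e.g.:
//   nvcc -maxrregcount=32 kernel.cu
```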
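To illustrate (4), a 128-bit access must sit on a 16-byte boundary, so reinterpreting a plain float pointer as float4 is only safe when both the base address and the offset preserve that alignment (again a sketch of my own, not the asker’s code):

```
// Illustration for (4): float4 accesses require 16-byte alignment.
// Pointers from cudaMalloc are sufficiently aligned, but a pointer
// offset by, say, one float (4 bytes) is not, and the compiler can
// only emit a 128-bit load when it can prove (or is told) alignment.
__global__ void copy_as_float4(const float *__restrict__ src,
                               float *__restrict__ dst, int n4)
{
    // n4 counts float4 elements, i.e. n4 = n / 4; src and dst are
    // assumed to be 16-byte aligned.
    const float4 *src4 = reinterpret_cast<const float4 *>(src);
    float4       *dst4 = reinterpret_cast<float4 *>(dst);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        dst4[i] = src4[i];   // would fault if src/dst were misaligned
    }
}
```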

Note that many compiler decisions are driven by heuristics. It is the nature of heuristics to deliver an “optimal” outcome most of the time, but there could certainly be situations where the compiler makes a suboptimal choice. The mere absence of 128-bit loads in the generated code is not, per se, an indication of any performance issue. If, after careful analysis, you find that use of 128-bit loads would deliver a noticeable performance increase for your application, I would suggest filing a bug.