Shared memory and vectors


Is it possible to load a float4 vector from shared memory into four registers in a single clock cycle? In the PTX manual it looks like ld.shared.v4.f32 is a valid instruction, but I’m not sure whether this takes 1 or 4 clock cycles.


Yes, loading a float4 takes a single instruction issue. In fact, any structure of size 1, 2, 4, 8, or 16 bytes (stored at an address that is a multiple of its size) can be loaded with a single issue. See Section of the Programming Guide for details.


Paulius, this is wrong. While you can load a float4 from global mem to registers with one vector instruction, you cannot do that from shared mem (that was the question) as the 4 floats will live in different banks.
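
To make the bank argument concrete, here is a small host-side sketch of which banks the four components of a float4 would occupy. It assumes a simple model that is not taken from this thread: banks are 4 bytes wide, consecutive 32-bit words map to consecutive banks, and the float4 array starts at shared-memory offset 0. The names bank_of and float4_banks and the num_banks parameter are hypothetical, and the bank count depends on the architecture. The model only shows bank placement; it says nothing about how many clock cycles the hardware needs.

```cpp
#include <array>
#include <cstddef>

// Assumed model: each bank is 4 bytes wide, and consecutive
// 32-bit words are assigned to consecutive banks.
constexpr std::size_t kBankWidthBytes = 4;

// Bank that holds the 32-bit word at the given byte offset.
std::size_t bank_of(std::size_t byte_offset, std::size_t num_banks) {
    return (byte_offset / kBankWidthBytes) % num_banks;
}

// Banks touched by the x, y, z, w components when thread `tid`
// reads element `tid` of a float4 array placed at offset 0.
std::array<std::size_t, 4> float4_banks(std::size_t tid, std::size_t num_banks) {
    std::array<std::size_t, 4> banks{};
    const std::size_t base = tid * 4 * sizeof(float);  // 16 bytes per float4
    for (std::size_t k = 0; k < 4; ++k) {
        banks[k] = bank_of(base + k * sizeof(float), num_banks);
    }
    return banks;
}
```

Under this model with 16 banks, float4_banks(0, 16) and float4_banks(4, 16) both give banks 0, 1, 2, 3, so each thread's float4 spans four banks, and threads 0 and 4 touch the same four.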