Shared memory and vectors

Hi,

Is it possible to load a float4 vector from shared memory into four registers in a single clock cycle? In the PTX manual it looks like ld.shared.v4.f32 is a valid instruction, but I’m not sure whether this takes 1 or 4 clock cycles.

Thanks,
Alex

Yes, loading a float4 takes a single instruction issue. In fact, any structure of size 4, 8, or 16 bytes (stored at an address that is a multiple of its size) can be loaded with a single instruction. See Section 5.1.2.1 of the Programming Guide for details.

Paulius

Paulius, this is wrong. While you can load a float4 from global memory into registers with one vector instruction, you cannot do that from shared memory (which was the question), because the four floats live in different banks.

Peter
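
To make the bank argument concrete, here is a rough host-side sketch in plain C++ (not device code) of how the four components of a float4 would map onto shared memory banks. It assumes the layout of 16 banks, each 32 bits wide, with consecutive 32-bit words in consecutive banks; other hardware generations may differ. The names bank_of and banks_of_float4 are made up for illustration and are not part of any CUDA API, and the sketch says nothing about how many cycles the hardware actually spends on the load.

```cpp
#include <array>
#include <cstddef>

// Assumption: 16 banks, each 4 bytes wide, consecutive 32-bit words in
// consecutive banks. Adjust these constants for other hardware.
constexpr std::size_t kNumBanks = 16;
constexpr std::size_t kBankWidthBytes = 4;
constexpr std::size_t kFloatBytes = 4;  // size of one float component

// Bank holding the 32-bit word at a given shared-memory byte offset.
constexpr std::size_t bank_of(std::size_t byte_offset) {
    return (byte_offset / kBankWidthBytes) % kNumBanks;
}

// Banks touched by the x, y, z, w components of element `index` in a
// hypothetical `__shared__ float4 tile[]` array that starts at offset 0.
std::array<std::size_t, 4> banks_of_float4(std::size_t index) {
    const std::size_t base = index * 4 * kFloatBytes;
    std::array<std::size_t, 4> banks{};
    for (std::size_t c = 0; c < 4; ++c) {
        banks[c] = bank_of(base + c * kFloatBytes);
    }
    return banks;
}
```

Under these assumptions, element 0 spans banks 0 to 3, element 1 spans banks 4 to 7, and element 4 wraps around to banks 0 to 3 again. So if each thread reads its own float4, threads whose indices differ by 4 touch the same banks.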