Shared memory and vectors

alex.barnell · July 11, 2007, 2:18pm

Hi,

Is it possible to load a float4 vector from shared memory into four registers in a single clock cycle? In the PTX manual it looks like ld.shared.v4.f32 is a valid instruction, but I’m not sure whether this takes 1 or 4 clock cycles.

Thanks,
Alex

paulius · July 11, 2007, 7:29pm

Yes, loading a float4 takes a single instruction issue. In fact, any structure of size 4, 8, or 16 bytes (stored at an address that is a multiple of its size) can be loaded with a single issue. See Section 5.1.2.1 of the Programming Guide for details.

Paulius

prkipfer · July 12, 2007, 9:46am

Paulius, this is wrong. While you can load a float4 from global mem to registers with one vector instruction, you cannot do that from shared mem (that was the question) as the 4 floats will live in different banks.

Peter
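
For anyone who wants to check this on their own toolchain, here is a minimal sketch (not from the posters above) that moves float4 values through shared memory. The kernel and helper names are hypothetical, and the block size of 128 is an arbitrary assumption. Compiling with `nvcc -ptx` and searching the output for `ld.shared.v4.f32` should show whether the compiler emits the vector form. It only shows which instruction is generated, not how many cycles the hardware needs or whether bank conflicts serialize the access.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 128  // assumed block size; the kernel expects blockDim.x == THREADS

// Hypothetical kernel: each thread stores one float4 into shared memory,
// then loads the float4 written by the mirrored thread of the same block.
// Assumes the grid covers the input exactly (no bounds check).
__global__ void float4_shared_reverse(const float4* in, float4* out)
{
    // float4 is 16-byte aligned, which a vector load/store requires.
    __shared__ float4 tile[THREADS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                    // candidate for st.shared.v4.f32
    __syncthreads();
    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // candidate for ld.shared.v4.f32
}

// Hypothetical host helper: returns the number of mismatching elements
// (0 means the round trip worked), or -1 on a CUDA error.
int run_float4_shared_reverse()
{
    const int blocks = 4;
    const int n = THREADS * blocks;
    std::vector<float4> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = make_float4(4.0f * i, 4.0f * i + 1.0f, 4.0f * i + 2.0f, 4.0f * i + 3.0f);

    float4 *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float4)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float4)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    float4_shared_reverse<<<blocks, THREADS>>>(d_in, d_out);
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t copy_err =
        cudaMemcpy(h_out.data(), d_out, n * sizeof(float4), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        // Element i should hold the input of the mirrored thread in its block.
        int src = (i / THREADS) * THREADS + (THREADS - 1 - i % THREADS);
        const float4 a = h_out[i], b = h_in[src];
        if (a.x != b.x || a.y != b.y || a.z != b.z || a.w != b.w) ++mismatches;
    }
    return mismatches;
}
```

Reading the mirrored thread's element, rather than the thread's own, is a deliberate choice so that the compiler cannot simply forward the value in registers and skip shared memory.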
