rtBuffer<float4, 3> vs rtBuffer<float, 3> access performance

Izy · August 14, 2014, 5:22pm

I’ve been reading up on memory coalescing and was curious whether accessing data as an rtBuffer<float4, 3> vs. an rtBuffer<float, 3> would make any noticeable performance difference. At first glance, I would think that asking for a single index from the float4 buffer would grab all 4 values at once, requiring fewer memory requests. However, is the compiler smart enough to grab as many consecutive memory addresses as possible in the float case? I know that testing it out (which I’m about to do) gives the concrete answer, but I’d also like to gain insight at the conceptual level.

uint3 index2  = make_uint3( launch_index.x, launch_index.y, 0 );
for (size_t i = 0; i < numfloats; ++i) {
   float_buffer[index2] += 1.f;
   ++index2.z;
}

vs.

uint3 index  = make_uint3( launch_index.x, launch_index.y, 0 );
for (size_t i = 0; i < float4Groups; ++i) {
   float4_buffer[index] += make_float4(1.f, 1.f, 1.f, 1.f);
   ++index.z;
}

-edit-
So on my hardware the float loop takes about 3210 clock cycles and the float4 loop about 1978. That’s about 38% fewer clock cycles, which I guess is a pretty good performance boost.
