While working on some CUDA and OpenCL projects, I was wondering if there are any performance improvements or other reasons to use, say, float3 instead of float* with an array subscript? I did a quick search through the forums and the net, but it's a bit of a vague topic to search for, and I also didn't see anything in the documentation.
float3 is an interesting case because the underlying hardware can issue memory reads of size 32, 64, and 128 bits. Since a float3 is 96 bits, you can easily end up with suboptimal bandwidth because something that looks coalesced:
float3 value = input_data[blockIdx.x * blockDim.x + threadIdx.x];
actually gets turned into two reads per thread: a 64-bit read followed by an offset 32-bit read. The problem is that the 64-bit and 32-bit reads issued within a warp no longer come from contiguous blocks of memory. The 64-bit reads touch words 0, 1, 3, 4, 6, 7, etc., and the following 32-bit reads touch words 2, 5, 8, etc. As a result, you do not reach the same memory bandwidth as you would for float, float2, or float4. In the case of float3, it is in fact faster to read contiguous blocks from an array of floats instead. If you then need to merge that block of floats into one float3 per thread, you can use shared memory as a staging area, as in the sketch below. It is quite possible (though I haven't checked yet) that the L1/L2 cache on Fermi fixes this problem and float3 reads proceed at the same speed as the rest.
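Here is a minimal sketch of that staging trick, assuming the kernel is launched with BLOCK_SIZE threads per block; the kernel and variable names are just illustrative:

// Sketch: stage a block's worth of float3 data through shared memory so the
// global loads stay fully coalesced. Assumes blockDim.x == BLOCK_SIZE.
#define BLOCK_SIZE 256

__global__ void loadFloat3Staged(const float *input, float3 *output, int n)
{
    __shared__ float stage[3 * BLOCK_SIZE];

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x * 3;  // first float covered by this block

    // Coalesced phase: consecutive threads read consecutive floats.
    for (int i = tid; i < 3 * blockDim.x; i += blockDim.x) {
        if (base + i < 3 * n)
            stage[i] = input[base + i];
    }
    __syncthreads();

    // Reassembly phase: each thread builds its float3 from shared memory.
    int idx = blockIdx.x * blockDim.x + tid;
    if (idx < n)
        output[idx] = make_float3(stage[3 * tid],
                                  stage[3 * tid + 1],
                                  stage[3 * tid + 2]);
}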
Otherwise, except for the original compute 1.0 (and 1.1?) devices, where the memory controller was not as fast with float4, there is no significant memory bandwidth difference between float, float2, and float4.
I try to work with whatever struct size moves the unit of data the thread naturally operates on, as long as it is 32, 64, or 128 bits wide. It tends to be easier to read. If the data size does not map onto one of those widths, then I use the standard trick of converting an array of structs into a struct of arrays. Then coalesced reading is also easy, and does not require shared memory or __syncthreads(). For example, you would turn this:
float3 *points;
into this
struct {
    float *x;
    float *y;
    float *z;
} points;
(I’m playing fast and loose with the struct syntax, but you get the idea. You have to allocate all three of these arrays with cudaMalloc() separately.)
Then instead of points[i].x, you access points.x[i], and as long as i takes consecutive values in adjacent threads, you’ll get perfect coalescing.
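Here's a minimal, self-contained sketch of that layout; the Points struct and the scale() kernel are just illustrative names, not anything standard:

#include <cuda_runtime.h>

// Struct-of-arrays layout: three separate device arrays, one per component.
struct Points {
    float *x;
    float *y;
    float *z;
};

__global__ void scale(Points points, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Adjacent threads touch adjacent elements of each array: fully coalesced.
        points.x[i] *= s;
        points.y[i] *= s;
        points.z[i] *= s;
    }
}

int main()
{
    const int n = 1 << 20;
    Points points;

    // Each component gets its own allocation.
    cudaMalloc((void **)&points.x, n * sizeof(float));
    cudaMalloc((void **)&points.y, n * sizeof(float));
    cudaMalloc((void **)&points.z, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(points, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(points.x);
    cudaFree(points.y);
    cudaFree(points.z);
    return 0;
}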