While working on some CUDA and OpenCL projects, I was wondering if there are any performance improvements or other reasons to use, say, float3 instead of float* with an array subscript? I did a quick search through the forums and the net, but it's a bit of a vague topic to search for, and I also didn't see anything in the documentation.
float3 is an interesting case because the underlying hardware can issue memory reads of size 32, 64, and 128 bits. Since a float3 is 96 bits, you can easily end up with suboptimal bandwidth because something that looks coalesced:
float3 value = input_data[blockIdx.x * blockDim.x + threadIdx.x];
actually gets turned into two reads per thread: a 64-bit read followed by an offset 32-bit read. The problem is that the 64-bit and 32-bit reads issued within a warp no longer come from contiguous blocks of memory. The 64-bit reads touch words 0, 1, 3, 4, 6, 7, etc., and the following 32-bit reads touch words 2, 5, 8, etc. As a result, you do not reach the same memory bandwidth as you would for float, float2, or float4. In the case of float3, it is in fact faster to read contiguous blocks from an array of floats instead. If you then need to merge that block of floats into one float3 per thread, you can use shared memory as a staging area, as in the sketch below. It is quite possible (though I haven't checked yet) that the L1/L2 cache on Fermi fixes this problem and float3 reads proceed at the same speed as the rest.
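Here is a minimal sketch of that staging trick, assuming the kernel is launched with BLOCK_SIZE threads per block; the kernel and variable names are just illustrative:

// Sketch: stage a block's worth of float3 data through shared memory so the
// global loads stay fully coalesced. Assumes blockDim.x == BLOCK_SIZE.
#define BLOCK_SIZE 256

__global__ void loadFloat3Staged(const float *input, float3 *output, int n)
{
    __shared__ float stage[3 * BLOCK_SIZE];

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x * 3;  // first float covered by this block

    // Coalesced phase: consecutive threads read consecutive floats.
    for (int i = tid; i < 3 * blockDim.x; i += blockDim.x) {
        if (base + i < 3 * n)
            stage[i] = input[base + i];
    }
    __syncthreads();

    // Reassembly phase: each thread builds its float3 from shared memory.
    int idx = blockIdx.x * blockDim.x + tid;
    if (idx < n)
        output[idx] = make_float3(stage[3 * tid],
                                  stage[3 * tid + 1],
                                  stage[3 * tid + 2]);
}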
Otherwise, except for the original compute 1.0 (and 1.1?) devices, where the memory controller was not as fast with float4, there is no significant memory bandwidth difference between float, float2, and float4.
I try to work with whatever struct size moves the unit of data the thread naturally operates on, as long as it is 32, 64, or 128 bits wide. It tends to be easier to read. If the data size does not map onto one of those widths, then I use the standard trick of converting an array of structs into a struct of arrays. Then coalesced reading is also easy, and does not require shared memory or __syncthreads(). For example, you would turn this:
float3 *points;
into this
struct {
    float *x;
    float *y;
    float *z;
} points;
(I’m playing fast and loose with the struct syntax, but you get the idea. You have to allocate all three of these arrays with cudaMalloc() separately.)
Then instead of points[i].x, you access points.x[i], and as long as i takes consecutive values in adjacent threads, you’ll get perfect coalescing.
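Here's a minimal, self-contained sketch of that layout; the Points struct and the scale() kernel are just illustrative names, not anything standard:

#include <cuda_runtime.h>

// Struct-of-arrays layout: three separate device arrays, one per component.
struct Points {
    float *x;
    float *y;
    float *z;
};

__global__ void scale(Points points, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Adjacent threads touch adjacent elements of each array: fully coalesced.
        points.x[i] *= s;
        points.y[i] *= s;
        points.z[i] *= s;
    }
}

int main()
{
    const int n = 1 << 20;
    Points points;

    // Each component gets its own allocation.
    cudaMalloc((void **)&points.x, n * sizeof(float));
    cudaMalloc((void **)&points.y, n * sizeof(float));
    cudaMalloc((void **)&points.z, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(points, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(points.x);
    cudaFree(points.y);
    cudaFree(points.z);
    return 0;
}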