Float type performance comparisons

I have been curious about the performance of the different float types available in CUDA (i.e. float, float1, float2, float3, and float4), so I wrote a simple program to compare them. All the program does is generate a set of random float values, allocate device memory, copy the values from host to device, and then copy the values from the device input buffer to the device output buffer in global memory.

To be able to compare the different float types, I chose a data size that is a common multiple of 2, 3, and 4, and then had each type copy the exact same amount of data. In other words, each thread copies six float2s in my float2 kernel, but only four float3s in my float3 kernel, to move the same amount of data. Here is the code for a few of my kernels:

```
__global__ void float2_kernel(float2 *idata, float2 *odata)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // access 12 spots in memory for each thread, to be able to compare all types
    odata[index].x = idata[index].x;
    odata[index].y = idata[index].y;
    odata[index+FL_STRIDE].x = idata[index+FL_STRIDE].x;
    odata[index+FL_STRIDE].y = idata[index+FL_STRIDE].y;
    odata[index+2*FL_STRIDE].x = idata[index+2*FL_STRIDE].x;
    odata[index+2*FL_STRIDE].y = idata[index+2*FL_STRIDE].y;
    odata[index+3*FL_STRIDE].x = idata[index+3*FL_STRIDE].x;
    odata[index+3*FL_STRIDE].y = idata[index+3*FL_STRIDE].y;
    odata[index+4*FL_STRIDE].x = idata[index+4*FL_STRIDE].x;
    odata[index+4*FL_STRIDE].y = idata[index+4*FL_STRIDE].y;
    odata[index+5*FL_STRIDE].x = idata[index+5*FL_STRIDE].x;
    odata[index+5*FL_STRIDE].y = idata[index+5*FL_STRIDE].y;
}

__global__ void float3_kernel(float3 *idata, float3 *odata)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // access 12 spots in memory for each thread, to be able to compare all types
    odata[index].x = idata[index].x;
    odata[index].y = idata[index].y;
    odata[index].z = idata[index].z;
    odata[index+FL_STRIDE].x = idata[index+FL_STRIDE].x;
    odata[index+FL_STRIDE].y = idata[index+FL_STRIDE].y;
    odata[index+FL_STRIDE].z = idata[index+FL_STRIDE].z;
    odata[index+2*FL_STRIDE].x = idata[index+2*FL_STRIDE].x;
    odata[index+2*FL_STRIDE].y = idata[index+2*FL_STRIDE].y;
    odata[index+2*FL_STRIDE].z = idata[index+2*FL_STRIDE].z;
    odata[index+3*FL_STRIDE].x = idata[index+3*FL_STRIDE].x;
    odata[index+3*FL_STRIDE].y = idata[index+3*FL_STRIDE].y;
    odata[index+3*FL_STRIDE].z = idata[index+3*FL_STRIDE].z;
}

__global__ void float4_kernel(float4 *idata, float4 *odata)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // access 12 spots in memory for each thread, to be able to compare all types
    odata[index].x = idata[index].x;
    odata[index].y = idata[index].y;
    odata[index].z = idata[index].z;
    odata[index].w = idata[index].w;
    odata[index+FL_STRIDE].x = idata[index+FL_STRIDE].x;
    odata[index+FL_STRIDE].y = idata[index+FL_STRIDE].y;
    odata[index+FL_STRIDE].z = idata[index+FL_STRIDE].z;
    odata[index+FL_STRIDE].w = idata[index+FL_STRIDE].w;
    odata[index+2*FL_STRIDE].x = idata[index+2*FL_STRIDE].x;
    odata[index+2*FL_STRIDE].y = idata[index+2*FL_STRIDE].y;
    odata[index+2*FL_STRIDE].z = idata[index+2*FL_STRIDE].z;
    odata[index+2*FL_STRIDE].w = idata[index+2*FL_STRIDE].w;
}
```

FL_STRIDE is just the total number of threads launched for a kernel, and is used to keep the memory accesses coalesced. I am using 384 threads per block (half the maximum number of threads per multiprocessor) and 32 blocks total (2 blocks per multiprocessor). I didn't try to optimize the memory accesses beyond this.
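For reference, here is a minimal sketch of the host-side setup implied above; the macro names, the random-value generation, and the synchronization call are my assumptions, not the original test program:

```
#include <stdlib.h>
#include <cuda_runtime.h>

// Sketch of the host-side setup described above (assumed names, not the
// original program). Each thread copies 12 floats, so the buffers hold
// FL_STRIDE * 12 floats in total.
#define THREADS_PER_BLOCK 384
#define NUM_BLOCKS        32
#define FL_STRIDE         (THREADS_PER_BLOCK * NUM_BLOCKS)  // total thread count

int main()
{
    size_t num_floats = FL_STRIDE * 12;             // 12 floats per thread
    size_t bytes      = num_floats * sizeof(float);

    float *h_idata = (float *)malloc(bytes);
    for (size_t i = 0; i < num_floats; ++i)
        h_idata[i] = rand() / (float)RAND_MAX;      // random input values

    float2 *d_idata, *d_odata;
    cudaMalloc((void **)&d_idata, bytes);
    cudaMalloc((void **)&d_odata, bytes);
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);

    // The buffer holds num_floats / 2 float2 elements; each thread copies 6.
    float2_kernel<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(d_idata, d_odata);
    cudaThreadSynchronize();                        // wait before timing/readback

    cudaFree(d_idata);
    cudaFree(d_odata);
    free(h_idata);
    return 0;
}
```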

When I ran my test and took the average time for each kernel over 100,000 iterations, I found that performance was pretty similar for all of the types, EXCEPT for the float3, which came in about 5X worse:

[quote]
Datatype    Time (ms)
float       0.031844
float1      0.031834
float2      0.030526
float3      0.159957
float4      0.036263
[/quote]

My question is this: **is this the relative performance I should expect from a float3 type compared to the other types?** I looked at the PTX code for each of the kernels and noticed that while the float2 and float4 kernels both use vector load and store instructions (ld.global.v4.f32, for example), there is no ld.global.v3.f32 for the float3 kernel; instead, it uses single-float loads and stores. Perhaps this explains why float2 and float4 perform much better than float3, but it does not explain why the float and float1 types also perform better than float3.
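For anyone who wants to reproduce this check, the PTX can be dumped with nvcc's -ptx flag; the snippet below is my schematic of the load pattern just described, with placeholder register names, not actual compiler output:

```
// Dump the PTX for inspection (file name is an example):
//   nvcc -ptx float_test.cu
//
// Schematic of the pattern described above (operands are placeholders):
//
//   float4 element:  ld.global.v4.f32 {%f1,%f2,%f3,%f4}, [%rd1];
//   float2 element:  ld.global.v2.f32 {%f1,%f2}, [%rd1];
//   float3 element:  ld.global.f32 %f1, [%rd1];     // three separate
//                    ld.global.f32 %f2, [%rd1+4];   // 32-bit loads
//                    ld.global.f32 %f3, [%rd1+8];
```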

I have read elsewhere on the forum that the float3 and float4 types are buggy.  Is this what I am seeing, or does it have more to do with the different sizes of each type?

I am using version 0.9 of CUDA, Visual Studio 8 on Windows XP SP2, Intel Xeon, 8800GTX...

Thanks,

John

p.s.  I found the same results when running the test with int types instead of float types.

int and float types should give the same results for your kernel, since both are 32-bit types.

I believe that with float3 addressing you violate the coalescing restrictions (see the Global Memory subsection of the Performance Guidelines chapter in the Programming Guide). The reason is that float3 is 12 bytes, which is not 4, 8, or 16 bytes as required. So what happens is that each thread issues three separate float reads to fetch the components of a float3. But then thread tid accesses the 32-bit word at address HalfWarpBaseAddress + (3*tid)*sizeof(float); to get coalescing, it would have to be HalfWarpBaseAddress + tid*sizeof(float), which is what you get when testing float/float1. In general, you will get coalescing when float is replaced with any type that is 4, 8, or 16 bytes in size.
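To make that arithmetic concrete, here is a small sketch of the byte offsets touched by the first few threads of a half-warp (my illustration of the formula above, not part of the original reply):

```
#include <stdio.h>

// Byte offset of each thread's first read, relative to the half-warp base:
//   float / float1:   tid * sizeof(float)       ->  0,  4,  8, 12, ... (coalesced)
//   float3 .x read:   (3 * tid) * sizeof(float) ->  0, 12, 24, 36, ... (stride of
//                                                   3 words; breaks coalescing)
int main()
{
    for (int tid = 0; tid < 4; ++tid)
        printf("tid %d: float offset %2d bytes, float3.x offset %2d bytes\n",
               tid, tid * 4, 3 * tid * 4);
    return 0;
}
```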

Let me know if this makes sense. Actually, your example is a good illustration of why coalescing is important.

Paulius

With such a tiny execution time per iteration, the ratio of API overhead to actual compute time can be very high, depending on your OS/CPU/chipset. The problem size should be increased to something more reasonable, so that each iteration takes at least a couple of milliseconds, for a correct memory bandwidth measurement.
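To quantify that point, here is a back-of-the-envelope sketch (my arithmetic, using the float timing from the table above):

```
#include <stdio.h>

int main()
{
    // Each element is read once and written once, so twice the buffer size
    // crosses the memory bus per iteration.
    double bytes     = 384.0 * 32 * 12 * sizeof(float);  // threads * floats/thread
    double time_s    = 0.031844e-3;                      // measured float time above
    double bandwidth = 2.0 * bytes / time_s / 1e9;       // GB/s
    printf("effective bandwidth: %.1f GB/s\n", bandwidth);
    // Prints roughly 37 GB/s -- well below the 8800 GTX's theoretical peak of
    // roughly 86 GB/s, consistent with overhead dominating at this problem size.
    return 0;
}
```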

CUDA SDK 1.0 contains an alignedTypes sample demonstrating memory performance for different types, with and without the `__align__` keyword in the type definition.

The float3 problem is a known issue. Even if a 3-component float struct is defined with `__align__(16)`, the compiler emits a 64-bit plus a 32-bit global load instead of a single 128-bit global load, which makes float3 accesses non-coalescable. The workaround is to use 4-component structures.
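A minimal sketch of that workaround, padding the struct to four components; the name padded_float3 and the extra pad field are illustrative, not taken from the SDK sample:

```
// Pad the 3-component struct to 16 bytes so each element can be moved with a
// single 128-bit load/store (names are illustrative).
struct __align__(16) padded_float3
{
    float x, y, z;
    float pad;   // unused; only makes sizeof(padded_float3) == 16
};

__global__ void padded_float3_kernel(padded_float3 *idata, padded_float3 *odata)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    odata[index] = idata[index];   // whole-struct copy; 16-byte aligned access
}
```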