I have been curious about the performance of the different types of floats that are available with CUDA (i.e. float, float1, float2, float3, and float4), and so I have written a simple program to compare performance of the different types. Basically all my program does is create a bunch of random float values, allocate device memory, copy values from host to device, then copy values from device input to device output in global memory.
To be able to compare the different float types, I chose a data size that was a common multiple of 2, 3, and 4, and then had each type copy over the exact same amount of data. In other words, I use 6 float2s to copy over data in my float2kernel, but only 4 float3s to copy over the same size data in my float3kernel. Here is the code for a few of my kernels:
```
__global__ void float2_kernel(float2 *idata, float2 *odata)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // access 12 spots in memory for each thread, to be able to compare all types
    odata[index].x = idata[index].x;
    odata[index].y = idata[index].y;
    odata[index + FL_STRIDE].x = idata[index + FL_STRIDE].x;
    odata[index + FL_STRIDE].y = idata[index + FL_STRIDE].y;
    odata[index + 2*FL_STRIDE].x = idata[index + 2*FL_STRIDE].x;
    odata[index + 2*FL_STRIDE].y = idata[index + 2*FL_STRIDE].y;
    odata[index + 3*FL_STRIDE].x = idata[index + 3*FL_STRIDE].x;
    odata[index + 3*FL_STRIDE].y = idata[index + 3*FL_STRIDE].y;
    odata[index + 4*FL_STRIDE].x = idata[index + 4*FL_STRIDE].x;
    odata[index + 4*FL_STRIDE].y = idata[index + 4*FL_STRIDE].y;
    odata[index + 5*FL_STRIDE].x = idata[index + 5*FL_STRIDE].x;
    odata[index + 5*FL_STRIDE].y = idata[index + 5*FL_STRIDE].y;
}
__global__ void float3_kernel(float3 *idata, float3 *odata)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // access 12 spots in memory for each thread, to be able to compare all types
    odata[index].x = idata[index].x;
    odata[index].y = idata[index].y;
    odata[index].z = idata[index].z;
    odata[index + FL_STRIDE].x = idata[index + FL_STRIDE].x;
    odata[index + FL_STRIDE].y = idata[index + FL_STRIDE].y;
    odata[index + FL_STRIDE].z = idata[index + FL_STRIDE].z;
    odata[index + 2*FL_STRIDE].x = idata[index + 2*FL_STRIDE].x;
    odata[index + 2*FL_STRIDE].y = idata[index + 2*FL_STRIDE].y;
    odata[index + 2*FL_STRIDE].z = idata[index + 2*FL_STRIDE].z;
    odata[index + 3*FL_STRIDE].x = idata[index + 3*FL_STRIDE].x;
    odata[index + 3*FL_STRIDE].y = idata[index + 3*FL_STRIDE].y;
    odata[index + 3*FL_STRIDE].z = idata[index + 3*FL_STRIDE].z;
}
__global__ void float4_kernel(float4 *idata, float4 *odata)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // access 12 spots in memory for each thread, to be able to compare all types
    odata[index].x = idata[index].x;
    odata[index].y = idata[index].y;
    odata[index].z = idata[index].z;
    odata[index].w = idata[index].w;
    odata[index + FL_STRIDE].x = idata[index + FL_STRIDE].x;
    odata[index + FL_STRIDE].y = idata[index + FL_STRIDE].y;
    odata[index + FL_STRIDE].z = idata[index + FL_STRIDE].z;
    odata[index + FL_STRIDE].w = idata[index + FL_STRIDE].w;
    odata[index + 2*FL_STRIDE].x = idata[index + 2*FL_STRIDE].x;
    odata[index + 2*FL_STRIDE].y = idata[index + 2*FL_STRIDE].y;
    odata[index + 2*FL_STRIDE].z = idata[index + 2*FL_STRIDE].z;
    odata[index + 2*FL_STRIDE].w = idata[index + 2*FL_STRIDE].w;
}
```
FL_STRIDE is just the total number of threads launched for each kernel, and is used to keep the memory accesses coalesced. I am using 384 threads per block (half the maximum of 768 threads per multiprocessor) and 32 blocks total (2 blocks per multiprocessor). I didn't try to optimize the code for memory accesses beyond this.
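For reference, the setup I'm describing would look roughly like this (a sketch; `d_idata` and `d_odata` are just my shorthand for the device buffers, not the actual variable names in my program):

```cuda
// 384 threads/block x 32 blocks = 12288 threads total, so FL_STRIDE = 12288.
#define THREADS_PER_BLOCK 384
#define NUM_BLOCKS        32
#define FL_STRIDE (THREADS_PER_BLOCK * NUM_BLOCKS)  // total thread count

// Each float2 thread copies 6 float2s (12 floats), so the buffers hold
// 6 * FL_STRIDE float2 elements in total.
float2_kernel<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(d_idata, d_odata);
```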
When I ran my test and took the average time for each kernel over 100,000 iterations, I found that performance was pretty similar for all of the types, EXCEPT for the float3, which came in about 5X worse:
[quote]
Datatype   Time (ms)
float      0.031844
float1     0.031834
float2     0.030526
float3     0.159957
float4     0.036263
[/quote]
My question is this: <b>is this the relative performance I should expect from a float3 type when compared to the other types?</b> I looked at the PTX code for each of the kernels and noticed that while the float2 and float4 kernels both get vector loads and stores (ld.global.v2.f32 and ld.global.v4.f32, for example), there is no ld.global.v3.f32 for the float3 kernel; it falls back to scalar float loads and stores. Perhaps that would explain why float2 and float4 perform much better than float3, but it does not explain why the float and float1 types also outperform float3.
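If it is the missing vector load that hurts, one workaround I can think of (sketch only, untested, and assuming FL_STRIDE is the total thread count as in my kernels) is to copy the float3 data through a flat float view, so that consecutive threads still touch consecutive 4-byte words:

```cuda
// Hypothetical workaround: view the float3 buffers as flat float arrays so
// the scalar accesses stay coalesced (consecutive threads, consecutive words).
__global__ void float3_as_float_kernel(float3 *idata, float3 *odata)
{
    float *in  = (float *)idata;
    float *out = (float *)odata;
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Same 12 floats per thread as the other kernels, just addressed flat:
    // the float3 buffers hold 4 * FL_STRIDE elements = 12 * FL_STRIDE floats.
    for (int i = 0; i < 12; ++i)
        out[index + i * FL_STRIDE] = in[index + i * FL_STRIDE];
}
```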
I have read elsewhere on the forum that the float3 and float4 types are buggy. Is this what I am seeing, or does it have more to do with the different sizes of each type?
I am using version 0.9 of CUDA, Visual Studio 8 on Windows XP SP2, Intel Xeon, 8800GTX...
Thanks,
John
p.s. I found the same results when running the test with int types instead of float types.