Hi all,
I did some performance evaluations of sorts and log2 on the GPU for my final MSc thesis. These results helped me see what I can implement in CUDA to improve the performance of my program. I think they could be useful to some of you, so I am posting them here:
sorts using a GeForce 8800 GTS :
http://www.ab2patrimoine.be/adrien/upload/…/graphsort1.jpg
sorts using a GeForce 8600m GS :
http://www.ab2patrimoine.be/adrien/upload/…/graphsort2.jpg
log2 CPU vs log2 GPU using a GeForce 8800 GTS :
http://www.ab2patrimoine.be/adrien/upload/…s/log2-8800.jpg
log2 CPU vs log2 GPU using a GeForce 8600m GS :
http://www.ab2patrimoine.be/adrien/upload/…ers/log2low.jpg
log2 coalesced memory vs log2 uncoalesced memory :
http://www.ab2patrimoine.be/adrien/upload/…escedmemory.jpg
I don’t understand why the difference between coalesced and uncoalesced memory access is so small (in GPU Gems 3 and other papers, the difference is impressive)… Does anyone have an explanation?
The code used to compare uncoalesced vs. coalesced access:
// Perform log2 on each component of a float3 vector.
// Each thread reads and writes one float3 directly from global memory;
// the 12-byte-per-thread accesses cannot be coalesced on this hardware.
__global__ void d_log2_uncoalesced (float3* d_data)
{
    uint i = blockDim.x * blockIdx.x + threadIdx.x;
    float3 data = d_data[i];   // uncoalesced 12-byte load
    data.x = log2f(data.x);
    data.y = log2f(data.y);
    data.z = log2f(data.z);
    d_data[i] = data;          // uncoalesced 12-byte store
}
// Same computation, but the float3 array is staged through shared
// memory using three coalesced float loads/stores per thread.
// BLOCK_DIM is the block size (threads per block).
__global__ void d_log2_coalesced (float* d_data)
{
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float s_data[BLOCK_DIM * 3];

    // Coalesced loads: consecutive threads read consecutive floats.
    s_data[threadIdx.x]                  = d_data[index];
    s_data[threadIdx.x + blockDim.x]     = d_data[index + blockDim.x];
    s_data[threadIdx.x + blockDim.x * 2] = d_data[index + blockDim.x * 2];
    __syncthreads();

    // Reinterpret shared memory as float3 and do the computation.
    float3 data = ((float3*)s_data)[threadIdx.x];
    data.x = log2f(data.x);
    data.y = log2f(data.y);
    data.z = log2f(data.z);
    ((float3*)s_data)[threadIdx.x] = data;
    __syncthreads();

    // Coalesced stores back to global memory.
    d_data[index]                  = s_data[threadIdx.x];
    d_data[index + blockDim.x]     = s_data[threadIdx.x + blockDim.x];
    d_data[index + blockDim.x * 2] = s_data[threadIdx.x + blockDim.x * 2];
}
The coalesced version seems optimal to me, yet the measured gain is small…