CUDA, performance evaluation

Hi all,

I did some performance evaluations of sorting and log2 on the GPU for my final MSc thesis. These results helped me see what I could implement in CUDA to improve the performance of my program. I think they could be useful to some of you, so I'm posting my results here:

sorts using a GeForce 8800 GTS:

http://www.ab2patrimoine.be/adrien/upload/…/graphsort1.jpg

sorts using a GeForce 8600M GS:

http://www.ab2patrimoine.be/adrien/upload/…/graphsort2.jpg

log2 CPU vs log2 GPU using a GeForce 8800 GTS:

http://www.ab2patrimoine.be/adrien/upload/…s/log2-8800.jpg

log2 CPU vs log2 GPU using a GeForce 8600M GS:

http://www.ab2patrimoine.be/adrien/upload/…ers/log2low.jpg

log2 with coalesced memory vs log2 with uncoalesced memory:

http://www.ab2patrimoine.be/adrien/upload/…escedmemory.jpg

I don’t understand why the difference between coalesced and uncoalesced memory access is so small (in GPU Gems 3 and other papers, the difference is impressive)… Does anyone have an explanation?

The code used to compare uncoalesced vs. coalesced access:

// Perform log2 on a float3 vector.
// Uncoalesced version: each thread loads one float3 (12 bytes) straight from
// global memory, so consecutive threads access addresses 12 bytes apart.
__global__ void d_log2_uncoalesced(float3* d_data)
{
	uint i = blockDim.x * blockIdx.x + threadIdx.x;

	float3 data = d_data[i];
	data.x = log2f(data.x);
	data.y = log2f(data.y);
	data.z = log2f(data.z);
	d_data[i] = data;
}

// Coalesced version: stage three blockDim.x-sized stripes of floats through
// shared memory, so consecutive threads read consecutive 4-byte words from
// global memory, then operate on the data as float3 in shared memory.
__global__ void d_log2_coalesced(float* d_data)
{
	int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
	__shared__ float s_data[BLOCK_DIM * 3];

	// Coalesced loads: stride-1 access within each of the three passes.
	s_data[threadIdx.x]                  = d_data[index];
	s_data[threadIdx.x + blockDim.x]     = d_data[index + blockDim.x];
	s_data[threadIdx.x + blockDim.x * 2] = d_data[index + blockDim.x * 2];
	__syncthreads();

	float3 data = ((float3*)s_data)[threadIdx.x];
	data.x = log2f(data.x);
	data.y = log2f(data.y);
	data.z = log2f(data.z);
	((float3*)s_data)[threadIdx.x] = data;
	__syncthreads();

	// Coalesced stores, mirroring the loads.
	d_data[index]                  = s_data[threadIdx.x];
	d_data[index + blockDim.x]     = s_data[threadIdx.x + blockDim.x];
	d_data[index + blockDim.x * 2] = s_data[threadIdx.x + blockDim.x * 2];
}

The coalesced version seems to be optimal…

That very much depends on your constraints (I would guess you specifically want to work on float3 because what you actually want to do is not independent for x, y and z), but in general

__global__ void d_log2_coalesced(float* d_data)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	d_data[index] = log2f(d_data[index]);
}

should be far better and gives the same result (of course you must adjust the grid dimension accordingly).

Yes, you are right, but the code was written to compare coalesced vs. uncoalesced access… That's why I read elements 3 by 3 (12 bytes) instead of in 4-, 8-, or 16-byte chunks.

To get the fastest function, I would have implemented:

__global__ void d_log2(float* d_data, uint nbElemLign)
{
	uint i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < nbElemLign)
	{
		float4 data = ((float4*)d_data)[i];

		data.x = log2f(data.x);
		data.y = log2f(data.y);
		data.z = log2f(data.z);
		data.w = log2f(data.w);

		((float4*)d_data)[i] = data;
	}
}

which is, in my view, the fastest way to apply the log2 function to a vector.

Except that float4 reads (even coalesced ones) are slow on compute 1.0 hardware…