CUDA, performance evaluation

Hi all,

I did some performance evaluations of sorting and log2 on the GPU for my final MSc thesis. These results helped me see what I could implement in CUDA to improve the performance of my program. I think they could be useful to some of you, so I'm posting my results here:

sorts using a GeForce 8800 GTS:

http://www.ab2patrimoine.be/adrien/upload/…/graphsort1.jpg

sorts using a GeForce 8600M GS:

http://www.ab2patrimoine.be/adrien/upload/…/graphsort2.jpg

log2 CPU vs log2 GPU using a GeForce 8800 GTS:

http://www.ab2patrimoine.be/adrien/upload/…s/log2-8800.jpg

log2 CPU vs log2 GPU using a GeForce 8600M GS:

http://www.ab2patrimoine.be/adrien/upload/…ers/log2low.jpg

log2 with coalesced memory vs log2 with uncoalesced memory:

http://www.ab2patrimoine.be/adrien/upload/…escedmemory.jpg

I don’t understand why the difference between coalesced and uncoalesced memory access is so small (in GPU Gems 3 and other papers, the difference is impressive)… Does anyone have an explanation?

The code used to compare uncoalesced vs. coalesced access:

// Perform log2 on a float3 vector.
// Uncoalesced version: each thread loads one float3 (12 bytes) straight from
// global memory, so consecutive threads access addresses 12 bytes apart.
__global__ void d_log2_uncoalesced(float3* d_data)
{
	uint i = blockDim.x * blockIdx.x + threadIdx.x;

	float3 data = d_data[i];
	data.x = log2f(data.x);
	data.y = log2f(data.y);
	data.z = log2f(data.z);
	d_data[i] = data;
}

// Coalesced version: stage three blockDim.x-sized stripes of floats through
// shared memory, so consecutive threads read consecutive 4-byte words from
// global memory, then operate on the data as float3 in shared memory.
__global__ void d_log2_coalesced(float* d_data)
{
	int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
	__shared__ float s_data[BLOCK_DIM * 3];

	// Coalesced loads: stride-1 access within each of the three passes.
	s_data[threadIdx.x]                  = d_data[index];
	s_data[threadIdx.x + blockDim.x]     = d_data[index + blockDim.x];
	s_data[threadIdx.x + blockDim.x * 2] = d_data[index + blockDim.x * 2];
	__syncthreads();

	float3 data = ((float3*)s_data)[threadIdx.x];
	data.x = log2f(data.x);
	data.y = log2f(data.y);
	data.z = log2f(data.z);
	((float3*)s_data)[threadIdx.x] = data;
	__syncthreads();

	// Coalesced stores, mirroring the loads.
	d_data[index]                  = s_data[threadIdx.x];
	d_data[index + blockDim.x]     = s_data[threadIdx.x + blockDim.x];
	d_data[index + blockDim.x * 2] = s_data[threadIdx.x + blockDim.x * 2];
}

The coalesced version seems to be optimal…

That very much depends on your constraints (I would guess you specifically want to work on float3 because what you actually want to do is not independent for x, y and z), but in general

__global__ void d_log2_coalesced(float* d_data)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	d_data[index] = log2f(d_data[index]);
}

should be far better and gives the same result (of course you must adjust the grid dimension accordingly).

Yes, you are right, but the code was written to compare coalesced vs. uncoalesced access… That's why I read elements 3 by 3 (12 bytes) instead of in 4-, 8-, or 16-byte chunks.

To get the fastest function, I would have implemented:

__global__ void d_log2(float* d_data, uint nbElemLign)
{
	uint i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < nbElemLign)
	{
		float4 data = ((float4*)d_data)[i];

		data.x = log2f(data.x);
		data.y = log2f(data.y);
		data.z = log2f(data.z);
		data.w = log2f(data.w);

		((float4*)d_data)[i] = data;
	}
}

which is, in my view, the fastest way to apply the log2 function to a vector.

Except that float4 reads (even coalesced ones) are slow on compute 1.0 hardware…