Performance Question

Hi all,

So I have two kernels running on the same data, but organised in different ways. To put it simply, these trivial kernels just add four vectors together.

  1. Four float arrays
__global__ void addKernelfloats(float *out, float *a, float *b, float *c, float *d)
{
	// Grid-stride loop: each thread handles every totalDim-th element.
	int pos = threadIdx.x + blockIdx.x * blockDim.x;
	int totalDim = blockDim.x * gridDim.x;

	while(pos < ARRAYSIZE){
		out[pos] = a[pos] + b[pos] + c[pos] + d[pos];
		pos += totalDim;
	}
}
  2. One array of float4’s, using shared memory to achieve coalescing.
#define THREADCOUNT 128

__global__ void addKernelstructs(float *out, float *indata)
{
	// Staging buffer: four floats per thread, filled with coalesced global loads.
	__shared__ float sdata[THREADCOUNT*4];

	int valBlock = blockIdx.x * blockDim.x;
	int outpos = threadIdx.x + valBlock;        // output index: one float per thread
	int pos = (valBlock * 4) + threadIdx.x;     // input index into the interleaved data
	int totalDim = blockDim.x * gridDim.x;
	int totalDimQuad = totalDim * 4;

	// Each thread writes four strided slots of shared memory...
	float *num1 = &sdata[threadIdx.x];
	float *num2 = &sdata[threadIdx.x + THREADCOUNT];
	float *num3 = &sdata[threadIdx.x + (THREADCOUNT*2)];
	float *num4 = &sdata[threadIdx.x + (THREADCOUNT*3)];

	// ...and reads back one float4 that was (partly) written by other threads.
	float4 *a = &((float4*)sdata)[threadIdx.x];

	while(outpos < ARRAYSIZE){
		(*num1) = indata[pos];
		(*num2) = indata[pos + THREADCOUNT];
		(*num3) = indata[pos + (THREADCOUNT*2)];
		(*num4) = indata[pos + (THREADCOUNT*3)];

		__syncthreads();

		out[outpos] = a->w + a->x + a->y + a->z;

		outpos += totalDim;
		pos += totalDimQuad;
	}
}
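
For clarity on "the same data, organised differently": kernel 1 reads four separate arrays, while kernel 2 reads one interleaved stream in which the four values for element i sit next to each other (effectively an array of float4). A host-side packing step along these lines is assumed (the helper name is just for illustration):

// Hypothetical host-side helper: interleave four separate arrays into the
// single array that addKernelstructs expects, so that element i occupies
// indata[4*i .. 4*i+3] (i.e. one float4 per element).
void packAsFloat4(const float *a, const float *b, const float *c,
                  const float *d, float *indata, int n)
{
	for (int i = 0; i < n; ++i) {
		indata[4*i + 0] = a[i];
		indata[4*i + 1] = b[i];
		indata[4*i + 2] = c[i];
		indata[4*i + 3] = d[i];
	}
}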

Now, I have tested both kernels and they work correctly (10M+ values), but what I want to know is… why the hell is kernel 2 always faster than kernel 1? I have tried playing with block and thread dimensions, and even using one single float array in kernel 1 for all four vectors. No matter what I do, kernel 2 is still faster. Any ideas why? Intuitively I would have thought kernel 1 would be faster.

Also note that while kernel 2 is faster, the performance difference is tiny, so this isn’t too big of a deal; I am just curious. Am I doing something wrong?

Also: I’m using one GPU of my GTX 590.
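
For reference, the comparison above can be made per kernel with CUDA events, roughly like this (a minimal sketch; the block count, the warm-up launch, and the device pointer names are assumptions rather than the exact test code):

// Minimal CUDA-event timing sketch. d_out/d_a/d_b/d_c/d_d are assumed to be
// device pointers already allocated with cudaMalloc and filled with input data.
float timeKernel1(float *d_out, float *d_a, float *d_b, float *d_c, float *d_d)
{
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	int blocks = 64;   // arbitrary: the grid-stride loop covers any remainder

	// Warm-up launch so the timed launch doesn't pay one-off startup costs.
	addKernelfloats<<<blocks, THREADCOUNT>>>(d_out, d_a, d_b, d_c, d_d);

	cudaEventRecord(start);
	addKernelfloats<<<blocks, THREADCOUNT>>>(d_out, d_a, d_b, d_c, d_d);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return ms;   // kernel time in milliseconds
}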

Hi,

I have tested the two kernels and I get better performance on kernel 1 :D. As you said, it is just a tiny difference.

On the CPU side, are you launching kernel 1 first, followed by kernel 2?

If you want a slightly faster solution, you could use the same concept as in kernel 1 but pass in a single float4* instead of four float* pointers:

__global__ void addKernelf4(float *out, float4 *a)
{
	int pos = threadIdx.x + blockIdx.x * blockDim.x;
	int totalDim = blockDim.x * gridDim.x;

	while(pos < ARRAYSIZE){
		// One 16-byte float4 load per thread; adjacent threads read adjacent float4s.
		out[pos] = a[pos].x + a[pos].y + a[pos].z + a[pos].w;
		pos += totalDim;
	}
}

I got better performance with the kernel above.

I profiled all three kernels with the Visual Profiler running on a GTX 480.

The limiting factor for these kernels is memory bandwidth.
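
For reference, every variant moves the same traffic per output element (four float reads plus one float write, i.e. 20 bytes), so a rough effective-bandwidth figure to compare against the card’s peak can be computed like this (a small sketch; the measured time is assumed to come from events or the profiler):

// Rough effective bandwidth for any of the kernels above:
// each output element costs 4 float reads + 1 float write = 20 bytes.
// 'elapsedMs' is an assumed measured kernel time (events or profiler).
double effectiveBandwidthGBs(double elapsedMs, long long arraySize)
{
	double bytes = 20.0 * (double)arraySize;
	double seconds = elapsedMs * 1e-3;
	return (bytes / seconds) * 1e-9;   // GB/s; compare to the GPU's peak memory bandwidth
}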

Hi, thanks for testing. I ended up playing with the third kernel, and here is the fastest one I could come up with.

__global__ void addKernelfloat4(float *out, float4 *data)
{
	__shared__ float4 sdata[THREADCOUNT];

	int pos = threadIdx.x + blockIdx.x * blockDim.x;
	int totalDim = blockDim.x * gridDim.x;
	float4 *a = &sdata[threadIdx.x];    // each thread only touches its own slot

	while(pos < ARRAYSIZE){
		(*a) = data[pos];               // stage the float4 through shared memory
		__syncthreads();
		out[pos] = a->w + a->x + a->y + a->z;
		pos += totalDim;
	}
}

Here is the interesting thing: the kernel above runs fine without the __syncthreads() call, but if I add it, I get a small performance increase. I was under the impression that you should avoid __syncthreads() where possible. Any ideas why adding it would improve performance for me?

With the threads kept in sync, you might have a slightly higher chance that consecutive reads and writes hit a DRAM row that is already open.