Hi all,

So I have two kernels running on the same data, but organised in different ways. To put it simply, these trivial kernels just add four vectors together.

- Four float arrays

```
__global__ void addKernelfloats(float *out, float *a, float *b, float *c, float *d)
{
    // Grid-stride loop: each thread handles elements pos, pos + totalDim, ...
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    int totalDim = blockDim.x * gridDim.x;
    while (pos < ARRAYSIZE) {
        out[pos] = a[pos] + b[pos] + c[pos] + d[pos];
        pos += totalDim;
    }
}
```
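In case it helps, here's roughly the kind of host-side timing I'm doing. This is a sketch only: `ARRAYSIZE`, the grid size of 64 blocks, and the `cudaEvent` timing are illustrative, not my exact harness, and it assumes the kernel above is in the same file. Error checking is omitted for brevity.

```
#include <cstdio>
#include <cuda_runtime.h>

#define ARRAYSIZE (1 << 24)   // illustrative element count, not my exact size
#define THREADCOUNT 128

// Assumes addKernelfloats (as defined above) is in this translation unit.

int main()
{
    size_t bytes = ARRAYSIZE * sizeof(float);
    float *d_out, *d_a, *d_b, *d_c, *d_d;
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMalloc(&d_d, bytes);

    // Time the kernel with CUDA events so we measure GPU time only.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    addKernelfloats<<<64, THREADCOUNT>>>(d_out, d_a, d_b, d_c, d_d);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("addKernelfloats: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_d);
    return 0;
}
```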

- One array of float4s, staged through shared memory to achieve coalesced loads.

```
#define THREADCOUNT 128
__global__ void addKernelstructs(float *out, float *indata)
{
    __shared__ float sdata[THREADCOUNT * 4];
    int valBlock = blockIdx.x * blockDim.x;
    int outpos = threadIdx.x + valBlock;
    int pos = (valBlock * 4) + threadIdx.x;
    int totalDim = blockDim.x * gridDim.x;
    int totalDimQuad = totalDim * 4;
    float *num1 = &sdata[threadIdx.x];
    float *num2 = &sdata[threadIdx.x + THREADCOUNT];
    float *num3 = &sdata[threadIdx.x + (THREADCOUNT * 2)];
    float *num4 = &sdata[threadIdx.x + (THREADCOUNT * 3)];
    float4 *a = &((float4 *)sdata)[threadIdx.x];
    while (outpos < ARRAYSIZE) {
        // Stage four coalesced loads into shared memory.
        *num1 = indata[pos];
        *num2 = indata[pos + THREADCOUNT];
        *num3 = indata[pos + (THREADCOUNT * 2)];
        *num4 = indata[pos + (THREADCOUNT * 3)];
        __syncthreads();
        // Each thread reads a contiguous float4 written by other threads.
        out[outpos] = a->w + a->x + a->y + a->z;
        // Second barrier: without it, the next iteration's writes to sdata
        // can race with another thread still reading this iteration's data.
        __syncthreads();
        outpos += totalDim;
        pos += totalDimQuad;
    }
}
```
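For what it's worth, here's a third variant I sketched but haven't benchmarked: loading `float4` directly without the shared-memory staging. This assumes `indata` is 16-byte aligned (which `cudaMalloc` guarantees for the start of the allocation), in which case each thread issues a single 16-byte vector load.

```
// Hypothetical variant: direct float4 loads, no shared memory.
// Here indata is typed as float4*, so ARRAYSIZE counts float4 elements,
// matching the output array length as in the kernels above.
__global__ void addKernelfloat4(float *out, float4 *indata)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    int totalDim = blockDim.x * gridDim.x;
    while (pos < ARRAYSIZE) {
        float4 v = indata[pos];  // one 16-byte load per thread
        out[pos] = v.x + v.y + v.z + v.w;
        pos += totalDim;
    }
}
```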

Now, I have tested both kernels and they both produce correct results (on 10M+ values), but what I want to know is… why the hell is kernel 2 always faster than kernel 1? I have tried playing with block and thread dimensions, and even using a single float array in kernel 1 for all four vectors. No matter what I do, kernel 2 is still faster. Any ideas why? Intuitively, I would have thought kernel 1 would be faster.

Also note that while kernel 2 is faster, the performance difference is tiny, so this isn't too big of a deal; I'm just interested. Am I doing something wrong?

Also: I'm using a single GPU of my GTX 590.