Hi everyone, I wrote a fairly simple kernel which aims at applying a 512-float filter column-wise to a 512 (columns) x 2048 (rows) float input matrix.

So the result is a 512 x (2048-512) matrix (i.e., there are 512 x 1536 dot products of 512-element vectors). I have to compute all of that in less than 5 ms.

The filter (a 512-float vector) lives in constant memory.

The input data matrix is a 2D texture.

The result is written to global memory.

This was my very simple original kernel :

```
__global__ void FIR_Kernel(float* d_RFResult)
{
    // Each thread handles one column (channel) and NBCALC_PER_THREAD output rows.
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    for (unsigned int i = 0; i < NBCALC_PER_THREAD; i++)
    {
        float res = 0.0f;
        // Dot product of the filter with a 512-sample window of this column.
        for (unsigned int j = 0; j < FILTER_KERNEL_SIZE; j++)
        {
            res += filter[j] * tex2D(texRFData, x, yBase + i + j);
        }
        d_RFResult[x + (yBase + i) * NB_CHANNELS] = res;
    }
}
```

This code takes 30 ms.

I got down to 15 ms by packing the floats into float4 structures.
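For reference, the float4 variant looks roughly like this (a sketch rather than my exact code; the packed texture `texRFData4` and the `/4` indexing are assumptions I'm spelling out, since 512 channels pack into 128 float4 columns):

```cuda
// Sketch of the float4-packed variant: each thread now produces one
// float4 result, so a single 128-bit store replaces four 32-bit stores.
__global__ void FIR_Kernel_f4(float4* d_RFResult4)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x; // float4 column index
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    for (unsigned int i = 0; i < NBCALC_PER_THREAD; i++)
    {
        float4 res = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        for (unsigned int j = 0; j < FILTER_KERNEL_SIZE; j++)
        {
            float4 v = tex2D(texRFData4, x, yBase + i + j); // packed fetch
            float  c = filter[j];
            res.x += c * v.x;
            res.y += c * v.y;
            res.z += c * v.z;
            res.w += c * v.w;
        }
        d_RFResult4[x + (yBase + i) * (NB_CHANNELS / 4)] = res; // one 128-bit store
    }
}
```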

There’s still a factor of 3 to gain.

90% of the GPU time is spent in this memory transfer, which seems normal since it is the global memory write:

`d_RFResult[x + (yBase + i) * NB_CHANNELS] = res;`

I’m not sure this global memory write is coalesced. I think it is, since consecutive threads in the same warp write to consecutive memory locations, and since d_RFResult is a float* pointer it should be properly aligned. But when I tried writing to random locations in d_RFResult (global memory), it took the same time.
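To isolate the effect, I timed stores in a tiny stand-alone program along these lines (a sketch, not my production code; `n` and the stride are arbitrary). If the two kernels come out at the same time, the write pattern really isn't the issue on this card:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Consecutive threads write consecutive addresses: should coalesce.
__global__ void store_coalesced(float* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Scatter each warp's addresses with a stride: should NOT coalesce.
__global__ void store_strided(float* out, unsigned int n, unsigned int stride)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[(i * stride) % n] = 1.0f;
}

int main()
{
    const unsigned int n = 1 << 22;
    float* d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    float ms;

    cudaEventRecord(beg);
    store_coalesced<<<n / 256, 256>>>(d_out);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);
    printf("coalesced: %f ms\n", ms);

    cudaEventRecord(beg);
    store_strided<<<n / 256, 256>>>(d_out, n, 33);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);
    printf("strided  : %f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```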

I have to admit that, despite the documentation, in practice I find it hard to tell whether reads and writes are really coalesced or not.

By packing the data into aligned 128-bit types, it’s much faster because of the 1 x 128-bit write instead of 4 x 32-bit writes (nearly a 4x factor). I’m okay with that, but this holds for a single thread and doesn’t demonstrate the so-called coalesced access, which should involve many threads of the same warp.
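To make the distinction concrete, here is how I picture the two effects side by side (a sketch with made-up kernel names):

```cuda
// Per-thread vectorization: one thread issues a single 128-bit store
// instead of four 32-bit stores. This helps even for a lone thread.
__global__ void write_vec4(float4* out)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // 1 x 128-bit store
}

// Warp-level coalescing: 32 scalar stores from one warp merge into a
// few wide memory transactions because the addresses are consecutive.
__global__ void write_scalar(float* out)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 1.0f;  // coalesced across the warp, not within one thread
}
```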

How would you accelerate `d_RFResult[x + (yBase + i) * NB_CHANNELS] = res;`

other than using float4?

Thanks