Hi everyone, I wrote a pretty simple kernel which aim at applying a 512 floats long filter column wise to a 512(columns) x 2048 (rows) floats input matrix data.
So the result is a 512 x (2048-512) matrix (ie : there is 512 x 1536 multiplications of vectors or 512 elements). I have to compute all that stuff in less than 5 ms.
The filter (a 512 floats vector) stands in the constant memory
The input data matrix is a 2D texture
The result is written in global memory
This was my very simple original kernel :
__global__ void FIR_Kernel(float* d_RFResult)
{
float res;
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int yBase = blockIdx.y*NBCALC_PER_THREAD;
for(unsigned int i = 0; i < NBCALC_PER_THREAD; i++)
{
res = 0.0f;
for(unsigned int j = 0; j < FILTER_KERNEL_SIZE; j++)
{
res += filter[j] * tex2D(texRFData, x,yBase+i+j);
}
d_RFResult[x +(yBase+i)*NB_CHANNELS] = res;
}
}
This code takes 30ms.
I reached 15 ms by packing up every float in float4 structure.
There’s a 3 factor stille to gain.
90% of the GPU time is spent in this memory transfert, which seems normal as it :
d_RFResult[x +(yBase+i)*NB_CHANNELS] = res;
I’m not sure this global memory write is coalesce, I think so as consequent threads in the same warp write in consecutive memory locations, and as d_RFResult is a float * pointer, it should be aligned, but I tried to write at random places in d_RFResult(global memory) and it took the same time.
I have to admit that dispite the documentation, in practice, I find it hard to know it reads and write are really coalesce or not.
By packing data in aligned types of 128 bits, it’s much faster because of the 1128 write instead of 432 (by near a *4 factor). I’m okay whith that, but it’s true for a single thread and does not show the so called coalesced state, which should involve many threads in the same warp.
How would you accelerate : d_RFResult[x +(yBase+i)*NB_CHANNELS] = res;
other than dealing with float4 ?
Thanks