How would you optimise this simple kernel? It takes too long...

Hi everyone, I wrote a pretty simple kernel which aims at applying a 512-float-long filter column-wise to a 512 (columns) x 2048 (rows) float input data matrix.

So the result is a 512 x (2048-512) matrix (i.e. there are 512 x 1536 dot products of 512-element vectors). I have to compute all that in less than 5 ms.

The filter (a 512-float vector) is stored in constant memory.

The input data matrix is a 2D texture.

The result is written to global memory.
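
For reference, the declarations the kernel relies on look roughly like this (a sketch; the sizes come from the description above, and the NBCALC_PER_THREAD value is only an example):

#define NB_CHANNELS        512   // columns of the input matrix
#define FILTER_KERNEL_SIZE 512   // length of the filter
#define NBCALC_PER_THREAD  16    // output rows handled per thread (example value)

__constant__ float filter[FILTER_KERNEL_SIZE];          // filter in constant memory

texture<float, 2, cudaReadModeElementType> texRFData;   // input matrix bound to a 2D texture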

This was my very simple original kernel:

__global__ void FIR_Kernel(float* d_RFResult)
{
    float res;

    // one thread per column (channel) of the input matrix
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;

    // each block in y covers NBCALC_PER_THREAD output rows
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    for (unsigned int i = 0; i < NBCALC_PER_THREAD; i++)
    {
        res = 0.0f;

        // dot product of the filter with a 512-sample column segment
        for (unsigned int j = 0; j < FILTER_KERNEL_SIZE; j++)
        {
            res += filter[j] * tex2D(texRFData, x, yBase + i + j);
        }

        d_RFResult[x + (yBase+i)*NB_CHANNELS] = res;
    }
}

This code takes 30 ms.

I got it down to 15 ms by packing the floats into float4 structures.

There is still a factor of 3 to gain.

90% of the GPU time is spent in this memory transfer, which seems normal:

d_RFResult[x +(yBase+i)*NB_CHANNELS] = res;

I’m not sure this global memory write is coalesced. I think it is, since consecutive threads in the same warp write to consecutive memory locations, and since d_RFResult is a float* pointer it should be aligned. But I tried writing to random places in d_RFResult (global memory) and it took the same time.

I have to admit that despite the documentation, in practice I find it hard to know whether reads and writes are really coalesced or not.

By packing the data into aligned 128-bit types it is much faster, because of the single 128-bit write instead of four 32-bit writes (hence the near x4 factor). I’m okay with that, but it is true for a single thread and does not show the so-called coalesced behaviour, which should involve many threads in the same warp.
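
To illustrate the packing idea, here is a rough sketch (not the exact code) where one thread covers four adjacent columns and writes them with a single 128-bit store:

__global__ void FIR_Kernel_f4(float4* d_RFResult4)
{
    // x4 indexes groups of four adjacent columns
    unsigned int x4    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    for (unsigned int i = 0; i < NBCALC_PER_THREAD; i++)
    {
        float4 res = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

        for (unsigned int j = 0; j < FILTER_KERNEL_SIZE; j++)
        {
            float c = filter[j];
            res.x += c * tex2D(texRFData, 4 * x4 + 0, yBase + i + j);
            res.y += c * tex2D(texRFData, 4 * x4 + 1, yBase + i + j);
            res.z += c * tex2D(texRFData, 4 * x4 + 2, yBase + i + j);
            res.w += c * tex2D(texRFData, 4 * x4 + 3, yBase + i + j);
        }

        // one 128-bit write instead of four 32-bit writes
        d_RFResult4[x4 + (yBase + i) * (NB_CHANNELS / 4)] = res;
    }
}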

How would you accelerate d_RFResult[x + (yBase+i)*NB_CHANNELS] = res;

other than by dealing with float4?

Thanks

About the coalescing: having non-coalesced memory writes could easily drop your memory write performance from ~60 GB/s down to 0.5 GB/s.

d_RFResult[x +(yBase+i)*NB_CHANNELS] = res;

Your x increases with stride 1 as threadIdx.x increases, so that is good; the first requirement of coalescing is that the threads of a warp access memory with stride 1. The second requirement is probably what is giving you trouble: (yBase+i)*NB_CHANNELS must be a multiple of 32 so that each warp starts at an aligned memory address (or maybe it’s 16 for the half-warp, I don’t recall exactly).

I don’t have any great insight into how to help you achieve this alignment, but perhaps my way of stating it will help you understand it better. Oh, and one more thing: in my tests, writing fully coalesced floats is faster than writing fully coalesced float4s, so I would suggest optimizing the single-float version of the code first.
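
If you do need to force the alignment, one option is to let cudaMallocPitch pad the row stride for you. Roughly (just a sketch; NB_ROWS_OUT is a placeholder for your number of output rows):

float* d_RFResult;
size_t pitchBytes;

// let the runtime pick a row pitch that satisfies the device's alignment requirements
cudaMallocPitch((void**)&d_RFResult, &pitchBytes,
                NB_CHANNELS * sizeof(float),   // width of one output row in bytes
                NB_ROWS_OUT);                  // number of output rows (placeholder)

// in the kernel the write then becomes:
//   float* row = (float*)((char*)d_RFResult + (yBase + i) * pitchBytes);
//   row[x] = res;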

Memory allocated by cudaMalloc is aligned to 256 bytes, which fulfills all alignment criteria. If your warps start writing to this pointer with thread (0,0), you will get full coalescing.

To check the addresses, simply take the address of the global memory location you write to, store it in an additional (per-thread-sized) buffer, read it back to the CPU and check that for every warp start (adr & 0xf) == 0.
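
Something like this (a rough sketch, untested; d_addrLog is an extra buffer with one entry per thread):

__global__ void LogWriteAddresses(float* d_RFResult, unsigned long long* d_addrLog)
{
    unsigned int x     = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    // global linear thread id, one log entry per thread
    unsigned int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

    // record the first address this thread writes to in the result buffer
    d_addrLog[tid] = (unsigned long long)&d_RFResult[x + yBase * NB_CHANNELS];
}

// on the CPU, after copying d_addrLog back into h_addrLog:
//   for (unsigned int t = 0; t < nbThreads; t += 32)   // first thread of each warp
//       assert((h_addrLog[t] & 0xf) == 0);             // 16-byte aligned warp start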

Peter

thanks a lot =)