How would you optimise this simple kernel? It takes too long...

Hi everyone, I wrote a pretty simple kernel which aims at applying a 512-float-long filter column-wise to a 512 (columns) x 2048 (rows) float input data matrix.

So the result is a 512 x (2048-512) matrix (i.e. there are 512 x 1536 dot products of 512-element vectors). I have to compute all that in less than 5 ms.

The filter (a 512-float vector) is stored in constant memory.

The input data matrix is a 2D texture.

The result is written to global memory.
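
For reference, the declarations the kernel relies on look roughly like this (a sketch; the sizes come from the description above, and the NBCALC_PER_THREAD value is only an example):

#define NB_CHANNELS        512   // columns of the input matrix
#define FILTER_KERNEL_SIZE 512   // length of the filter
#define NBCALC_PER_THREAD  16    // output rows handled per thread (example value)

__constant__ float filter[FILTER_KERNEL_SIZE];          // filter in constant memory

texture<float, 2, cudaReadModeElementType> texRFData;   // input matrix bound to a 2D texture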

This was my very simple original kernel:

__global__ void FIR_Kernel(float* d_RFResult)
{
    float res;

    // one thread per column (channel) of the input matrix
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;

    // each block in y covers NBCALC_PER_THREAD output rows
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    for (unsigned int i = 0; i < NBCALC_PER_THREAD; i++)
    {
        res = 0.0f;

        // dot product of the filter with a 512-sample column segment
        for (unsigned int j = 0; j < FILTER_KERNEL_SIZE; j++)
        {
            res += filter[j] * tex2D(texRFData, x, yBase + i + j);
        }

        d_RFResult[x + (yBase+i)*NB_CHANNELS] = res;
    }
}

This code takes 30 ms.

I got it down to 15 ms by packing the floats into float4 structures.

There is still a factor of 3 to gain.

90% of the GPU time is spent in this memory transfer, which seems normal:

d_RFResult[x +(yBase+i)*NB_CHANNELS] = res;

I’m not sure this global memory write is coalesced. I think it is, since consecutive threads in the same warp write to consecutive memory locations, and since d_RFResult is a float* pointer it should be aligned. But I tried writing to random places in d_RFResult (global memory) and it took the same time.

I have to admit that despite the documentation, in practice I find it hard to know whether reads and writes are really coalesced or not.

By packing the data into aligned 128-bit types it is much faster, because of the single 128-bit write instead of four 32-bit writes (hence the near x4 factor). I’m okay with that, but it is true for a single thread and does not show the so-called coalesced behaviour, which should involve many threads in the same warp.
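
To illustrate the packing idea, here is a rough sketch (not the exact code) where one thread covers four adjacent columns and writes them with a single 128-bit store:

__global__ void FIR_Kernel_f4(float4* d_RFResult4)
{
    // x4 indexes groups of four adjacent columns
    unsigned int x4    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    for (unsigned int i = 0; i < NBCALC_PER_THREAD; i++)
    {
        float4 res = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

        for (unsigned int j = 0; j < FILTER_KERNEL_SIZE; j++)
        {
            float c = filter[j];
            res.x += c * tex2D(texRFData, 4 * x4 + 0, yBase + i + j);
            res.y += c * tex2D(texRFData, 4 * x4 + 1, yBase + i + j);
            res.z += c * tex2D(texRFData, 4 * x4 + 2, yBase + i + j);
            res.w += c * tex2D(texRFData, 4 * x4 + 3, yBase + i + j);
        }

        // one 128-bit write instead of four 32-bit writes
        d_RFResult4[x4 + (yBase + i) * (NB_CHANNELS / 4)] = res;
    }
}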

How would you accelerate d_RFResult[x + (yBase+i)*NB_CHANNELS] = res;

other than by dealing with float4?

Thanks

About the coalescing: having non-coalesced memory writes could easily drop your memory write performance from ~60 GB/s down to 0.5 GB/s.

d_RFResult[x +(yBase+i)*NB_CHANNELS] = res;

Your x increases with stride 1 as threadIdx.x increases, so that is good; the first requirement of coalescing is that the threads of a warp access memory with stride 1. The second requirement is probably what is giving you trouble: (yBase+i)*NB_CHANNELS must be a multiple of 32 so that each warp starts at an aligned memory address (or maybe it’s 16 for the half-warp, I don’t recall exactly).

I don’t have any great insight into how to help you achieve this alignment, but perhaps my way of stating it will help you understand it better. Oh, and one more thing: in my tests, writing fully coalesced floats is faster than writing fully coalesced float4s, so I would suggest optimizing the single-float version of the code first.
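
If you do need to force the alignment, one option is to let cudaMallocPitch pad the row stride for you. Roughly (just a sketch; NB_ROWS_OUT is a placeholder for your number of output rows):

float* d_RFResult;
size_t pitchBytes;

// let the runtime pick a row pitch that satisfies the device's alignment requirements
cudaMallocPitch((void**)&d_RFResult, &pitchBytes,
                NB_CHANNELS * sizeof(float),   // width of one output row in bytes
                NB_ROWS_OUT);                  // number of output rows (placeholder)

// in the kernel the write then becomes:
//   float* row = (float*)((char*)d_RFResult + (yBase + i) * pitchBytes);
//   row[x] = res;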

Memory allocated by cudaMalloc is aligned to 256 bytes, which fulfills all alignment criteria. If your warps start writing to this pointer with thread (0,0), you will get full coalescing.

To check the addresses, simply take the address of the global memory location you write to, store it in an additional (per-thread-sized) buffer, read it back to the CPU and check that for every warp start (adr & 0xf) == 0.
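
Something like this (a rough sketch, untested; d_addrLog is an extra buffer with one entry per thread):

__global__ void LogWriteAddresses(float* d_RFResult, unsigned long long* d_addrLog)
{
    unsigned int x     = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yBase = blockIdx.y * NBCALC_PER_THREAD;

    // global linear thread id, one log entry per thread
    unsigned int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

    // record the first address this thread writes to in the result buffer
    d_addrLog[tid] = (unsigned long long)&d_RFResult[x + yBase * NB_CHANNELS];
}

// on the CPU, after copying d_addrLog back into h_addrLog:
//   for (unsigned int t = 0; t < nbThreads; t += 32)   // first thread of each warp
//       assert((h_addrLog[t] & 0xf) == 0);             // 16-byte aligned warp start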

Peter

thanks a lot =)