I have implemented my multiply+accumulate kernel in the following way:
//int blocksize = 256; //multiple of 32
//int nblocks = ((pix3*npix*npix)+blocksize-1)/blocksize; //round to max where pix3 =400*50, npix = 7
// printf("\nnblocks = %d\n", nblocks);
// xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix1, pix2, npix, xcout, n, xc_partial, sum_xc);
__global__ void xc_corr(double* in_im, double* ref_im, int nx, int ny, int npix, double* out1, int n, double* xc_partial, double* sum_xc)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int i = 0;
    int j = 0;
    int p = 0;
    int q = 0;
    if(k < n) //n = 50
    {
        for(i = 0; i < npix; ++i)
        {
            for(j = 0; j < npix; ++j)
            {
                sum_xc[k] = 0; //force update
                for(p = 0; p < ny; ++p) //ny = 400
                {
                    for(q = 0; q < ny; ++q) //ny = 400
                    {
                        xc_partial[(k*ny*ny)+((i*npix+j)*ny*ny)+p*ny+q] = in_im[(k*ny*ny)+p*ny+q] * ref_im[(k*ny*ny)+((i*npix+j)*ny*ny)+p*ny+q]; //array multiplication
                        sum_xc[k] = sum_xc[k] + xc_partial[(k*ny*ny)+((i*npix+j)*ny*ny)+p*ny+q]; //reduction
                    }
                }
                out1[(k*npix*npix)+i*npix+j] = sum_xc[k];
            }
        }
    }
}
When I profiled this kernel, it took about 136 ms to execute, which is 98% of the total execution time. I would like to optimize it to reduce the execution time, but I am not sure where to start.
My thoughts were:
- If I converted the for loops so that the threads within each block execute them in parallel, would that be a better alternative? (See the sketch after this list.)
- Would shared memory be a better alternative? (The same sketch uses it for the per-block reduction.)
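Combining both ideas, here is a minimal sketch of how the kernel might be restructured. It assumes the same argument meanings as xc_corr above (n = 50, npix = 7, ny = 400) and works out the same base offsets as the original indexing; the name xc_corr_block and the one-block-per-output-element mapping are my own choices, not from the original code. Each block computes one (k, i, j) output element, its threads split the ny*ny pixel loop between them, and a shared-memory tree reduction combines the per-thread partial sums, so xc_partial and sum_xc are no longer written through global memory at all (which is likely where most of the 136 ms goes).

__global__ void xc_corr_block(double* in_im, double* ref_im, int ny, int npix, double* out1, int n)
{
    // gridDim.x == n * npix * npix; blockIdx.x encodes one (k, i, j) output element
    int cell = blockIdx.x;
    int k  = cell / (npix * npix);
    int ij = cell % (npix * npix); // ij == i*npix + j

    extern __shared__ double sdata[]; // one double per thread, sized at launch

    // Each thread accumulates a private partial sum over a strided slice
    // of the ny*ny pixels (this replaces the serial p/q loops).
    double acc = 0.0;
    int base_in  = k * ny * ny;
    int base_ref = (k + ij) * ny * ny; // same offset the original indexing works out to
    for(int idx = threadIdx.x; idx < ny * ny; idx += blockDim.x)
        acc += in_im[base_in + idx] * ref_im[base_ref + idx];

    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Shared-memory tree reduction; assumes blockDim.x is a power of two.
    for(int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if(threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if(threadIdx.x == 0)
        out1[k * npix * npix + ij] = sdata[0];
}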
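It would then be launched with one block per output element instead of one thread per k. The host variables (ffcorr1, ref_d, pix2, xcout, n) are taken from the commented-out launch at the top of the post, and 256 threads per block is just an assumption; it must be a power of two for the reduction above, and the dynamic shared-memory size is one double per thread.

int blocksize = 256;                        // power of two, multiple of 32
int nblocks   = n * npix * npix;            // 50 * 7 * 7 = 2450 blocks
size_t smem   = blocksize * sizeof(double); // dynamic shared memory for the reduction
xc_corr_block <<< nblocks, blocksize, smem >>> (ffcorr1, ref_d, pix2, npix, xcout, n);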
My ptxas output looks like this:
ptxas info : Compiling entry function '_Z7xc_corrPdS_iiiS_iS_S_' for 'sm_13'
ptxas info : Used 15 registers, 64+16 bytes smem, 4 bytes cmem[1]
Any feedback/suggestions welcome.
Thanks :)