How to optimise this code

It comes from MATLAB code, but I will try to make it clear ;)

Na is typically 25 and Nb is typically 20, but we might want to increase these values once the CUDA code is orders of magnitude faster than our current MATLAB code. These values are kept low to keep the processing time acceptable.

Nc is typically around 40000.

A = zeros(Na,Nb);
A2 = zeros(Na,Nb);
B = rand(Na,Nb);
C = rand(Na,Nb);
d = rand(1,Nc);

for k = 1 : Nc
    A = A + d(k)*B;
    A2 = A2 + d(k)*C;
end

So basically I am generating 40000-ish matrices and summing them all along the third dimension, so I am left with one matrix of that same size.
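Just to make the target computation concrete, a per-element sketch of what each output element has to end up with would look something like this (untested, the kernel name and the one-thread-per-element launch are only for illustration, not the approach I describe below):

__global__ void per_element_sum(float *A, float *A2, const float *B, const float *C,
                                const float *d, int NaNb, int Nc)
{
    /* one thread per output element: walk the whole d vector and accumulate
       d[k]*B and d[k]*C for this element */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= NaNb) return;

    float b = B[idx];
    float c = C[idx];
    float sum1 = 0.0f;
    float sum2 = 0.0f;
    for (int k = 0; k < Nc; ++k) {
        sum1 += d[k] * b;   /* every element reuses the same d[k] */
        sum2 += d[k] * c;
    }
    A[idx]  = sum1;
    A2[idx] = sum2;
}

With one thread per element that would only be Na*Nb = ~500 threads though, which is why I want to split the work over the d vector instead.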

My idea was to do the following:

Generate a grid of dimension Na x Nb, with 512 threads per block. Each block computes 511 values into a shared array of size 512, and sets the first element of that array to 0 (the first time) or to the sum so far. Then, in a host-side for-loop, I do offset += 511 until I have consumed all the elements of my d vector.

The code I have written just now follows (I will try it on Friday when I have access to my machine again, but if people have suggestions on how to do it better, I might write smarter code tomorrow…). So there might still be some errors/typos in it.

__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc, int offset)
{
  __shared__ float TEMP1[512];
  __shared__ float TEMP2[512];

  int tid = threadIdx.x;
  int index1 = tid + offset - 1;                    /* position in d; thread 0 carries the running sum instead */
  int index2 = blockIdx.x + blockIdx.y * gridDim.x; /* which matrix element this block works on */

  if (tid == 0) { /* Initialise the first element to the appropriate value */
    if (offset == 0) {
      TEMP1[0] = 0.0f;
      TEMP2[0] = 0.0f;
    } else {
      TEMP1[0] = A[index2];
      TEMP2[0] = A2[index2];
    }
  } else { /* fill the 511 remaining elements with calculated values */
    if (index1 < Nc) {
      TEMP1[tid] = d[index1] * B[index2];
      TEMP2[tid] = d[index1] * C[index2];
    } else { /* unless there are no values left in my d array */
      TEMP1[tid] = 0.0f;
      TEMP2[tid] = 0.0f;
    }
  }

  /* do a parallel reduction of the 512 elements (first element is 0, or the sum up to now)
     (basically copied from the parallel reduction sheets from Mark Harris) */
  __syncthreads();
  if (tid < 256) {
    TEMP1[tid] += TEMP1[tid + 256];
    TEMP2[tid] += TEMP2[tid + 256];
  }
  __syncthreads();
  if (tid < 128) {
    TEMP1[tid] += TEMP1[tid + 128];
    TEMP2[tid] += TEMP2[tid + 128];
  }
  __syncthreads();
  if (tid < 64) {
    TEMP1[tid] += TEMP1[tid + 64];
    TEMP2[tid] += TEMP2[tid + 64];
  }
  __syncthreads();
  if (tid < 32) { /* last warp: go through volatile pointers so the compiler re-reads shared memory each step */
    volatile float *t1 = TEMP1;
    volatile float *t2 = TEMP2;
    t1[tid] += t1[tid + 32];  t2[tid] += t2[tid + 32];
    t1[tid] += t1[tid + 16];  t2[tid] += t2[tid + 16];
    t1[tid] += t1[tid + 8];   t2[tid] += t2[tid + 8];
    t1[tid] += t1[tid + 4];   t2[tid] += t2[tid + 4];
    t1[tid] += t1[tid + 2];   t2[tid] += t2[tid + 2];
    t1[tid] += t1[tid + 1];   t2[tid] += t2[tid + 1];
  }
  __syncthreads();

  /* store the block result to global memory */
  if (tid == 0) {
    A[index2] = TEMP1[0];
    A2[index2] = TEMP2[0];
  }
}

And then do something in my calling C code like:

dim3 grid(Na, Nb);

for (int offset = 0; offset < Nc; offset += 511)
  mykernel<<<grid, 512>>>(A, A2, B, C, d, Nc, offset);

Any comments on what I could do better? Is it smart to make my grid the same size as my matrix? It certainly keeps things simpler for a novice in CUDA like me, and I guess the fact that every thread in a block uses the same index into my global arrays A, A2, B and C might help performance. But especially the first part, where I have nested if-else constructs, makes me a bit suspicious about the performance.

The last few days I have been trying to adjust the scanLargeArray example to fit my needs, but that has been an uneasy exercise, I must say, so I definitely have to get up to speed in thinking parallel & thinking CUDA-optimised :D

greetz,

Dènis

You may get a considerable speed-up by just putting the loop
for (offset = 0; offset < Nc; offset += 511)
in the kernel.

Thanks! That should cut down the reading/writing from/to global memory & avoid the overhead of calling kernels in a loop, right? Are there more factors that could speed it up that I am missing? The reads of global memory will still be inside the loop, so it should not provide better latency hiding, I would guess.

Something that also helps is that I can now use all 512 threads for calculations, so there is less thread divergence.

I also think it is probably smarter to call my kernel like this:

mykernel<<<Na*Nb, 512>>>(A, A2, B, C, d, Nc);

to avoid this calculation in the kernel: index2 = blockIdx.x + blockIdx.y * gridDim.x;

Now my code looks like this.

__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc)
{
  __shared__ float TEMP1[512];
  __shared__ float TEMP2[512];
  __shared__ float b_val;
  __shared__ float c_val;

  int tid = threadIdx.x;
  int index2 = blockIdx.x; /* one block per matrix element */

  /* get the values from the B and C matrix that are to be used in this block */
  if (tid == 0) {
    b_val = B[index2];
    c_val = C[index2];
  }

  /* Set the initial values of the shared temporary arrays to 0 */
  TEMP1[tid] = 0.0f;
  TEMP2[tid] = 0.0f;

  /* make sure b_val and c_val are initialised */
  __syncthreads();

  /* loop over all values in the d array for this complete thread block */
  for (int offset = 0; offset < Nc; offset += 512) {
    int index1 = tid + offset;
    if (index1 < Nc) {
      TEMP1[tid] += d[index1] * b_val;
      TEMP2[tid] += d[index1] * c_val;
    }
  }

  /* do a parallel reduction of the 512 elements (basically copied from the parallel reduction sheets from Mark Harris) */
  __syncthreads();
  if (tid < 256) {
    TEMP1[tid] += TEMP1[tid + 256];
    TEMP2[tid] += TEMP2[tid + 256];
  }
  __syncthreads();
  if (tid < 128) {
    TEMP1[tid] += TEMP1[tid + 128];
    TEMP2[tid] += TEMP2[tid + 128];
  }
  __syncthreads();
  if (tid < 64) {
    TEMP1[tid] += TEMP1[tid + 64];
    TEMP2[tid] += TEMP2[tid + 64];
  }
  __syncthreads();
  if (tid < 32) { /* last warp: go through volatile pointers so the compiler re-reads shared memory each step */
    volatile float *t1 = TEMP1;
    volatile float *t2 = TEMP2;
    t1[tid] += t1[tid + 32];  t2[tid] += t2[tid + 32];
    t1[tid] += t1[tid + 16];  t2[tid] += t2[tid + 16];
    t1[tid] += t1[tid + 8];   t2[tid] += t2[tid + 8];
    t1[tid] += t1[tid + 4];   t2[tid] += t2[tid + 4];
    t1[tid] += t1[tid + 2];   t2[tid] += t2[tid + 2];
    t1[tid] += t1[tid + 1];   t2[tid] += t2[tid + 1];
  }
  __syncthreads();

  /* store the block result to global memory */
  if (tid == 0) {
    A[index2] = TEMP1[0];
    A2[index2] = TEMP2[0];
  }
}
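For completeness, the host side I have in mind for this version would be roughly the following (untested sketch, no error checking; the h_* / d_* pointer names are just placeholders, and h_B, h_C, h_d are assumed to already hold the data coming from MATLAB):

/* allocate device buffers, copy inputs, launch one block per output element,
   copy the two result matrices back */
int NaNb = Na * Nb;

float *d_A, *d_A2, *d_B, *d_C, *d_d;
cudaMalloc((void**)&d_A,  NaNb * sizeof(float));
cudaMalloc((void**)&d_A2, NaNb * sizeof(float));
cudaMalloc((void**)&d_B,  NaNb * sizeof(float));
cudaMalloc((void**)&d_C,  NaNb * sizeof(float));
cudaMalloc((void**)&d_d,  Nc   * sizeof(float));

cudaMemcpy(d_B, h_B, NaNb * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_C, h_C, NaNb * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_d, h_d, Nc   * sizeof(float), cudaMemcpyHostToDevice);

/* one block per output element, 512 threads per block */
mykernel<<<NaNb, 512>>>(d_A, d_A2, d_B, d_C, d_d, Nc);

cudaMemcpy(h_A,  d_A,  NaNb * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_A2, d_A2, NaNb * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_A); cudaFree(d_A2); cudaFree(d_B); cudaFree(d_C); cudaFree(d_d);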