How to optimise this code

It comes from MATLAB code, but I will try to make it clear ;)

Na is typically 25 and Nb is typically 20, but we might want to increase these values once the CUDA code is orders of magnitude faster than our current MATLAB code. These values are kept low to keep the processing time acceptable.

Nc is typically around 40000.

A = zeros(Na,Nb);
A2 = zeros(Na,Nb);
B = rand(Na,Nb);
C = rand(Na,Nb);
d = rand(1,Nc);

for k = 1 : Nc
    A = A + d(k)*B;
    A2 = A2 + d(k)*C;
end

So basically I am generating 40000-ish matrices and summing them all along the third dimension, so I am left with one matrix of that same size.
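Just to make the target computation concrete, a per-element sketch of what each output element has to end up with would look something like this (untested, the kernel name and the one-thread-per-element launch are only for illustration, not the approach I describe below):

__global__ void per_element_sum(float *A, float *A2, const float *B, const float *C,
                                const float *d, int NaNb, int Nc)
{
    /* one thread per output element: walk the whole d vector and accumulate
       d[k]*B and d[k]*C for this element */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= NaNb) return;

    float b = B[idx];
    float c = C[idx];
    float sum1 = 0.0f;
    float sum2 = 0.0f;
    for (int k = 0; k < Nc; ++k) {
        sum1 += d[k] * b;   /* every element reuses the same d[k] */
        sum2 += d[k] * c;
    }
    A[idx]  = sum1;
    A2[idx] = sum2;
}

With one thread per element that would only be Na*Nb = ~500 threads though, which is why I want to split the work over the d vector instead.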

My idea was to do the following:

Generate a grid of dimension Na x Nb, with 512 threads per block. Each block computes 511 values into a shared array of size 512, and sets the first element of that array to 0 (the first time) or to the sum so far. Then, in a host-side for-loop, I do offset += 511 until I have consumed all the elements of my d vector.

The code I have written just now follows (I will try it on Friday when I have access to my machine again, but if people have suggestions on how to do it better, I might write smarter code tomorrow…). So there might still be some errors/typos in it.

__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc, int offset)
{
  __shared__ float TEMP1[512];
  __shared__ float TEMP2[512];

  int tid = threadIdx.x;
  int index1 = tid + offset - 1;                    /* position in d; thread 0 carries the running sum instead */
  int index2 = blockIdx.x + blockIdx.y * gridDim.x; /* which matrix element this block works on */

  if (tid == 0) { /* Initialise the first element to the appropriate value */
    if (offset == 0) {
      TEMP1[0] = 0.0f;
      TEMP2[0] = 0.0f;
    } else {
      TEMP1[0] = A[index2];
      TEMP2[0] = A2[index2];
    }
  } else { /* fill the 511 remaining elements with calculated values */
    if (index1 < Nc) {
      TEMP1[tid] = d[index1] * B[index2];
      TEMP2[tid] = d[index1] * C[index2];
    } else { /* unless there are no values left in my d array */
      TEMP1[tid] = 0.0f;
      TEMP2[tid] = 0.0f;
    }
  }

  /* do a parallel reduction of the 512 elements (first element is 0, or the sum up to now)
     (basically copied from the parallel reduction sheets from Mark Harris) */
  __syncthreads();
  if (tid < 256) {
    TEMP1[tid] += TEMP1[tid + 256];
    TEMP2[tid] += TEMP2[tid + 256];
  }
  __syncthreads();
  if (tid < 128) {
    TEMP1[tid] += TEMP1[tid + 128];
    TEMP2[tid] += TEMP2[tid + 128];
  }
  __syncthreads();
  if (tid < 64) {
    TEMP1[tid] += TEMP1[tid + 64];
    TEMP2[tid] += TEMP2[tid + 64];
  }
  __syncthreads();
  if (tid < 32) { /* last warp: go through volatile pointers so the compiler re-reads shared memory each step */
    volatile float *t1 = TEMP1;
    volatile float *t2 = TEMP2;
    t1[tid] += t1[tid + 32];  t2[tid] += t2[tid + 32];
    t1[tid] += t1[tid + 16];  t2[tid] += t2[tid + 16];
    t1[tid] += t1[tid + 8];   t2[tid] += t2[tid + 8];
    t1[tid] += t1[tid + 4];   t2[tid] += t2[tid + 4];
    t1[tid] += t1[tid + 2];   t2[tid] += t2[tid + 2];
    t1[tid] += t1[tid + 1];   t2[tid] += t2[tid + 1];
  }
  __syncthreads();

  /* store the block result to global memory */
  if (tid == 0) {
    A[index2] = TEMP1[0];
    A2[index2] = TEMP2[0];
  }
}

And then do something in my calling C code like:

dim3 grid(Na, Nb);

for (int offset = 0; offset < Nc; offset += 511)
  mykernel<<<grid, 512>>>(A, A2, B, C, d, Nc, offset);

Any comments on what I could do better? Is it smart to make my grid the same size as my matrix? It certainly keeps things simpler for a novice in CUDA like me, and I guess the fact that every thread in a block uses the same index into my global arrays A, A2, B and C might help performance. But especially the first part, where I have nested if-else constructs, makes me a bit suspicious about the performance.

The last few days I have been trying to adjust the scanLargeArray example to fit my needs, but that has been an uneasy exercise, I must say, so I definitely have to get up to speed in thinking parallel & thinking CUDA-optimised :D

greetz,

Dènis

You may get a considerable speed-up by just putting the loop
for (offset = 0; offset < Nc; offset += 511)
in the kernel.

Thanks! That should cut down the reading/writing from/to global memory & avoid the overhead of calling kernels in a loop, right? Are there more factors that could speed it up that I am missing? The reads of global memory will still be inside the loop, so it should not provide better latency hiding, I would guess.

Something that also helps is that I can now use all 512 threads for calculations, so there is less thread divergence.

I also think it is probably smarter to call my kernel like this:

mykernel<<<Na*Nb, 512>>>(A, A2, B, C, d, Nc);

to avoid this calculation in the kernel: index2 = blockIdx.x + blockIdx.y * gridDim.x;

Now my code looks like this.

__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc)
{
  __shared__ float TEMP1[512];
  __shared__ float TEMP2[512];
  __shared__ float b_val;
  __shared__ float c_val;

  int tid = threadIdx.x;
  int index2 = blockIdx.x; /* one block per matrix element */

  /* get the values from the B and C matrix that are to be used in this block */
  if (tid == 0) {
    b_val = B[index2];
    c_val = C[index2];
  }

  /* Set the initial values of the shared temporary arrays to 0 */
  TEMP1[tid] = 0.0f;
  TEMP2[tid] = 0.0f;

  /* make sure b_val and c_val are initialised */
  __syncthreads();

  /* loop over all values in the d array for this complete thread block */
  for (int offset = 0; offset < Nc; offset += 512) {
    int index1 = tid + offset;
    if (index1 < Nc) {
      TEMP1[tid] += d[index1] * b_val;
      TEMP2[tid] += d[index1] * c_val;
    }
  }

  /* do a parallel reduction of the 512 elements (basically copied from the parallel reduction sheets from Mark Harris) */
  __syncthreads();
  if (tid < 256) {
    TEMP1[tid] += TEMP1[tid + 256];
    TEMP2[tid] += TEMP2[tid + 256];
  }
  __syncthreads();
  if (tid < 128) {
    TEMP1[tid] += TEMP1[tid + 128];
    TEMP2[tid] += TEMP2[tid + 128];
  }
  __syncthreads();
  if (tid < 64) {
    TEMP1[tid] += TEMP1[tid + 64];
    TEMP2[tid] += TEMP2[tid + 64];
  }
  __syncthreads();
  if (tid < 32) { /* last warp: go through volatile pointers so the compiler re-reads shared memory each step */
    volatile float *t1 = TEMP1;
    volatile float *t2 = TEMP2;
    t1[tid] += t1[tid + 32];  t2[tid] += t2[tid + 32];
    t1[tid] += t1[tid + 16];  t2[tid] += t2[tid + 16];
    t1[tid] += t1[tid + 8];   t2[tid] += t2[tid + 8];
    t1[tid] += t1[tid + 4];   t2[tid] += t2[tid + 4];
    t1[tid] += t1[tid + 2];   t2[tid] += t2[tid + 2];
    t1[tid] += t1[tid + 1];   t2[tid] += t2[tid + 1];
  }
  __syncthreads();

  /* store the block result to global memory */
  if (tid == 0) {
    A[index2] = TEMP1[0];
    A2[index2] = TEMP2[0];
  }
}
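For completeness, the host side I have in mind for this version would be roughly the following (untested sketch, no error checking; the h_* / d_* pointer names are just placeholders, and h_B, h_C, h_d are assumed to already hold the data coming from MATLAB):

/* allocate device buffers, copy inputs, launch one block per output element,
   copy the two result matrices back */
int NaNb = Na * Nb;

float *d_A, *d_A2, *d_B, *d_C, *d_d;
cudaMalloc((void**)&d_A,  NaNb * sizeof(float));
cudaMalloc((void**)&d_A2, NaNb * sizeof(float));
cudaMalloc((void**)&d_B,  NaNb * sizeof(float));
cudaMalloc((void**)&d_C,  NaNb * sizeof(float));
cudaMalloc((void**)&d_d,  Nc   * sizeof(float));

cudaMemcpy(d_B, h_B, NaNb * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_C, h_C, NaNb * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_d, h_d, Nc   * sizeof(float), cudaMemcpyHostToDevice);

/* one block per output element, 512 threads per block */
mykernel<<<NaNb, 512>>>(d_A, d_A2, d_B, d_C, d_d, Nc);

cudaMemcpy(h_A,  d_A,  NaNb * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_A2, d_A2, NaNb * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_A); cudaFree(d_A2); cudaFree(d_B); cudaFree(d_C); cudaFree(d_d);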