 # How to optimise this code

It is MATLAB code, but I will try to make it clear ;)

Na is typically 25 and Nb is typically 20, but we might want to increase these values once the CUDA code is orders of magnitude faster than our current MATLAB code. These values are kept low to keep the processing time acceptable.

Nc is typically around 40000.

``````
A = zeros(Na,Nb);
A2 = zeros(Na,Nb);
B = rand(Na,Nb);
C = rand(Na,Nb);
d = rand(1,Nc);

for k = 1 : Nc
    A  = A  + d(k)*B;
    A2 = A2 + d(k)*C;
end
``````

So basically I am generating 40000-ish matrices and summing them along the third dimension, so I am left with one matrix of that same size.
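For reference, a minimal serial C version of that loop (function name and layout are my own choosing; matrices stored as flat row-major arrays). Note that since B and C never change inside the loop, the result is mathematically just `sum(d)*B` and `sum(d)*C`, which makes a cheap correctness check for any GPU version:

```c
/* Plain-C reference for the MATLAB loop above: accumulate d[k]*B and d[k]*C
 * over all Nc scalars into A and A2. Matrices are Na*Nb floats, flat. */
static void reference_sum(const float *B, const float *C, const float *d,
                          float *A, float *A2, int Na, int Nb, int Nc)
{
    for (int i = 0; i < Na * Nb; ++i) {
        A[i]  = 0.0f;
        A2[i] = 0.0f;
    }
    for (int k = 0; k < Nc; ++k) {
        for (int i = 0; i < Na * Nb; ++i) {
            A[i]  += d[k] * B[i];
            A2[i] += d[k] * C[i];
        }
    }
}
```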

My idea was to do the following:

Generate a grid of dimensions Na×Nb, with 512 threads per block, and generate 511 values in a shared array of size 512. Set the first element of the shared array to 0 (the first time) or to the sum so far. Then, in a for-loop, I do an offset += 511 until I have consumed all the elements of my d vector.
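To convince myself the offset scheme covers every element of d exactly once, here is a serial C model of one matrix element (slot 0 carries the running sum, slots 1–511 take the next 511 d values, everything is reduced, repeat with offset += 511; names are my own):

```c
#define NTHREADS 512

/* Serial model of the proposed scheme for one matrix element: per pass,
 * slot 0 holds the sum so far, slots 1..511 hold d[tid+offset-1]*b (or 0
 * past the end of d), then a reduction collapses the 512 slots.
 * After the last pass the result should equal b * sum(d). */
static float simulate_passes(const float *d, int Nc, float b)
{
    float temp[NTHREADS];
    float running = 0.0f;

    for (int offset = 0; offset < Nc; offset += NTHREADS - 1) {
        temp[0] = running;                      /* tid == 0: sum up to now */
        for (int tid = 1; tid < NTHREADS; ++tid) {
            int index1 = tid + offset - 1;      /* same indexing as the kernel */
            temp[tid] = (index1 < Nc) ? d[index1] * b : 0.0f;
        }
        running = 0.0f;                         /* reduce the 512 slots */
        for (int tid = 0; tid < NTHREADS; ++tid)
            running += temp[tid];
    }
    return running;
}
```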

The code I have just written follows (I will try it on Friday when I have access to my machine again, but if people have suggestions on how to do it better, I might write smarter code tomorrow…). So there may be some errors/typos in the code.

``````
__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc, int offset)
{
    __shared__ float TEMP1[512];
    __shared__ float TEMP2[512];

    int tid    = threadIdx.x;
    int index1 = tid + offset - 1;                    /* element 0 carries the running sum */
    int index2 = blockIdx.x + blockIdx.y * gridDim.x; /* which matrix element this block handles */

    if (tid == 0) { /* initialise the first element to the appropriate value */
        if (offset == 0) {
            TEMP1[tid] = 0.0f;
            TEMP2[tid] = 0.0f;
        } else {
            TEMP1[tid] = A[index2];
            TEMP2[tid] = A2[index2];
        }
    } else { /* fill the 511 remaining elements with calculated values */
        if (index1 < Nc) {
            TEMP1[tid] = d[index1] * B[index2];
            TEMP2[tid] = d[index1] * C[index2];
        } else { /* unless there are no values left in my d array */
            TEMP1[tid] = 0.0f;
            TEMP2[tid] = 0.0f;
        }
    }
    __syncthreads();

    /* do a parallel reduction of the 512 elements (first element is 0, or the
       sum up to now) (basically copied from the parallel reduction sheets from
       Mark Harris) */
    if (tid < 256) { TEMP1[tid] += TEMP1[tid + 256]; TEMP2[tid] += TEMP2[tid + 256]; }
    __syncthreads();
    if (tid < 128) { TEMP1[tid] += TEMP1[tid + 128]; TEMP2[tid] += TEMP2[tid + 128]; }
    __syncthreads();
    if (tid <  64) { TEMP1[tid] += TEMP1[tid +  64]; TEMP2[tid] += TEMP2[tid +  64]; }
    __syncthreads();
    if (tid < 32) { /* last warp runs in lockstep, no __syncthreads() needed */
        volatile float *t1 = TEMP1;
        volatile float *t2 = TEMP2;
        t1[tid] += t1[tid + 32]; t2[tid] += t2[tid + 32];
        t1[tid] += t1[tid + 16]; t2[tid] += t2[tid + 16];
        t1[tid] += t1[tid +  8]; t2[tid] += t2[tid +  8];
        t1[tid] += t1[tid +  4]; t2[tid] += t2[tid +  4];
        t1[tid] += t1[tid +  2]; t2[tid] += t2[tid +  2];
        t1[tid] += t1[tid +  1]; t2[tid] += t2[tid +  1];
    }

    /* store to global memory */
    if (tid == 0) {
        A[index2]  = TEMP1[0];
        A2[index2] = TEMP2[0];
    }
}
``````
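One detail worth stressing about the Harris-style reduction: a __syncthreads() is required between the 256/128/64 stages (only the final warp of 32 runs in lockstep). The arithmetic pattern itself, modelled serially in plain C to check that slot 0 really ends up with the total:

```c
#define BLOCK 512

/* Serial model of the halving tree reduction: after stages of width
 * 256, 128, ..., 1, element 0 holds the sum of all BLOCK entries. */
static float tree_reduce(float *temp /* BLOCK elements, modified in place */)
{
    for (int s = BLOCK / 2; s >= 1; s /= 2) {
        /* on the GPU, each stage down to s == 32 needs a __syncthreads() */
        for (int tid = 0; tid < s; ++tid)
            temp[tid] += temp[tid + s];
    }
    return temp[0];
}
```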

And then do something like this in my calling C code:

``````
dim3 grid(Na, Nb);

for (offset = 0; offset < Nc; offset += 511) {
    mykernel<<<grid, 512>>>(A, A2, B, C, d, Nc, offset);
}
``````

Any comments on what I could do better? Is it smart to make my grid the same size as my matrix? It certainly keeps things simpler for a CUDA novice like me, and I guess the fact that I use the same index into my global arrays A, A2, B and C within each block might help performance. But especially the first part, with the nested if-else constructs, makes me a bit suspicious about the performance.

I must say, the last few days I have been trying to adjust the scanLargeArray example to fit my needs, but that has been an uneasy exercise, so I definitely have to get up to speed in thinking parallel & thinking CUDA-optimised :D

greetz,

Dènis

You may get a considerable speed-up by just putting the loop `for(offset = 0; offset < Nc; offset += 511)` in the kernel.

Thanks! That should bring the reading/writing from/to global memory down and avoid the overhead of calling kernels in a loop, right? Are there more factors that could speed it up that I am missing? The reading of global memory will be inside the loop, so I would guess it does not provide better latency hiding.

Something that also helps is that I can now use all 512 threads for calculations, so there is less thread divergence.
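With the loop inside the kernel, each thread just accumulates a private partial sum over a strided slice of d before the one final reduction. A serial C model of that, to convince myself the indexing hits every element exactly once (names are my own):

```c
#define NTHREADS 512

/* Serial model: "thread" tid accumulates d[tid], d[tid+512], d[tid+1024], ...
 * into its own slot; summing the 512 slots afterwards must equal sum(d). */
static float strided_sum(const float *d, int Nc)
{
    float temp[NTHREADS];
    for (int tid = 0; tid < NTHREADS; ++tid) {
        temp[tid] = 0.0f;
        for (int index1 = tid; index1 < Nc; index1 += NTHREADS)
            temp[tid] += d[index1];
    }
    float total = 0.0f;             /* stands in for the tree reduction */
    for (int tid = 0; tid < NTHREADS; ++tid)
        total += temp[tid];
    return total;
}
```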

I also think it is probably smarter to call my kernel like this:

``````
mykernel<<<Na*Nb, 512>>>(A, A2, B, C, d, Nc);
``````

To avoid this calculation in the kernel: index2 = blockIdx.x + blockIdx.y * gridDim.x;

Now my code looks like this.

``````
__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc)
{
    __shared__ float TEMP1[512];
    __shared__ float TEMP2[512];
    __shared__ float b_val;
    __shared__ float c_val;

    int tid    = threadIdx.x;
    int index2 = blockIdx.x;

    /* get the values from the B and C matrices that are used in this block */
    if (tid == 0) {
        b_val = B[index2];
        c_val = C[index2];
    }

    /* set the initial values of the shared temporary arrays to 0 */
    TEMP1[tid] = 0.0f;
    TEMP2[tid] = 0.0f;

    /* make sure b_val and c_val are initialised */
    __syncthreads();

    /* loop over all values in the d array for this complete thread block */
    for (int offset = 0; offset < Nc; offset += 512) {
        int index1 = tid + offset;
        if (index1 < Nc) {
            TEMP1[tid] += d[index1] * b_val;
            TEMP2[tid] += d[index1] * c_val;
        }
    }
    __syncthreads();

    /* do a parallel reduction of the 512 elements (basically copied from the
       parallel reduction sheets from Mark Harris) */
    if (tid < 256) { TEMP1[tid] += TEMP1[tid + 256]; TEMP2[tid] += TEMP2[tid + 256]; }
    __syncthreads();
    if (tid < 128) { TEMP1[tid] += TEMP1[tid + 128]; TEMP2[tid] += TEMP2[tid + 128]; }
    __syncthreads();
    if (tid <  64) { TEMP1[tid] += TEMP1[tid +  64]; TEMP2[tid] += TEMP2[tid +  64]; }
    __syncthreads();
    if (tid < 32) { /* last warp runs in lockstep, no __syncthreads() needed */
        volatile float *t1 = TEMP1;
        volatile float *t2 = TEMP2;
        t1[tid] += t1[tid + 32]; t2[tid] += t2[tid + 32];
        t1[tid] += t1[tid + 16]; t2[tid] += t2[tid + 16];
        t1[tid] += t1[tid +  8]; t2[tid] += t2[tid +  8];
        t1[tid] += t1[tid +  4]; t2[tid] += t2[tid +  4];
        t1[tid] += t1[tid +  2]; t2[tid] += t2[tid +  2];
        t1[tid] += t1[tid +  1]; t2[tid] += t2[tid +  1];
    }