 # How to optimise this code

It is MATLAB code, but I will try to make it clear ;)

Na is typically 25 and Nb is typically 20, but we might want to increase these values once the CUDA code is orders of magnitude faster than our current MATLAB code. These values are kept low to keep the processing time acceptable.

Nc is typically around 40000.

``````
A = zeros(Na,Nb);
A2 = zeros(Na,Nb);
B = rand(Na,Nb);
C = rand(Na,Nb);
d = rand(1,Nc);

for k = 1 : Nc
    A  = A  + d(k)*B;
    A2 = A2 + d(k)*C;
end
``````

So basically I am generating 40000-ish matrices and summing them along the third dimension, so I am left with one matrix of that same size.
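For reference, a minimal serial C version of that loop (function name and layout are my own choosing; matrices stored as flat row-major arrays). Note that since B and C never change inside the loop, the result is mathematically just `sum(d)*B` and `sum(d)*C`, which makes a cheap correctness check for any GPU version:

```c
/* Plain-C reference for the MATLAB loop above: accumulate d[k]*B and d[k]*C
 * over all Nc scalars into A and A2. Matrices are Na*Nb floats, flat. */
static void reference_sum(const float *B, const float *C, const float *d,
                          float *A, float *A2, int Na, int Nb, int Nc)
{
    for (int i = 0; i < Na * Nb; ++i) {
        A[i]  = 0.0f;
        A2[i] = 0.0f;
    }
    for (int k = 0; k < Nc; ++k) {
        for (int i = 0; i < Na * Nb; ++i) {
            A[i]  += d[k] * B[i];
            A2[i] += d[k] * C[i];
        }
    }
}
```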

My idea was to do the following:

Generate a grid of dimensions Na×Nb, with 512 threads per block, and generate 511 values in a shared array of size 512. Set the first element of the shared array to 0 (the first time) or to the sum so far. Then, in a for-loop, I do an offset += 511 until I have consumed all the elements of my d vector.
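To convince myself the offset scheme covers every element of d exactly once, here is a serial C model of one matrix element (slot 0 carries the running sum, slots 1–511 take the next 511 d values, everything is reduced, repeat with offset += 511; names are my own):

```c
#define NTHREADS 512

/* Serial model of the proposed scheme for one matrix element: per pass,
 * slot 0 holds the sum so far, slots 1..511 hold d[tid+offset-1]*b (or 0
 * past the end of d), then a reduction collapses the 512 slots.
 * After the last pass the result should equal b * sum(d). */
static float simulate_passes(const float *d, int Nc, float b)
{
    float temp[NTHREADS];
    float running = 0.0f;

    for (int offset = 0; offset < Nc; offset += NTHREADS - 1) {
        temp[0] = running;                      /* tid == 0: sum up to now */
        for (int tid = 1; tid < NTHREADS; ++tid) {
            int index1 = tid + offset - 1;      /* same indexing as the kernel */
            temp[tid] = (index1 < Nc) ? d[index1] * b : 0.0f;
        }
        running = 0.0f;                         /* reduce the 512 slots */
        for (int tid = 0; tid < NTHREADS; ++tid)
            running += temp[tid];
    }
    return running;
}
```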

The code I have just written follows (I will try it on Friday when I have access to my machine again, but if people have suggestions on how to do it better, I might write smarter code tomorrow…). So there may be some errors/typos in the code.

``````
__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc, int offset)
{
    __shared__ float TEMP1[512];
    __shared__ float TEMP2[512];

    int tid    = threadIdx.x;
    int index1 = tid + offset - 1;                    /* element 0 carries the running sum */
    int index2 = blockIdx.x + blockIdx.y * gridDim.x; /* which matrix element this block handles */

    if (tid == 0) { /* initialise the first element to the appropriate value */
        if (offset == 0) {
            TEMP1[tid] = 0.0f;
            TEMP2[tid] = 0.0f;
        } else {
            TEMP1[tid] = A[index2];
            TEMP2[tid] = A2[index2];
        }
    } else { /* fill the 511 remaining elements with calculated values */
        if (index1 < Nc) {
            TEMP1[tid] = d[index1] * B[index2];
            TEMP2[tid] = d[index1] * C[index2];
        } else { /* unless there are no values left in my d array */
            TEMP1[tid] = 0.0f;
            TEMP2[tid] = 0.0f;
        }
    }
    __syncthreads();

    /* do a parallel reduction of the 512 elements (first element is 0, or the
       sum up to now) (basically copied from the parallel reduction sheets from
       Mark Harris) */
    if (tid < 256) { TEMP1[tid] += TEMP1[tid + 256]; TEMP2[tid] += TEMP2[tid + 256]; }
    __syncthreads();
    if (tid < 128) { TEMP1[tid] += TEMP1[tid + 128]; TEMP2[tid] += TEMP2[tid + 128]; }
    __syncthreads();
    if (tid <  64) { TEMP1[tid] += TEMP1[tid +  64]; TEMP2[tid] += TEMP2[tid +  64]; }
    __syncthreads();
    if (tid < 32) { /* last warp runs in lockstep, no __syncthreads() needed */
        volatile float *t1 = TEMP1;
        volatile float *t2 = TEMP2;
        t1[tid] += t1[tid + 32]; t2[tid] += t2[tid + 32];
        t1[tid] += t1[tid + 16]; t2[tid] += t2[tid + 16];
        t1[tid] += t1[tid +  8]; t2[tid] += t2[tid +  8];
        t1[tid] += t1[tid +  4]; t2[tid] += t2[tid +  4];
        t1[tid] += t1[tid +  2]; t2[tid] += t2[tid +  2];
        t1[tid] += t1[tid +  1]; t2[tid] += t2[tid +  1];
    }

    /* store to global memory */
    if (tid == 0) {
        A[index2]  = TEMP1[0];
        A2[index2] = TEMP2[0];
    }
}
``````
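One detail worth stressing about the Harris-style reduction: a __syncthreads() is required between the 256/128/64 stages (only the final warp of 32 runs in lockstep). The arithmetic pattern itself, modelled serially in plain C to check that slot 0 really ends up with the total:

```c
#define BLOCK 512

/* Serial model of the halving tree reduction: after stages of width
 * 256, 128, ..., 1, element 0 holds the sum of all BLOCK entries. */
static float tree_reduce(float *temp /* BLOCK elements, modified in place */)
{
    for (int s = BLOCK / 2; s >= 1; s /= 2) {
        /* on the GPU, each stage down to s == 32 needs a __syncthreads() */
        for (int tid = 0; tid < s; ++tid)
            temp[tid] += temp[tid + s];
    }
    return temp[0];
}
```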

And then do something like this in my calling C code:

``````
dim3 grid(Na, Nb);

for (offset = 0; offset < Nc; offset += 511) {
    mykernel<<<grid, 512>>>(A, A2, B, C, d, Nc, offset);
}
``````

Any comments on what I could do better? Is it smart to make my grid the same size as my matrix? It certainly keeps things simpler for a CUDA novice like me, and I guess the fact that I use the same index into my global arrays A, A2, B and C within each block might help performance. But especially the first part, with the nested if-else constructs, makes me a bit suspicious about the performance.

I must say, the last few days I have been trying to adjust the scanLargeArray example to fit my needs, but that has been an uneasy exercise, so I definitely have to get up to speed in thinking parallel & thinking CUDA-optimised :D

greetz,

Dènis

You may get a considerable speed-up by just putting the loop `for(offset = 0; offset < Nc; offset += 511)` in the kernel.

Thanks! That should bring the reading/writing from/to global memory down and avoid the overhead of calling kernels in a loop, right? Are there more factors that could speed it up that I am missing? The reading of global memory will be inside the loop, so I would guess it does not provide better latency hiding.

Something that also helps is that I can now use all 512 threads for calculations, so there is less thread divergence.
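With the loop inside the kernel, each thread just accumulates a private partial sum over a strided slice of d before the one final reduction. A serial C model of that, to convince myself the indexing hits every element exactly once (names are my own):

```c
#define NTHREADS 512

/* Serial model: "thread" tid accumulates d[tid], d[tid+512], d[tid+1024], ...
 * into its own slot; summing the 512 slots afterwards must equal sum(d). */
static float strided_sum(const float *d, int Nc)
{
    float temp[NTHREADS];
    for (int tid = 0; tid < NTHREADS; ++tid) {
        temp[tid] = 0.0f;
        for (int index1 = tid; index1 < Nc; index1 += NTHREADS)
            temp[tid] += d[index1];
    }
    float total = 0.0f;             /* stands in for the tree reduction */
    for (int tid = 0; tid < NTHREADS; ++tid)
        total += temp[tid];
    return total;
}
```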

I also think it is probably smarter to call my kernel like this:

``````
mykernel<<<Na*Nb, 512>>>(A, A2, B, C, d, Nc);
``````

To avoid this calculation in the kernel: index2 = blockIdx.x + blockIdx.y * gridDim.x;

Now my code looks like this.

``````
__global__ void mykernel(float *A, float *A2, float *B, float *C, float *d, int Nc)
{
    __shared__ float TEMP1[512];
    __shared__ float TEMP2[512];
    __shared__ float b_val;
    __shared__ float c_val;

    int tid    = threadIdx.x;
    int index2 = blockIdx.x;

    /* get the values from the B and C matrices that are used in this block */
    if (tid == 0) {
        b_val = B[index2];
        c_val = C[index2];
    }

    /* set the initial values of the shared temporary arrays to 0 */
    TEMP1[tid] = 0.0f;
    TEMP2[tid] = 0.0f;

    /* make sure b_val and c_val are initialised */
    __syncthreads();

    /* loop over all values in the d array for this complete thread block */
    for (int offset = 0; offset < Nc; offset += 512) {
        int index1 = tid + offset;
        if (index1 < Nc) {
            TEMP1[tid] += d[index1] * b_val;
            TEMP2[tid] += d[index1] * c_val;
        }
    }
    __syncthreads();

    /* do a parallel reduction of the 512 elements (basically copied from the
       parallel reduction sheets from Mark Harris) */
    if (tid < 256) { TEMP1[tid] += TEMP1[tid + 256]; TEMP2[tid] += TEMP2[tid + 256]; }
    __syncthreads();
    if (tid < 128) { TEMP1[tid] += TEMP1[tid + 128]; TEMP2[tid] += TEMP2[tid + 128]; }
    __syncthreads();
    if (tid <  64) { TEMP1[tid] += TEMP1[tid +  64]; TEMP2[tid] += TEMP2[tid +  64]; }
    __syncthreads();
    if (tid < 32) { /* last warp runs in lockstep, no __syncthreads() needed */
        volatile float *t1 = TEMP1;
        volatile float *t2 = TEMP2;
        t1[tid] += t1[tid + 32]; t2[tid] += t2[tid + 32];
        t1[tid] += t1[tid + 16]; t2[tid] += t2[tid + 16];
        t1[tid] += t1[tid +  8]; t2[tid] += t2[tid +  8];
        t1[tid] += t1[tid +  4]; t2[tid] += t2[tid +  4];
        t1[tid] += t1[tid +  2]; t2[tid] += t2[tid +  2];
        t1[tid] += t1[tid +  1]; t2[tid] += t2[tid +  1];
    }