CUDA for statistical process - CPU code faster than GPU code

Hi, I'm working on a genetic algorithm that basically tries to calculate a moving-window score over 2D matrices.

The CPU implementation of the algorithm is fairly simple. This function is called in a double for loop in main, for a fixed i and 400 j values:

double score(int **r, double **theta, int i, int j)
{
    int k;
    double sum = 0.0;
    // Sum the log-odds over a window of width w starting at position j
    for (k = 0; k < w; k++)
        sum += log(theta[r[i][j+k]][k] / rbackground[i][j+k] + 10e-8);
    return sum;
}

I tried to implement the same in CUDA and found that it ran slower, even though it should calculate the sum for each j value independently. Here's my CUDA code:

__global__ void GPUscore(int* rd, float* rbackgroundd, float* thetad, float* rowscored, int w, int seqlength)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    __shared__ float temp;
    temp = 0;

    for (int k = 0; k < w; k++)
        if (3*by*200+bx*200+tx+ty*20 < seqlength-w+1)
            temp += log(thetad[(w*rd[3*by*200+bx*200+tx+ty*20])+k] / rbackgroundd[3*by*200+bx*200+tx+ty*20] + 10e-8);
    rowscored[3*by*200+bx*200+tx+ty*20] = temp;
}

void scorecalc(int **r, double **theta, int i)
{
    int size;
    int *rtemp = new int[seqlength[i]];
    float *rbacktemp = new float[seqlength[i]];
    float *rowscoretemp = new float[seqlength[i]];
    float *thetatemp = new float[4*w];

    // Flatten theta (4 x w) into a row-major float array
    for (int k = 0; k < 4; k++)
        for (int l = 0; l < w; l++)
            thetatemp[k*w + l] = theta[k][l];
    // Copy row i of r and rbackground into contiguous host buffers
    for (int m = 0; m < seqlength[i]; m++) {
        rowscoretemp[m] = 0;
        rtemp[m] = r[i][m];
        rbacktemp[m] = rbackground[i][m];
    }

    // Load rd and rbackgroundd to the device
    int* rd;
    size = seqlength[i] * sizeof(int);
    cudaMalloc((void**)&rd, size);
    cudaMemcpy(rd, rtemp, size, cudaMemcpyHostToDevice);
    float* thetad;
    size = 4 * w * sizeof(float);
    cudaMalloc((void**)&thetad, size);
    cudaMemcpy(thetad, thetatemp, size, cudaMemcpyHostToDevice);
    float* rowscored;
    size = seqlength[i] * sizeof(float);
    cudaMalloc((void**)&rowscored, size);
    cudaMemcpy(rowscored, rowscoretemp, size, cudaMemcpyHostToDevice);
    float* rbackgroundd;
    size = seqlength[i] * sizeof(float);
    cudaMalloc((void**)&rbackgroundd, size);
    cudaMemcpy(rbackgroundd, rbacktemp, size, cudaMemcpyHostToDevice);
    dim3 threads(20,10);
    dim3 grid(3, 1);

    GPUscore<<<grid,threads>>>(rd, rbackgroundd, thetad, rowscored, w, seqlength[i]);
    cudaMemcpy(rowscoretemp, rowscored, size, cudaMemcpyDeviceToHost);
    //cudaMemcpy(rbacktemp, rbackgroundd, size, cudaMemcpyDeviceToHost);

    for (int n = 0; n < seqlength[i]; n++)
        rowscore[i][n] = rowscoretemp[n];

    // Free device memory
    cudaFree(rd);
    cudaFree(thetad);
    cudaFree(rowscored);
    cudaFree(rbackgroundd);

    // Free host buffers (allocated with new[], so delete[] is required)
    delete[] thetatemp;
    delete[] rtemp;
    delete[] rbacktemp;
    delete[] rowscoretemp;
}

Any help is appreciated.

  • VN

I'm not sure which GPU you have, but with this grid configuration (3,1) you launch only 3 blocks, so at most 3 multiprocessors (MPs) on the graphics card are in use, and it most likely has more than that.
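
One way to avoid a hard-coded grid(3, 1) would be to derive the block count from the number of window positions. A minimal host-side sketch, assuming a 1D launch; the helper names blocksFor and windowPositions are hypothetical and not from the code above:

```cpp
#include <cstddef>

// Hypothetical helper: smallest block count such that
// blocks * threadsPerBlock >= positions (ceiling division).
std::size_t blocksFor(std::size_t positions, std::size_t threadsPerBlock)
{
    return (positions + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of window start positions for a sequence of length seqLength and
// window width w, matching the kernel's bound check (index < seqlength - w + 1).
std::size_t windowPositions(std::size_t seqLength, std::size_t w)
{
    return seqLength >= w ? seqLength - w + 1 : 0;
}
```

Note that a few hundred positions still give only a handful of blocks at typical block sizes (for example 4 blocks of 128 threads for 400 positions), so an input this small cannot occupy every multiprocessor whatever the grid shape is.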

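Your use of __syncthreads() and shared variables seems to be off. Looking at your original C code, temp should not be shared: a shared variable is a single copy per block, so every thread in the block ends up accumulating into the same temp. The __syncthreads() call is superfluous as well.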
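
To make the point about temp concrete: in the CPU code every call to score has its own sum, so every window position needs its own accumulator. Below is a host-side sketch of that per-position computation on the flattened layout used by scorecalc (theta[s][k] stored at s*w + k). It follows the CPU indexing (r[j+k], rbackground[j+k]). The name referenceScores and the std::vector interface are assumptions for illustration; something like this could serve as a reference when checking kernel output.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: one private accumulator per window position j,
// mirroring the CPU score() on flattened arrays.
// thetaFlat holds theta[s][k] at index s * w + k.
std::vector<float> referenceScores(const std::vector<int>& r,
                                   const std::vector<float>& rbackground,
                                   const std::vector<float>& thetaFlat,
                                   std::size_t w)
{
    std::vector<float> scores;
    if (w == 0 || r.size() < w) return scores;   // no complete window
    const std::size_t positions = r.size() - w + 1;
    scores.resize(positions);
    for (std::size_t j = 0; j < positions; ++j) {
        float sum = 0.0f;                        // private to position j
        for (std::size_t k = 0; k < w; ++k) {
            const std::size_t symbol = static_cast<std::size_t>(r[j + k]);
            sum += std::log(thetaFlat[symbol * w + k] / rbackground[j + k] + 10e-8f);
        }
        scores[j] = sum;
    }
    return scores;
}
```

In a kernel, the equivalent of sum would be an ordinary local variable in each thread rather than a __shared__ one.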

A blockDim of 200 (20,10) is NOT a multiple of 32.

So you are wasting part of a warp right away: 6*32 = 192, which leaves only 8 threads for a seventh warp.

Each time that last warp is scheduled, only 8 of its 32 threads participate, so 75% of that warp is wasted.
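
The arithmetic above can be written out as a small host-side helper. This is only a sketch; the names WarpUsage and warpUsage are hypothetical, and the warp size of 32 is hard-coded as an assumption that holds for NVIDIA GPUs to date.

```cpp
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // assumed warp size

struct WarpUsage {
    std::size_t warps;           // warps scheduled per block
    std::size_t lastWarpActive;  // active threads in the final warp
    std::size_t idleLanes;       // lanes in the block that have no thread
};

// Hypothetical helper: how a given block size maps onto whole warps.
WarpUsage warpUsage(std::size_t threadsPerBlock)
{
    WarpUsage u{};
    if (threadsPerBlock == 0) return u;
    u.warps = (threadsPerBlock + kWarpSize - 1) / kWarpSize;  // round up
    u.idleLanes = u.warps * kWarpSize - threadsPerBlock;
    u.lastWarpActive = kWarpSize - u.idleLanes;
    return u;
}
```

For threads(20,10), warpUsage(200) gives 7 warps, 8 active threads in the last one and 24 idle lanes; a block size of 192 or 256 would leave no idle lanes.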