I have a CUDA program that calls a kernel repeatedly inside a for loop. The code computes each row of a matrix from the values computed in the previous row, until the entire matrix is done; it is basically a dynamic programming algorithm. The loop below fills the (i,j) entry of many separate matrices in parallel, with one kernel launch per entry:
struct timeval start3time, end3time;
long useconds;

for (i = 1; i <= xdim; i++) {
    for (j = 1; j <= ydim; j++) {
        gettimeofday(&start3time, NULL);
        assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
        gettimeofday(&end3time, NULL);
        useconds = end3time.tv_usec - start3time.tv_usec;
        printf("Time for i=%d j=%d is %ld\n", i, j, useconds);
    }
}
The kernel assign5 is straightforward:

__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    char ch = database[j + id];
    Z[i + id] = (Z[x + id] + Z[y + id] + Z[z + id]) * dev_matrix[i][index[ch - 'A']];
}
My problem is that when I run this program, the time for each (i,j) iteration is 3 or 4 microseconds early on. Then the time spikes to around 120 microseconds for the next three iterations, after which it is 8 or 9 microseconds per iteration, with occasional spikes. Below is sample output:
Time for i=18 j=41 is 3
Time for i=18 j=42 is 3
Time for i=18 j=43 is 32
Time for i=18 j=44 is 117
Time for i=18 j=45 is 118
Time for i=18 j=46 is 3
Time for i=18 j=47 is 9
Time for i=18 j=48 is 9
I don't understand why this is happening, and I don't see a race condition between threads. If I add

cudaThreadSynchronize();

right after the first loop (see the sketch below), then the time for every iteration is 3 or 4 microseconds. But the synchronize call itself takes about 700 microseconds. It seems like CUDA is performing many operations at low cost in the earlier iterations and then charging a lot for the later ones. Why does it do this? Any help would be appreciated.
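Concretely, the placement looks like this (a minimal sketch based on the loop above; the comment marks the roughly 700-microsecond cost I measured):

for (i = 1; i <= xdim; i++) {
    for (j = 1; j <= ydim; j++) {
        gettimeofday(&start3time, NULL);
        assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
        gettimeofday(&end3time, NULL);
        useconds = end3time.tv_usec - start3time.tv_usec;
        printf("Time for i=%d j=%d is %ld\n", i, j, useconds);
    }
    cudaThreadSynchronize();  // this call alone takes ~700 microseconds
}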
Also, the variables Z and database are device global arrays, and dev_matrix and index are device constant arrays. The memory access is coalesced. Thanks very much.
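In case it matters, the declarations look roughly like this (a sketch; the sizes and the 26-letter alphabet are placeholder assumptions, not my real values):

#define DB_SIZE 4096   // placeholder size for the input string
#define XDIM    512    // placeholder, matches the i loop bound

__device__ char database[DB_SIZE];            // device global array
__constant__ float dev_matrix[XDIM + 1][26];  // device constant array
__constant__ int index[26];                   // device constant array, indexed by ch - 'A'
// Z is also in device global memory; I allocate it on the host with
// cudaMalloc and pass the pointer to assign5.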