Performance hit in CUDA program that calls kernel repeatedly within a for loop

I have a CUDA program that calls the kernel repeatedly within a for loop.
The code computes all rows of a matrix by using the values computed in
the previous one until the entire matrix is done. This is basically a
dynamic programming algorithm. The code below fills the (i,j) entry of
many separate matrices in parallel with the kernel.

for(i = 1; i <=xdim; i++){

for(j = 1; j <= ydim; j++){

gettimeofday(&start3time, NULL);
assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
gettimeofday(&end3time, NULL); 
useconds=end3time.tv_usec-start3time.tv_usec;  
printf("Time for i=%d j=%d is %ld\n", i, j, useconds); 

}

}

The kernel assign5 is straightforward

global void assign5(float* Z, int i, int j, int x, int y, int z) {

int id = threadIdx.x + blockIdx.x * blockDim.x;

char ch = database[j + id];

Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id])*dev_matrix[i][index[ch - ‘A’]];

}

}

My problem is that when I run this program the time for each i and j is
3 or 4 microseconds for earlier iterations. Then the time
spikes to around 120 for the next three after which it is 8 or 9
for each iteration with occasional spikes. Below is the sample
output.

Time for i=18 j=41 is 3
Time for i=18 j=42 is 3
Time for i=18 j=43 is 32
Time for i=18 j=44 is 117
Time for i=18 j=45 is 118
Time for i=18 j=46 is 3
Time for i=18 j=47 is 9
Time for i=18 j=48 is 9

I don’t understand why this is happening. I don’t see a thread race condition. If I add

cudaThreadSynchronize();

right after the first loop then the time for all iterations is 3 or 4 microseconds. But
the sync command takes 700 microseonds. It seems like CUDA is performing many operations at
low cost in the earlier iterations and then charges a lot for later ones. Why does
it do this? Any help would be appreciated.

Also, the variables Z and database are device global arrays and dev_matrix and index and
device constant arrays. The memory access is coalesced. Thanks very much.

The time measurement is wrong:

gettimeofday(&start3time, NULL);
assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z); // after this call the control is given back to the cpu, it never computes the real time.
gettimeofday(&end3time, NULL);

You need to sue something like this:

cudaEvent_t start,stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float gputime;

// here start the loop
{
cudaEventRecord(start,0);

// do the gpu work and related calls

cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&gputime,start,stop);

printf("Time = %g \n", gputime/1000.0f);

}

cudaEventDestroy(start);
cudaEventDestroy(stop) ;

Thanks, I’m getting consistent timing results now. While doing research on this topic
I came across threads in this forum which discuss the CPU launch kernel time. I rewrote
my program to make fewer kernel calls and this indeed reduces the running time.