I use CUDA to code the longest common subsequence (LCS) problem.
After execution, I noticed that the kernel execution time (I placed cutGetTimerValue(timer) before cudaMemcpy) stayed constant despite increasing the size of the two sequences. (My problem is to find the longest common subsequence of 2 sequences; this problem is used in bioinformatics.)

So is it logical to have a constant kernel execution time?
Or is it a problem of contention between threads accessing global memory?

It is probably only your timing that is wrong. Kernel launches are asynchronous, so you may well be measuring only the kernel launch time, not the execution time. Add a cudaThreadSynchronize() call before you read your timer; this forces the host to spin until the kernel has finished executing.
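A minimal sketch of corrected timing using CUDA events instead of the cutil timer (adding cudaThreadSynchronize() before cutGetTimerValue(timer) achieves the same thing). The launch configuration (grid, block) and device pointers (d_C, d_A, d_B) are assumed to be set up elsewhere:

```cuda
// Time the kernel with CUDA events; cudaEventSynchronize makes sure
// the kernel has actually finished before the elapsed time is read.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
LCS<<<grid, block>>>(d_C, d_A, d_B, wA, wB);  // asynchronous launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // block the host until the kernel is done

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

With this in place the measured time should grow with the sequence sizes instead of staying constant.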

As an aside, the correct spelling of your screen name is Räikkönen…

Thanks a lot for your help. But now I have another problem: since my timing was wrong, I have to reduce the execution time.

This is my kernel:

__global__ void
LCS(int* C, char* A, char* B, int wA, int wB)
{
    int wC = wA + 1;   // row stride of the flattened (wB+1) x (wA+1) table C
    int i, j, mn;      // these were undeclared in the original post
    int current_Index = blockIdx.x * blockDim.x + threadIdx.x;

    // First sweep: each thread starts on a cell of the top row and
    // walks down its anti-diagonal.
    if (current_Index < wA)
    {
        j = current_Index;
        i = 0;
        mn = min(current_Index, wB);
        while ((i <= mn) && (j >= 0))
        {
            if (i == 0 || j == 0)
            {
                C[i*wC + j] = 0;
            }
            else if (B[i-1] == A[j-1])
            {
                C[i*wC + j] = C[(i-1)*wC + j-1] + 1;
            }
            else  // max(C[i-1][j], C[i][j-1])
            {
                C[i*wC + j] = max(C[i*wC + j-1], C[(i-1)*wC + j]);
            }
            i = i + 1;
            j = j - 1;
        }
    }
    __syncthreads();   // must sit outside the divergent branch: every
                       // thread of the block has to reach it

    // Second sweep: each thread starts on a cell of the rightmost
    // column and walks down its anti-diagonal.
    if (current_Index <= wB)
    {
        j = wA;
        i = current_Index;
        while ((i <= wB) && (j >= 0))   // the j >= 0 test guards against an
                                        // out-of-bounds write when wA < wB
        {
            if (i == 0 || j == 0)
            {
                C[i*wC + j] = 0;
            }
            else if (A[j-1] == B[i-1])
            {
                C[i*wC + j] = C[(i-1)*wC + j-1] + 1;
            }
            else
            {
                C[i*wC + j] = max(C[i*wC + j-1], C[(i-1)*wC + j]);
            }
            i = i + 1;
            j = j - 1;
        }
    }
    __syncthreads();
}

The algorithm for solving the LCS problem fills a matrix with the lengths of the longest common subsequences of prefixes; the length of the full LCS ends up in the last cell of the matrix, C[N*M].

The parallelism exists between the anti-diagonals: I can compute C[i][j] once I know C[i][j-1], C[i-1][j-1] and C[i-1][j] (the west, north-west and north cells).
