I use CUDA to code the longest common subsequence (LCS) problem.
After execution, I noticed that the kernel execution time (I placed cutGetTimerValue(timer) before cudaMemcpy) stayed constant despite increasing the size of the two sequences. (My problem is to find the longest common subsequence of 2 sequences; this problem is used in bioinformatics.)

So is it logical to have a constant kernel execution time?
Or is it a problem of contention between threads accessing global memory?

It is probably only your timing that is wrong. Kernel launches are asynchronous, so you may well be measuring only the kernel launch time, not the execution time. Add a cudaThreadSynchronize() call before you read your timer; this forces the host to spin until the kernel has finished executing.
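A minimal sketch of corrected timing using CUDA events instead of the cutil timer (adding cudaThreadSynchronize() before cutGetTimerValue(timer) achieves the same thing). The launch configuration (grid, block) and device pointers (d_C, d_A, d_B) are assumed to be set up elsewhere:

```cuda
// Time the kernel with CUDA events; cudaEventSynchronize makes sure
// the kernel has actually finished before the elapsed time is read.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
LCS<<<grid, block>>>(d_C, d_A, d_B, wA, wB);  // asynchronous launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // block the host until the kernel is done

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

With this in place the measured time should grow with the sequence sizes instead of staying constant.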

As an aside, the correct spelling of your screen name is Räikkönen…

Thanks a lot for your help. But now I have another problem: since my timing was wrong, I have to reduce the execution time.

This is my kernel:

__global__ void
LCS(int* C, char* A, char* B, int wA, int wB)
{
    int wC = wA + 1;   // row stride of the flattened (wB+1) x (wA+1) table C
    int i, j, mn;      // these were undeclared in the original post
    int current_Index = blockIdx.x * blockDim.x + threadIdx.x;

    // First sweep: each thread starts on a cell of the top row and
    // walks down its anti-diagonal.
    if (current_Index < wA)
    {
        j = current_Index;
        i = 0;
        mn = min(current_Index, wB);
        while ((i <= mn) && (j >= 0))
        {
            if (i == 0 || j == 0)
            {
                C[i*wC + j] = 0;
            }
            else if (B[i-1] == A[j-1])
            {
                C[i*wC + j] = C[(i-1)*wC + j-1] + 1;
            }
            else  // max(C[i-1][j], C[i][j-1])
            {
                C[i*wC + j] = max(C[i*wC + j-1], C[(i-1)*wC + j]);
            }
            i = i + 1;
            j = j - 1;
        }
    }
    __syncthreads();   // must sit outside the divergent branch: every
                       // thread of the block has to reach it

    // Second sweep: each thread starts on a cell of the rightmost
    // column and walks down its anti-diagonal.
    if (current_Index <= wB)
    {
        j = wA;
        i = current_Index;
        while ((i <= wB) && (j >= 0))   // the j >= 0 test guards against an
                                        // out-of-bounds write when wA < wB
        {
            if (i == 0 || j == 0)
            {
                C[i*wC + j] = 0;
            }
            else if (A[j-1] == B[i-1])
            {
                C[i*wC + j] = C[(i-1)*wC + j-1] + 1;
            }
            else
            {
                C[i*wC + j] = max(C[i*wC + j-1], C[(i-1)*wC + j]);
            }
            i = i + 1;
            j = j - 1;
        }
    }
    __syncthreads();
}

The algorithm for solving the LCS problem fills a matrix with the lengths of the longest common subsequences of prefixes; the length of the full LCS ends up in the last cell of the matrix, C[N*M].

The parallelism exists between the anti-diagonals: I can compute C[i][j] once I know C[i][j-1], C[i-1][j-1] and C[i-1][j] (the west, north-west and north cells).
