Hi all,
I am trying to write a program that reads and writes the same array. For example, my CUDA kernel code is as below:
__global__ void floydWarshall_kernel( int* graph, int n )
{
    int j = threadIdx.x;
    int i = threadIdx.y;
    if ((i < n) && (j < n) && (i != j))
    {
        for (int k = 0; k < n; ++k)
        {
            int ij = graph[j*n+i];
            int ik = graph[k*n+i];
            int kj = graph[j*n+k];
            if ((ik != 0) && (kj != 0))
                if ((ij >= ik + kj) || (ij == 0))
                    graph[j*n+i] = ik + kj;
        }
    }
}
With this kernel I am not getting the expected result. I guess there is a race condition among threads that read from and write to the same memory addresses. How can I avoid it? I read that the __syncthreads() function can be used to avoid race conditions. Can anyone clarify how and where to use it in the above kernel? Also, __syncthreads() only synchronizes threads within the same block. How do I avoid race conditions between threads that reside in different blocks?
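For what it's worth, the usual fix here is not __syncthreads() at all: Floyd-Warshall needs k to be the *outermost* loop, and on the GPU that means the host loops over k and launches one kernel per k value. Consecutive kernel launches on the same stream run in order, so the launch boundary acts as the grid-wide barrier you are asking about. Below is a serial C sketch (hypothetical function names, same 0-means-no-edge convention and same `j*n+i` layout as your code) of the per-k sweep that each launch would perform:

```c
/* One relaxation sweep for a fixed k: this is the work one kernel
 * launch would do over all (i, j). Updating graph in place is safe
 * for a fixed k because entries in row k and column k are never
 * written during step k: their update would have to go through the
 * diagonal entry graph[k*n+k], which is 0 ("no edge"), so the
 * (ik != 0 && kj != 0) guard always rejects it. */
static void fw_sweep(int *graph, int n, int k)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ij = graph[j * n + i];   /* same layout as the kernel */
            int ik = graph[k * n + i];
            int kj = graph[j * n + k];
            if (ik != 0 && kj != 0 && (ij >= ik + kj || ij == 0))
                graph[j * n + i] = ik + kj;
        }
    }
}

/* Host side: k is the outermost loop. On the GPU each fw_sweep call
 * becomes one kernel launch; the gap between launches is the
 * grid-wide barrier between k iterations. */
void floyd_warshall(int *graph, int n)
{
    for (int k = 0; k < n; ++k)
        fw_sweep(graph, n, k);
}
```

The kernel itself then drops its `for (k ...)` loop and takes k as a parameter, so within one launch no thread ever reads a value that another thread of the same launch writes.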
It is giving me the correct result now. Also, for n = 1000, the GPU version is about 150 times faster. But as I increase n, the GPU's advantage degrades; for example, for n = 4000 the GPU is only 4 times faster than the CPU version. Is this due to increased global memory access? If so, how can it be reduced?
void floydWarshall_cpu( int *A, int n )
{
    int i, j, k;
    for (i = 0; i < n; i++){
        for (j = 0; j < n; j++){
            if (i != j){
                for (k = 0; k < n; ++k){
                    int ij = A[j*n+i];
                    int ik = A[k*n+i];
                    int kj = A[j*n+k];
                    if ((ik * kj != 0) && (i != j))
                        if ((ij >= ik + kj) || (ij == 0))
                            A[j*n+i] = ik + kj;
                }// for k
            }
        }// for j
    }// for i
}
This has a race condition, since the loop reads entries A(s,t) that were already updated for (s,t) < (i,j).
Consider another version that writes the updates to a separate matrix:
void floydWarshall_cpu_v2( int *A, int *B, int n )
{
    int i, j, k;
    for (i = 0; i < n; i++){
        for (j = 0; j < n; j++){
            if (i != j){
                int best = A[j*n+i];     /* 0 still encodes "no path" */
                for (k = 0; k < n; ++k){
                    int ik = A[k*n+i];
                    int kj = A[j*n+k];
                    if ((ik != 0) && (kj != 0))
                        if ((best >= ik + kj) || (best == 0))
                            best = ik + kj;
                }// for k
                B[j*n+i] = best;         /* write B even when no k improves the path */
            }else{
                B[j*n+i] = A[j*n+i];
            }
        }// for j
    }// for i
}
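A hedged side note: once k is hoisted to the outermost loop (the standard Floyd-Warshall schedule), the separate B matrix is no longer needed. During step k the sweep never writes row k or column k, because that update would route through the zero diagonal entry and the no-edge guard rejects it, so every value read within one sweep is stable. The small C check below (hypothetical helper names, same 0-means-no-edge encoding as above) shows that the in-place and ping-pong schedules produce identical matrices:

```c
#include <string.h>

/* One fixed-k sweep reading src and writing dst; dst may alias src,
 * because for a fixed k the only cells that change are outside row k
 * and column k, and each (i, j) reads its own cell before writing it. */
static void sweep(const int *src, int *dst, int n, int k)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int ij = src[j * n + i];
            int ik = src[k * n + i];
            int kj = src[j * n + k];
            if (i != j && ik != 0 && kj != 0 && (ij >= ik + kj || ij == 0))
                dst[j * n + i] = ik + kj;
            else
                dst[j * n + i] = ij;
        }
    }
}

/* In-place schedule: k outermost, a single buffer. */
void fw_inplace(int *A, int n)
{
    for (int k = 0; k < n; ++k)
        sweep(A, A, n, k);
}

/* Ping-pong schedule: k outermost, alternating buffers, in the spirit
 * of floydWarshall_cpu_v2 but with one buffer swap per k. The result
 * ends up in A. */
void fw_pingpong(int *A, int *B, int n)
{
    for (int k = 0; k < n; ++k) {
        sweep(A, B, n, k);
        memcpy(A, B, (size_t)n * n * sizeof *A);
    }
}
```

On the GPU the same ping-pong would be two device buffers whose pointers are swapped between consecutive kernel launches, but the equality above is why a single buffer suffices.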