Hello, everyone.

These days I have been trying to call my kernel with <<<1,256>>> and with <<<16,16>>>. With <<<1,256>>> I get the right result, but with <<<16,16>>> I do not: the answer comes out as 1.#QNAN0. I do not know why.

At the beginning of my program, I define i as:

int i;
i=blockIdx.x*blockDim.x+threadIdx.x;

Thanks!


Try this.

Case 1:

[b]dim3 blockSize(256, 1, 1);
dim3 gridSize(1, 1, 1);

<<<gridSize, blockSize>>>

int i;
i=blockIdx.x*blockDim.x+threadIdx.x;[/b]

Case 2:

[b]dim3 blockSize(16, 1, 1);
dim3 gridSize(16, 1, 1);

<<<gridSize, blockSize>>>

int i;
i=blockIdx.x*blockDim.x+threadIdx.x;[/b]


Thanks for your reply, but the problem remains after I try it. This is very strange.

tmurray
December 16, 2008, 5:44am
#4
You have a race condition between multiple blocks (or you aren't handling multiple blocks correctly at all).

Right. Are you using shared memory? (That’s where such problems occur)

Thanks for your reply. I still have this problem, and it is driving me crazy!!!

This is my CUDA code. Thanks for your advice.

[codebox]__global__ void test(float *dforce, float *dvel, float *da, int *difix, float *dmass,
                     float *fe, int *dX, float *dbigf, float *ddisp, float *ddelt,
                     float *lentemp, float *strain, float *Pstress, float area)
{
    int i;
    i = blockIdx.x*blockDim.x + threadIdx.x;
    float E = 30E6;
    float densityo = 0.000724;
    if(i < NODNUM)
    {
        ddisp[i] = 0.0E0;
    }
    __syncthreads();
    int k = 0;
    while(k < STEP)
    {
        if(i < NODNUM)
            dbigf[i] = 0;
        __syncthreads();
        if(i < NODNUM)
        {
            lentemp[i] = 0.0E0;
            strain[i] = 0.0E0;
            Pstress[i] = 0.0E0;
        }
        __syncthreads();
        if(i < NODNUM-1)
        {
            lentemp[i] = dX[i+1] + ddisp[i+1] - dX[i] - ddisp[i];
            ddelt[i] = 0.7*lentemp[i]/sqrt(E/densityo);
            strain[i] = (ddisp[i+1] - ddisp[i])/lentemp[i];
            __syncthreads();
            Pstress[i] = E*strain[i];
            __syncthreads();
            fe[i] = area*Pstress[i];
            __syncthreads();
        }
        if(i == 0)
        {
            dbigf[0] = -fe[0];
        }
        if(i == 1)
        {
            dbigf[NODNUM-1] = fe[NODNUM-2];
        }
        if(i > 0 && i < NODNUM-1)
            dbigf[i] = fe[i-1] - fe[i];
        __syncthreads();
        da[i] = (dforce[i] - dbigf[i])/dmass[i];
        if(difix[i] == 1) da[i] = 0;
        dvel[i] = dvel[i] + ddelt[NODNUM-2]*da[i];
        ddisp[i] = ddisp[i] + ddelt[NODNUM-2]*dvel[i];
        __syncthreads();
        k++;
    }
}[/codebox]


I think you are using far more __syncthreads() calls than needed. Apart from that: above you have a __syncthreads() that deadlocks. Threads with i >= NODNUM-1 never reach the __syncthreads() calls inside the if(i<NODNUM-1) block, while the other threads do reach them and wait indefinitely.
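
A sketch of one way to restructure that section so every thread in the block reaches the same barriers (this only rearranges the __syncthreads() calls; the variable names are the ones from the kernel above):

```cuda
// Hoist each __syncthreads() out of the divergent branch so that all
// threads in the block arrive at it, whether or not i < NODNUM-1.
if (i < NODNUM-1)
{
    lentemp[i] = dX[i+1] + ddisp[i+1] - dX[i] - ddisp[i];
    ddelt[i]   = 0.7*lentemp[i]/sqrt(E/densityo);
    strain[i]  = (ddisp[i+1] - ddisp[i])/lentemp[i];
}
__syncthreads();   // every thread reaches this barrier

if (i < NODNUM-1)
    Pstress[i] = E*strain[i];
__syncthreads();

if (i < NODNUM-1)
    fe[i] = area*Pstress[i];
__syncthreads();
```

Note also that __syncthreads() only synchronizes threads within a single block; with <<<16,16>>> it does not order reads and writes between different blocks, which is the inter-block race tmurray pointed out.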