I am taking my first steps in CUDA programming, but even with the manual and the many posts on this forum, something seems to be eluding me.

I am running a nested loop of four levels (z=74,x=116,s=64,r=64). As a first test, I wanted to split it over several blocks and threads. I assigned the x-loop to blockIdx.x, the s-loop to blockIdx.y and the r-loop to threadIdx.x.

Host code:

```
dim3 grid(116,64,1); //working: dim3 grid(116,64,1);
dim3 threads(64,1,1); //working: dim3 threads(1,1,1);
calcTdomain<<< grid, threads >>>(dfactor,dP_data, droi);
cudaThreadSynchronize();
```

Device code:

(dP_data is the array with the input data, droi is the array to which the results are written, and dfactor is just a lookup table.)

```
__global__ void calcTdomain(const float* dfactor, const float* dP_data, float* droi)
{
    const int M = 116;
    const int N = 64;
    unsigned int izstart = 0;
    unsigned int izend   = 74;
    unsigned int ixstart = blockIdx.x;
    unsigned int ixend   = ixstart+1;
    unsigned int isstart = blockIdx.y;
    unsigned int isend   = isstart+1;
    unsigned int irstart = threadIdx.x;
    unsigned int irend   = irstart+1;
    for(unsigned int iz=izstart; iz < izend; ++iz)
    {
        for(unsigned int ix=ixstart; ix < ixend; ++ix)
        {
            for(unsigned int is=isstart; is < isend; ++is)
            {
                for(unsigned int ir=irstart; ir < irend; ++ir)
                {
                    droi[iz*M+ix] = droi[iz*M+ix] +
                        dP_data[ int(dfactor[iz*M*N*N*2 + ix*N*N*2 + is*N*2 + ir*2])*N*N + is*N + ir ]
                        * dfactor[iz*M*N*N*2 + ix*N*N*2 + is*N*2 + ir*2 + 1];
                }//ir
            }//is
        }//ix
    }//iz
}
```

In emulation mode, the results written to droi are correct, and I checked with printf statements that the correct z, x, s and r values are used. On the graphics card itself, however, the results are wrong. It seems as if only part of the contributions ends up in droi.
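One possible explanation worth ruling out: with this launch configuration, every (s, r) pair for a given x does a read-modify-write on the same element droi[iz*M+ix]. Emulation mode runs the threads one after another, so the sum comes out right, but on the device these updates can overlap and overwrite each other. Below is a minimal sketch of one way to avoid that, not a verified fix. It assumes one block per x and one thread per r, with each thread looping over s and the block summing its partial results in shared memory. The kernel name calcTdomainReduced, the macro NZ and the launch configuration are hypothetical and not part of the code above.

```cuda
#define M  116  // x extent, as in the post
#define N  64   // s and r extent, as in the post
#define NZ 74   // z extent, as in the post (hypothetical macro name)

// Hypothetical variant of calcTdomain.
// Assumed launch: calcTdomainReduced<<< dim3(M,1,1), dim3(N,1,1) >>>(dfactor, dP_data, droi);
__global__ void calcTdomainReduced(const float* dfactor, const float* dP_data, float* droi)
{
    __shared__ float partial[N];           // one partial sum per thread (per r)

    const unsigned int ix = blockIdx.x;    // one block per x
    const unsigned int ir = threadIdx.x;   // one thread per r

    for (unsigned int iz = 0; iz < NZ; ++iz)
    {
        // Each thread sums its own contributions over s in a register.
        float sum = 0.0f;
        for (unsigned int is = 0; is < N; ++is)
        {
            const unsigned int f = iz*M*N*N*2 + ix*N*N*2 + is*N*2 + ir*2;
            sum += dP_data[int(dfactor[f])*N*N + is*N + ir] * dfactor[f + 1];
        }
        partial[ir] = sum;
        __syncthreads();

        // Tree reduction over the N threads of the block (N is a power of two).
        for (unsigned int stride = N/2; stride > 0; stride >>= 1)
        {
            if (ir < stride)
                partial[ir] += partial[ir + stride];
            __syncthreads();
        }

        // Only one thread in the whole grid touches droi[iz*M+ix].
        if (ir == 0)
            droi[iz*M + ix] += partial[0];
    }
}
```

The terms are added in a different order than in the sequential emulation run, so small floating-point differences between the two results would be expected even if this works.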

I am using a GeForce 8800 Ultra with the CUDA 1.0 SDK.

Any help, tips or tricks would be very much appreciated.