Different results in emulation versus real runs

I am taking my first steps in CUDA programming, but even with the manual and the many posts on this forum, something is still eluding me.

I am running four nested loops (z=74, x=116, s=64, r=64). As a first test, I wanted to split the work over several blocks and threads. I assigned the x-loop to blockIdx.x, the s-loop to blockIdx.y, and the r-loop to threadIdx.x.

Host code:

dim3 grid(116, 64, 1);    // working: dim3 grid(116, 64, 1);
dim3 threads(64, 1, 1);   // working: dim3 threads(1, 1, 1);

calcTdomain<<< grid, threads >>>(dfactor, dP_data, droi);


Device code:

(dP_data is an array with data, droi is the array in which results are written, dfactor is just a lookup table)

__global__ void calcTdomain(const float* dfactor, const float* dP_data, float* droi)
{
	const int M = 116;
	const int N = 64;

	unsigned int izstart = 0;
	unsigned int izend   = 74;
	unsigned int ixstart = blockIdx.x;
	unsigned int ixend   = ixstart + 1;
	unsigned int isstart = blockIdx.y;
	unsigned int isend   = isstart + 1;
	unsigned int irstart = threadIdx.x;
	unsigned int irend   = irstart + 1;

	for (unsigned int iz = izstart; iz < izend; ++iz)
		for (unsigned int ix = ixstart; ix < ixend; ++ix)
			for (unsigned int is = isstart; is < isend; ++is)
				for (unsigned int ir = irstart; ir < irend; ++ir)
					droi[iz*M + ix] = droi[iz*M + ix]
						+ dP_data[int(dfactor[iz*M*N*N*2 + ix*N*N*2 + is*N*2 + ir*2]) * N*N + is*N + ir]
						* dfactor[iz*M*N*N*2 + ix*N*N*2 + is*N*2 + ir*2 + 1];
}



In emulation mode, the results written to droi are correct, and I verified with printf statements that the correct z, x, s, r values are used. On the graphics card itself, however, the results are wrong; it looks as if only part of each sum actually gets written out.

I am using a Geforce 8800 Ultra with the CUDA 1.0 SDK.

Any help, tips or tricks would be very much appreciated.

The summation was done sequentially by the emulator, but this is not the case on the graphics card itself.
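If the problem is that all 64x64 (blockIdx.y, threadIdx.x) combinations read, modify, and write the same droi[iz*M + ix] element at the same time, then I suppose a race-free restructuring would give each output element to a single thread and keep the s and r loops inside that thread. Here is a rough, untested sketch of what I mean (kernel name and launch configuration are my own guess, indexing as in my kernel above):

```cuda
// Sketch: one writer per output element. blockIdx.x selects ix and
// threadIdx.x strides over iz, so no two threads ever touch the same
// droi entry and the per-element sum is accumulated sequentially.
__global__ void calcTdomainSerialSum(const float* dfactor,
                                     const float* dP_data,
                                     float* droi)
{
    const int M = 116;   // size of the x dimension
    const int N = 64;    // size of the s and r dimensions
    const int Z = 74;    // size of the z dimension

    unsigned int ix = blockIdx.x;                 // one block per x index
    for (unsigned int iz = threadIdx.x; iz < Z; iz += blockDim.x)
    {
        float sum = 0.0f;                         // private accumulator
        for (unsigned int is = 0; is < N; ++is)
            for (unsigned int ir = 0; ir < N; ++ir)
            {
                unsigned int f = iz*M*N*N*2 + ix*N*N*2 + is*N*2 + ir*2;
                sum += dP_data[int(dfactor[f]) * N*N + is*N + ir]
                     * dfactor[f + 1];
            }
        droi[iz*M + ix] = sum;                    // single writer per element
    }
}

// launch: calcTdomainSerialSum<<< 116, 74 >>>(dfactor, dP_data, droi);
```

Would something along these lines explain why the emulator (which runs the threads one after another) gets it right while the card does not?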