Device pointer arithmetic and precision problem with the size of the integer values inside the poin

My code runs fine in device emulation mode, but it hangs when I run it on the real device. The two statements that set the memory in the code below should be equivalent, yet they are not when the value of the addr pointer is above a certain limit. I added an if statement to check whether addr actually differs from &(data_mem[(ax+x) + 216 * ((ay+y) + 288 * (az+z))]). The host sees no difference, while the device sees one for large values of addr. My guess is that some of the pointer arithmetic is performed with 24 bits rather than 32. Even if I use a long integer and compute the index rather than the address, so that I access the data as data_mem[addr], the problem is the same. Could you please describe the precision used for pointer arithmetic, and tell me whether there is a well-performing solution to my problem?

Thanks in advance for any help you can provide.

int az = (data_index & 15) * 18;
int ay = ((data_index >> 4) & 15) * 18;
int ax = ((data_index >> 8) & 15) * 18;

float4* addr = &(data_mem[(ax+wx) + 216 * ((ay+ewy) + 288 * (az+ewz))]);

for (int x = wx; x < ewx; x++)
{
	addr -= (ewy - wy) * 216;
	for (int y = wy; y < ewy; y++)
	{
		addr -= (ewz - wz) * 62208;
		for (int z = wz; z < ewz; z++)
		{
			// set the value in the memory
			(*addr) = make_float4(tps.y,tps.x,tps.z,0.0);
			// should be equivalent to => data_mem[(ax+x) + 216 * ((ay+y) + 288 * (az+z))] = make_float4(tps.y,tps.x,tps.z,0.0);
			addr += 62208;
		}
		addr += 216;
	}
	addr += 1;
}

In case it helps: wx, wy, wz, ewx, ewy, ewz are in the range [0, 17]. The problem disappears (in both emulation and real mode) when the range is set to [0, 16], thus excluding 17. data_index is in [0, 3071].
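The index arithmetic itself can be checked on the host, independent of the device. The sketch below replays the incremental walk on a plain 64-bit integer offset and compares it with the direct formula at every write. It is host-only C++ that should also compile as CUDA host code. The names direct_index, count_mismatches, NX, NY, STRIDE_Y and STRIDE_Z are made up for illustration, and the window bounds are passed in as parameters. The z extent of data_mem is not stated in the post, so only x and y are bounds-checked.

```cpp
#include <cassert>
#include <cstdint>

// Grid extents taken from the constants in the post (216 and 288).
constexpr std::int64_t NX = 216;
constexpr std::int64_t NY = 288;
constexpr std::int64_t STRIDE_Y = NX;       // 216 elements per step in y
constexpr std::int64_t STRIDE_Z = NX * NY;  // 62208 elements per step in z

// Direct form: element index of (x, y, z), as in
// data_mem[x + 216 * (y + 288 * z)].
std::int64_t direct_index(std::int64_t x, std::int64_t y, std::int64_t z) {
    assert(x >= 0 && x < NX && y >= 0 && y < NY && z >= 0);
    return x + NX * (y + NY * z);
}

// Replays the incremental walk from the post on an integer offset instead of
// a float4 pointer, and counts the writes whose offset disagrees with
// direct_index. Expects data_index in [0, 3071] and window bounds in [0, 18].
long count_mismatches(int data_index,
                      int wx, int wy, int wz,
                      int ewx, int ewy, int ewz) {
    const int az = (data_index & 15) * 18;
    const int ay = ((data_index >> 4) & 15) * 18;
    const int ax = ((data_index >> 8) & 15) * 18;

    // Same starting point as addr in the post: x = wx, y = ewy, z = ewz.
    // This start lies one window past the first write in y and z, so it is
    // computed without the bounds check in direct_index.
    std::int64_t off = (ax + wx) + NX * ((ay + ewy) + NY * (az + ewz));

    long mismatches = 0;
    for (int x = wx; x < ewx; x++) {
        off -= static_cast<std::int64_t>(ewy - wy) * STRIDE_Y;  // rewind y to wy
        for (int y = wy; y < ewy; y++) {
            off -= static_cast<std::int64_t>(ewz - wz) * STRIDE_Z;  // rewind z to wz
            for (int z = wz; z < ewz; z++) {
                if (off != direct_index(ax + x, ay + y, az + z)) {
                    ++mismatches;
                }
                off += STRIDE_Z;  // next z
            }
            off += STRIDE_Y;  // next y
        }
        off += 1;  // next x
    }
    return mismatches;
}
```

With 64-bit host arithmetic the two forms agree for every data_index in [0, 3071] and every 6-wide window starting at 0, 6 or 12, which matches the observation that the host sees no difference. For the ranges stated above, the largest element index is 215 + 216 * (287 + 288 * 287) = 17,915,903. That is above 2^24 = 16,777,216 but well within 32 bits. This does not by itself show that 24-bit arithmetic is involved; it only shows that the indices are large enough for such a limit to matter.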

It turned out to be a compiler bug in the arithmetic. It seems that the compiler reads one variable incorrectly as another because of an optimization. I will file a bug report when I get some free time. By the way, this code was used with OpenGL: when you run it in emulation you see an object, and when you run it in real mode the object disappears. In case you want to know, I had no variables shared between threads. This is how some of the variables are initialized:

int wx = floorf(worker_index / 9) * 6;
int wy = floorf((worker_index % 9) / 3) * 6;
int wz = (worker_index % 3) * 6;
int ewx = wx + 6;
int ewy = wy + 6;
int ewz = wz + 6;

where worker_index is in [0, 27].
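A side note on the initialization, as a sketch that assumes worker_index is a non-negative int: worker_index / 9 and (worker_index % 9) / 3 are already integer divisions, so floorf receives a whole number and returns it unchanged. The same window bounds can therefore be computed without any float round trip. The Window struct and the function names below are hypothetical, introduced only for this comparison.

```cpp
#include <cassert>
#include <math.h>

// Window bounds for one worker (hypothetical helper type, not from the post).
struct Window {
    int wx, wy, wz;
    int ewx, ewy, ewz;
};

// The form used in the post: the division happens on ints first, then the
// result is converted to float, passed through floorf, and converted back.
Window window_with_floorf(int worker_index) {
    Window w;
    w.wx = static_cast<int>(floorf(worker_index / 9) * 6);
    w.wy = static_cast<int>(floorf((worker_index % 9) / 3) * 6);
    w.wz = (worker_index % 3) * 6;
    w.ewx = w.wx + 6;
    w.ewy = w.wy + 6;
    w.ewz = w.wz + 6;
    return w;
}

// Integer-only form: the same values with no float conversion.
Window window_integer_only(int worker_index) {
    assert(worker_index >= 0);  // assumption: indices are never negative
    Window w;
    w.wx = (worker_index / 9) * 6;
    w.wy = ((worker_index % 9) / 3) * 6;
    w.wz = (worker_index % 3) * 6;
    w.ewx = w.wx + 6;
    w.ewy = w.wy + 6;
    w.ewz = w.wz + 6;
    return w;
}

// True when both windows have identical bounds.
bool same_window(const Window& a, const Window& b) {
    return a.wx == b.wx && a.wy == b.wy && a.wz == b.wz &&
           a.ewx == b.ewx && a.ewy == b.ewy && a.ewz == b.ewz;
}
```

Calling same_window(window_with_floorf(i), window_integer_only(i)) for each worker_index i in the stated range is a quick way to confirm that the two forms agree. Whether dropping floorf changes what the compiler does with the loop is untested.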

See http://forums.nvidia.com/index.php?showtopic=89307