Problem reading GPU global memory inside a loop

I am using a matrix of [980*1660] = 1635100 cells and a float array of [1635100] elements, both in GPU global memory.

long newIndex = 0;
float neighbourMinVal = 1000000;
float currNeigbourValue;
// -size_X (size_X = 980) moves up one row; -1 moves left, to the top-left corner of the window (8 neighbours)
long topLeft = currThreadID - size_X - 1;
for (int k = 0; k < 9; k += 1)
{
    newIndex = getNewIndex(topLeft, k, size_X);
    if (!(newIndex == currThreadID || newIndex < 0 || newIndex >= size_Mat))
    {
        currNeigbourValue = planchonMatrix[newIndex];
        if (currNeigbourValue < neighbourMinVal)
        {
            neighbourMinVal = currNeigbourValue;
        }
}

The question is:
neighbourMinVal does not hold the correct minimum after the loop; most of the time it is still 1000000. But if I put a printf("neighbourMinVal") or anything else in the loop to print the value at each iteration, it gives me the correct minimum.
I want to know why this happens and how I can solve it.

Is this device code or host code?
And are you running a debug build or a release build when you get these results?

I can well understand why you would get this result if the code is device code run as a release build:

a) neighbourMinVal is a local variable, and b) it is updated within a loop.

The optimizer may very well consider only the last iteration of the loop then, as there seems to be little dependency between iterations and no intermediate store to a ‘fixed’ location such as global or shared memory.

Either that, or you may have a race, which is difficult to determine because you only posted half of the if section/for loop.

This is kernel code, running in debug mode.

After this for loop, every thread stores its corresponding minimum value (neighbourMinVal).

long newIndex = 0;
float neighbourMinVal = 1000000;
float currNeigbourValue;
// -size_X (size_X = 980) moves up one row; -1 moves left, to the top-left corner of the window (8 neighbours)
long topLeft = currThreadID - size_X - 1;
for (int k = 0; k < 9; k += 1)
{
    newIndex = getNewIndex(topLeft, k, size_X);
    if (!(newIndex == currThreadID || newIndex < 0 || newIndex >= size_Mat))
    {
        currNeigbourValue = planchonMatrix[newIndex];
        if (currNeigbourValue < neighbourMinVal)
        {
            neighbourMinVal = currNeigbourValue;
        }
        Array[ThreadID] = neighbourMinVal;
}

Array is in global memory.

Here is the full code of the kernel:

float neighbourMinVal = 1000000;
// -size_X moves up one row; -1 moves left, to the top-left corner of the window
long topLeft = currThreadID - size_X - 1;
long newIndex = 0;
float currNeigbourValue;
for (int k = 0; k < 9; k += 1)
{
    newIndex = getNewIndex(topLeft, k, size_X);  // global index of the cell visited in iteration k
    if (!(newIndex == currThreadID || newIndex < 0 || newIndex >= size_Mat))
    {
        currNeigbourValue = Matrix[newIndex];
        if (currNeigbourValue < neighbourMinVal)
            neighbourMinVal = currNeigbourValue;
    }
    if (neighbourMinVal < Matrix[currThreadID])
        Matrix[currThreadID] = currNeigbourValue;
}
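
Two things stand out in the loop above: Matrix is read by neighbouring threads while other threads write to it, and the value stored is currNeigbourValue rather than neighbourMinVal. The following is only a sketch of one way to avoid both, assuming the intent is "each cell becomes the minimum of itself and its neighbours". The body of getNewIndex, the kernel name neighbourMin, the separate in/out buffers and the helper runNeighbourMin are all assumptions made for illustration; none of them appear in the posted code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for getNewIndex (its body was not posted): assumes k
// walks a 3x3 window row by row, starting at the window's top-left cell.
__host__ __device__ long getNewIndex(long topLeft, int k, long size_X)
{
    return topLeft + (long)(k / 3) * size_X + (k % 3);
}

// Reads only from `in` and writes only to `out`, so no thread can read a
// cell that another thread has already overwritten.
__global__ void neighbourMin(const float *in, float *out, long size_X, long size_Mat)
{
    long currThreadID = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (currThreadID >= size_Mat) return;

    float neighbourMinVal = 1000000.0f;
    long topLeft = currThreadID - size_X - 1;
    for (int k = 0; k < 9; ++k) {
        long newIndex = getNewIndex(topLeft, k, size_X);
        // Same bounds test as the posted code; like the original, it does
        // not detect wrap-around at the left and right edges of a row.
        if (newIndex == currThreadID || newIndex < 0 || newIndex >= size_Mat) continue;
        neighbourMinVal = fminf(neighbourMinVal, in[newIndex]);
    }
    // Store the running minimum itself, once, after the loop has finished.
    out[currThreadID] = fminf(in[currThreadID], neighbourMinVal);
}

// Hypothetical host wrapper: copies the data in, runs the kernel, copies back.
// Error checking is omitted here to keep the sketch short.
std::vector<float> runNeighbourMin(const std::vector<float> &host, long size_X)
{
    long size_Mat = (long)host.size();
    size_t bytes = host.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (int)((size_Mat + threads - 1) / threads);
    neighbourMin<<<blocks, threads>>>(dIn, dOut, size_X, size_Mat);

    std::vector<float> result(host.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

int main()
{
    // 4x4 toy matrix: every cell is 9 except one low cell at index 5.
    std::vector<float> m(16, 9.0f);
    m[5] = 1.0f;
    std::vector<float> r = runNeighbourMin(m, 4);
    for (size_t i = 0; i < r.size(); ++i)
        printf("%g%c", r[i], (i % 4 == 3) ? '\n' : ' ');
    return 0;
}
```

If the update has to be repeated until nothing changes, the two buffers can be swapped between launches instead of updating Matrix in place.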

When seeking help in debugging one’s code, it is always a good idea to post a complete, compilable and runnable example. The code posted cannot be the complete kernel. For example, there is no __global__ function shown, and currThreadID and size_Mat do not seem to be defined anywhere. Your problem may also be rooted in the host portion of your code (e.g. failed memory allocation, invalid launch configuration), which is not shown.
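
To rule out the host-side causes mentioned above, every runtime call and the kernel launch itself can be checked. This is a minimal sketch, not the poster's code: checkCuda and fillKernel are hypothetical names, and the element count is simply the one quoted in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): reports a failed runtime call
// and returns false so the caller can stop.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Placeholder kernel standing in for the real one.
__global__ void fillKernel(float *data, long n, float value)
{
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    const long size_Mat = 1635100;  // element count quoted in the question
    float *dMatrix = nullptr;
    if (!checkCuda(cudaMalloc(&dMatrix, size_Mat * sizeof(float)), "cudaMalloc")) return 1;

    const int threads = 256;
    const int blocks = (int)((size_Mat + threads - 1) / threads);
    fillKernel<<<blocks, threads>>>(dMatrix, size_Mat, 1000000.0f);

    // cudaGetLastError reports an invalid launch configuration;
    // cudaDeviceSynchronize reports errors raised while the kernel runs.
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;
    if (!checkCuda(cudaDeviceSynchronize(), "kernel execution")) return 1;

    float first = 0.0f;
    if (!checkCuda(cudaMemcpy(&first, dMatrix, sizeof(float), cudaMemcpyDeviceToHost),
                   "cudaMemcpy")) return 1;
    printf("first element: %f\n", first);

    checkCuda(cudaFree(dMatrix), "cudaFree");
    return 0;
}
```

A kernel that silently never ran leaves the output buffer untouched, which can look exactly like a result stuck at its initial value.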

currThreadID == ? That is, what do you assign to currThreadID, and how do you calculate it?

What kernel dimensions are you using?

And size_X equals…?
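
For reference, here is a sketch of how these three values commonly fit together in a one-dimensional launch. It assumes a row-major matrix with size_X cells per row, as the topLeft computation in the posted code suggests, and 256 threads per block; blocksFor, rowOf, colOf and columnKernel are hypothetical names used only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements.
__host__ __device__ inline long blocksFor(long n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Row and column of a flat index, assuming size_X cells per row.
__host__ __device__ inline long rowOf(long index, long size_X) { return index / size_X; }
__host__ __device__ inline long colOf(long index, long size_X) { return index % size_X; }

__global__ void columnKernel(int *colOut, long size_X, long size_Mat)
{
    // One thread per cell; the global thread index doubles as the cell index.
    long currThreadID = (long)blockIdx.x * blockDim.x + threadIdx.x;
    // The last block usually has spare threads, so guard against running off the end.
    if (currThreadID >= size_Mat) return;
    colOut[currThreadID] = (int)colOf(currThreadID, size_X);
}

int main()
{
    const long size_X = 980;        // row width taken from the comment in the question
    const long size_Mat = 1635100;  // element count quoted in the question
    const int threads = 256;        // assumed block size
    const long blocks = blocksFor(size_Mat, threads);
    printf("launching %ld blocks of %d threads\n", blocks, threads);

    int *dCol = nullptr;
    if (cudaMalloc(&dCol, size_Mat * sizeof(int)) != cudaSuccess) return 1;
    columnKernel<<<(unsigned int)blocks, threads>>>(dCol, size_X, size_Mat);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    int col = -1;
    cudaMemcpy(&col, dCol + 981, sizeof(int), cudaMemcpyDeviceToHost);
    printf("cell 981 is in row %ld, column %d\n", rowOf(981, size_X), col);
    cudaFree(dCol);
    return 0;
}
```

If currThreadID is computed from threadIdx.x alone, or the grid is too small to cover size_Mat cells, most cells are never processed and keep their initial values.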