Memory out-of-bounds error reported by cuda-memcheck

I have a structure defined as

struct CDistance
{
    Scalar  Magnitude;
    Scalar3 Direction;   // Scalar3, since the kernel assigns make_Scalar3(0,0,0) to it
};

and have the kernel as

__global__ static void CalcSTLDistance_Kernel(Integer ComputeParticleNumber)
{
	const Integer TID = CudaGetTargetID();
	const Integer ID  = TID;
	if(ID >= ComputeParticleNumber)
	{
		return ;
	}
	CDistance NearestDistance;
	Integer NearestID = -1;
	NearestDistance.Magnitude = 1e8;
	NearestDistance.Direction = make_Scalar3(0,0,0);
	if(c_daOutputParticleID[ID] < -1)
	{
		c_daSTLDistance[ID] = NearestDistance;
		c_daSTLID[ID] = NearestID;
		return;
	}

	Scalar3 TargetPosition = c_daParticlePosition[ID];

	Integer TriangleID;		
	Integer CIDX, CIDY, CIDZ;
	Integer CID = GetCellID(&CONSTANT_BOUNDINGBOX,&TargetPosition,CIDX, CIDY, CIDZ);
	if(CID >=0 && CID < c_CellNum)
	{
		Integer Range = 1;
		for(Integer k = -Range; k <= Range; ++k)
		{
			for(Integer j = -Range; j <= Range; ++j)
			{
				for(Integer i = -Range; i <= Range; ++i)
				{
					Integer MCID = GetCellID(&CONSTANT_BOUNDINGBOX,CIDX +i, CIDY + j,CIDZ + k);
					if(MCID < 0 || MCID >= c_CellNum)
					{
						continue;
					}
					unsigned int TriangleNum = c_daCell[MCID].m_TriangleNum;
					for(unsigned int l = 0; l < TriangleNum; ++l)
					{
						TriangleID = c_daCell[MCID].m_TriangleID[l];
						if(c_daTrianglesParameters[c_daTriangles[TriangleID].ModelIDNumber].isDrag)
						{
							continue;
						}

						if( TriangleID >= 0 && TriangleID < c_TriangleNum && TriangleID != NearestID)// No need to calculate again for the same triangle
						{
							CDistance Distance ;
							Distance.Magnitude = CalcDistance(&c_daTriangles[TriangleID], &TargetPosition, &Distance.Direction);
							if(Distance.Magnitude < NearestDistance.Magnitude)
							{
								NearestDistance = Distance;
								NearestID = TriangleID;
							}
						}
					}	
				}
			}
		}
	}
	c_daSTLDistance[ID] = NearestDistance;
	c_daSTLID[ID] = NearestID;
}

The problem is that cuda-memcheck reports a memory out-of-bounds error in this kernel. While debugging into the kernel, I further found that the actual error lies within

if( TriangleID >= 0 && TriangleID < c_TriangleNum && TriangleID != NearestID)// No need to calculate again for the same triangle
{
    CDistance Distance;
    Distance.Magnitude = CalcDistance(&c_daTriangles[TriangleID], &TargetPosition, &Distance.Direction);
    if(Distance.Magnitude < NearestDistance.Magnitude)
    {
        NearestDistance = Distance;
        NearestID = TriangleID;
    }
}

Within this code block, the variables defined above it cannot be accessed. For example, the NearestDistance object is declared and assigned above the block, but its members cannot be read inside it; only variables and objects declared within the curly braces are accessible.

Example:

if( TriangleID >= 0 && TriangleID < c_TriangleNum && TriangleID != NearestID)// No need to calculate again for the same triangle
{
    CDistance Distance, NearestDistance; // NearestDistance declared inside the block (shadows the outer one)

    NearestDistance.Magnitude = 2;
    Distance.Magnitude = 1;
    if(Distance.Magnitude < NearestDistance.Magnitude)
    {
        NearestDistance = Distance;
    }
}

runs fine without a memory error,

but the code

if( TriangleID >= 0 && TriangleID < c_TriangleNum && TriangleID != NearestID)// No need to calculate again for the same triangle
{
    CDistance Distance; // NearestDistance is NOT declared here; the outer one is used

    NearestDistance.Magnitude = 2;
    Distance.Magnitude = 1;// CalcDistance(&c_daTriangles[TriangleID], &TargetPosition, &Distance.Direction);
    if(Distance.Magnitude < NearestDistance.Magnitude)
    {
        NearestDistance = Distance;
        NearestDistance.Magnitude = TriangleID;
        //NearestID = TriangleID;
    }
}

shows the error, because here NearestDistance is declared outside the block rather than inside it.

What could be the cause of this kind of unexpected error?

What kernel grid and block dimensions are you using?

I am using the grid and block dimensions as
CalcSTLDistance_Kernel<<<Total/256,256,0,Stream>>>(ComputeParticleNum);
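
(As an aside: if Total is not a multiple of 256, the integer division Total/256 truncates, so the last partial group of particles gets no thread at all. A common pattern, sketched here on the assumption that Total counts the work items, is to round the grid size up and let the kernel's early-return guard on ComputeParticleNumber discard the excess threads:)

// Sketch: round the grid size up so every particle is covered.
Integer BlockSize = 256;
Integer GridSize  = (Total + BlockSize - 1) / BlockSize;   // ceiling division
CalcSTLDistance_Kernel<<<GridSize, BlockSize, 0, Stream>>>(ComputeParticleNum);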

I think the kernel code would mostly work for a thread block size of 1, and not for any block size greater than that; you can actually test this: see if it works with a block size of 1.
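
For instance, the test launch with a block size of 1 would be:

CalcSTLDistance_Kernel<<<Total, 1, 0, Stream>>>(ComputeParticleNum);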

At first glance, this looks like a serial implementation (CPU code), not a parallel implementation (GPU code).
This is based on the fact that there is virtually no synchronization present within the kernel, and on the variable declarations (most, if not all, of the variables/objects are ordinary local variables).
If this is coherent with your design, then fine; otherwise it would explain exactly the kind of behavior you are seeing.
Make sure you know which threads you want to work on what, and which variables they access, at any particular time.

I have used

Status = this->Synchronize();
CCT_ERROR_CHECK(Status);

after this kernel is called, as in

Status = this->CalculateSTLDistance();
CCT_ERROR_CHECK(Status);
std::cout << "**RDDev: " << m_DeviceSelector[0] << " :Step: " << i << " :M: " << m_MaxParticleNum << " :P: " << m_ParticleNum << " :T: " << m_TriangleNum << " : CALC STL DIST\n";

if(m_DragTriangleNum > 0)
{
    Status = this->CalcDragEffect();
    CCT_ERROR_CHECK(Status);
    std::cout << "**RDDev: " << m_DeviceSelector[0] << " :Step: " << i << " :M: " << m_MaxParticleNum << " :P: " << m_ParticleNum << " :T: " << m_TriangleNum << " : CALC Drag\n";
}
Status = this->Synchronize();
CCT_ERROR_CHECK(Status);

and during execution it shows an unknown error at this Synchronize():

CCTStatusType StreamSynchronize()
{
    CCTStatusType Status = CCT_NOERR;
    Status = CudaSafeCall(cudaSetDevice(m_DeviceID));
    CCT_ERROR_CHECK(Status);
    Status = CudaSafeCall(cudaStreamSynchronize(m_CudaStream));
    CCT_ERROR_CHECK(Status);
    return Status;
}
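
(For context: a kernel launch is asynchronous, so a fault such as an out-of-bounds access inside the kernel is typically reported only by the next synchronizing call, which is why cudaStreamSynchronize is where the "unknown error" surfaces. Below is a minimal sketch of what a CudaSafeCall-style wrapper usually looks like; the real CudaSafeCall, CCT_ERROR_CHECK, and the CCT_CUDAERR code are this project's own helpers, so their exact definitions here are assumptions:)

// Hypothetical sketch of a CudaSafeCall-style wrapper.
CCTStatusType CudaSafeCall(cudaError_t Err)
{
    if(Err != cudaSuccess)
    {
        // cudaGetErrorString turns the error code into readable text,
        // e.g. "unspecified launch failure" for a kernel memory fault.
        printf("CUDA error: %s\n", cudaGetErrorString(Err));
        return CCT_CUDAERR; // hypothetical error code
    }
    return CCT_NOERR;
}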

I was referring mostly to __syncthreads().
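
(For illustration, a minimal sketch of __syncthreads(): it is needed when the threads of a block communicate through shared memory, which the kernel above does not do. The Scalar/Integer typedefs are reused from the code above, and a block size of at most 256 is assumed:)

__global__ void SharedTileExample(const Scalar* In, Scalar* Out, Integer N)
{
    __shared__ Scalar Tile[256];                 // visible to the whole thread block
    Integer id = blockIdx.x * blockDim.x + threadIdx.x;
    Tile[threadIdx.x] = (id < N) ? In[id] : 0;   // every thread writes its own slot
    __syncthreads();                             // wait until the whole tile is written
    if(id < N)                                   // now safe to read other threads' entries
        Out[id] = Tile[(threadIdx.x + 1) % blockDim.x];
}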

Within your kernel, CalcSTLDistance_Kernel, you have a number of function calls.
You also set up variables to be passed to these functions.

At this point, each thread within a thread block executing this kernel would individually and serially do the above (set up variables to be passed to functions, then execute the kernel's functions), rather than collectively.
This in itself is contrary to the premises of a parallel implementation, to a large extent; hence my concerns.
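
(To make the point concrete, a minimal sketch: every automatic variable declared in a kernel is private to the thread that declares it, so each thread repeats the same set-up work independently:)

__global__ void LocalsArePerThread(Integer N)
{
    // Each thread gets its own private copies of these variables
    // (registers/local memory); nothing here is shared between threads.
    Integer id = blockIdx.x * blockDim.x + threadIdx.x;
    if(id >= N)
        return;
    CDistance NearestDistance;         // one copy per thread
    NearestDistance.Magnitude = 1e8;   // initialized separately by every thread
}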