Unknown error (out-of-bounds memory access) when adding a local variable inside a kernel

I have the following kernel:

__global__ static void CalcSTLDistance_Kernel(Integer ComputeParticleNumber)
{
	//const Integer TID = CudaGetTargetID();
	const Integer ID  =CudaGetTargetID(); 
	/*if(ID >= ComputeParticleNumber)
	{
		return ;
	}*/
	CDistance NearestDistance;
	Integer NearestID = -1;
	NearestDistance.Magnitude = 1e8;
	NearestDistance.Direction.x = 0;
	NearestDistance.Direction.y = 0;
	NearestDistance.Direction.z = 0;//make_Scalar3(0,0,0);
	//if(c_daOutputParticleID[ID] < -1)
	//{
	//	c_daSTLDistance[ID] = NearestDistance;
	//	c_daSTLID[ID] = NearestID;
	//	return;
	//}

	//Scalar3 TargetPosition = c_daParticlePosition[ID];

	Integer TriangleID;
	
	Integer CIDX, CIDY, CIDZ;
	Integer CID = GetCellID(&CONSTANT_BOUNDINGBOX,&c_daParticlePosition[ID],CIDX, CIDY, CIDZ);
	if(CID >=0 && CID < c_CellNum)
	{
		//Integer Range = 1;
		for(Integer k = -1; k <= 1; ++k)
		{
			for(Integer j = -1; j <= 1; ++j)
			{
				for(Integer i = -1; i <= 1; ++i)
				{
					Integer MCID = GetCellID(&CONSTANT_BOUNDINGBOX,CIDX +i, CIDY + j,CIDZ + k);
					if(MCID < 0 || MCID >= c_CellNum)
					{
						continue;
					}
					unsigned int TriangleNum = c_daCell[MCID].m_TriangleNum;
					Integer *TID;
					TID = new Integer[TriangleNum];
					//TID[1] = c_daCell[MCID].m_TriangleID[1];

					for(unsigned int l = 0; l < TriangleNum; ++l)
					{
						TriangleID = c_daCell[MCID].m_TriangleID[l];
						
						/*if(c_daTrianglesParameters[c_daTriangles[TriangleID].ModelIDNumber].isDrag)
						{
							continue;
						}*/

						if( TriangleID >= 0 && TriangleID < c_TriangleNum && TriangleID != NearestID)// No need to calculate again for the same triangle
						{
							CDistance Distance;
							Distance.Magnitude = CalcDistance(&c_daTriangles[TriangleID], &c_daParticlePosition[ID], &Distance.Direction);
							if(Distance.Magnitude < NearestDistance.Magnitude)
							{
								NearestDistance = Distance;
								NearestID = TriangleID;
							}
						}
					}	
					delete [] TID;
				}
				
			}
		}
	}
	c_daSTLDistance[ID] = NearestDistance;
	c_daSTLID[ID] = NearestID;
}

It runs fine when the line TID[1] = c_daCell[MCID].m_TriangleID[1]; is commented out, but when I uncomment it, it shows:

Invalid global read of size 4
========= at 0x000015d0 in STLDistance
========= by thread (0,0,0) in block (1,0,0)
========= Address 0x200ffb428 is out of bounds

Is there a limit on the local memory available to each thread, or something like that?

There is a limit on the amount of local memory for each thread. You can adjust it upward if need be. Check the CUDA documentation. I am doubtful that is the issue here.

Looking at the code, the likeliest hypothesis is that MCID has a value different from what you expect, causing the out-of-bounds access. I would print all variables feeding into the address computation plus the array’s base pointer to check how the address of 0x200ffb428 is reached, then compare that with the size of the allocation.
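
As a rough sketch of that kind of check, a device-side helper can print the values that feed into the address and report whether the access stays in bounds. CCell here is a hypothetical stand-in for the poster's cell type (its real layout is not shown), and CheckTriangleAccess, CheckKernel and DemoAccessIsValid are names invented for this illustration. Integer is taken to be int, as the memcheck log suggests.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef int Integer; // the memcheck log shows CalcSTLDistance_Kernel(int)

// Hypothetical stand-in for the poster's cell type; the real layout is not shown.
struct CCell
{
	unsigned int m_TriangleNum;
	Integer*     m_TriangleID;
};

// Checks the access Cells[MCID].m_TriangleID[Index] and prints the inputs
// of the address computation whenever the access would be out of bounds.
__device__ bool CheckTriangleAccess(const CCell* Cells, Integer CellNum, Integer MCID, unsigned int Index)
{
	if(MCID < 0 || MCID >= CellNum)
	{
		printf("block %u thread %u: MCID=%d is outside [0,%d)\n", blockIdx.x, threadIdx.x, MCID, CellNum);
		return false;
	}
	const unsigned int Count = Cells[MCID].m_TriangleNum;
	const Integer* Base = Cells[MCID].m_TriangleID;
	if(Base == NULL || Index >= Count)
	{
		printf("block %u thread %u: MCID=%d base=%p count=%u index=%u\n", blockIdx.x, threadIdx.x, MCID, (const void*)Base, Count, Index);
		return false;
	}
	return true;
}

// Single-thread kernel that stores the result of the check in *Ok.
__global__ void CheckKernel(const CCell* Cells, Integer CellNum, Integer MCID, unsigned int Index, int* Ok)
{
	*Ok = CheckTriangleAccess(Cells, CellNum, MCID, Index) ? 1 : 0;
}

// Host-side demo: builds one cell holding TriangleNum IDs and tests one access.
bool DemoAccessIsValid(unsigned int TriangleNum, Integer MCID, unsigned int Index)
{
	Integer* dIDs = NULL;
	CCell* dCell = NULL;
	int* dOk = NULL;
	int Ok = 0;
	cudaMalloc(&dIDs, (TriangleNum ? TriangleNum : 1) * sizeof(Integer));
	cudaMalloc(&dCell, sizeof(CCell));
	cudaMalloc(&dOk, sizeof(int));
	CCell hCell = { TriangleNum, dIDs };
	cudaMemcpy(dCell, &hCell, sizeof(CCell), cudaMemcpyHostToDevice);
	CheckKernel<<<1, 1>>>(dCell, 1, MCID, Index, dOk);
	cudaMemcpy(&Ok, dOk, sizeof(int), cudaMemcpyDeviceToHost); // also waits for the kernel
	cudaFree(dOk);
	cudaFree(dCell);
	cudaFree(dIDs);
	return Ok == 1;
}
```

In the real kernel the same check would go right before the read of c_daCell[MCID].m_TriangleID[1]. The printed base pointer and index can then be compared with the failing address in the memcheck report.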

Even if I put

TID[1] = 3;

it shows

========= Invalid __global__ write of size 4
=========     at 0x00001108 in CalcSTLDistance_Kernel(int)
=========     by thread (109,0,0) in block (2,0,0)
=========     Address 0x53007fa14 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time

This operation may be failing:

Integer *TID;					
TID = new Integer[TriangleNum];

Device-side new (and malloc) allocate out of the device heap. The device heap is limited to 8 MB by default. If you attempt to allocate more than that (collectively, across all outstanding new or malloc operations), some new or malloc operations will fail and return a null pointer. If you then use this null pointer anyway, you will get an invalid pointer dereference, which could lead to the invalid global write you are seeing. (It's not obvious that this issue would lead to the invalid global read shown in your first example, however, so I'm offering this as just a hunch.)

If you suspect this issue, a useful debugging step is to test all pointers returned from new (or malloc) for NULL before using them. If you encounter a NULL pointer, then you have probably exceeded the device heap limit.
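
A minimal sketch of such a test, assuming Integer is int: each thread allocates from the device heap, and a failed allocation is counted instead of being dereferenced. AllocCheckKernel and CountFailedAllocations are names made up for this illustration, not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef int Integer; // assumption, based on the memcheck log

// Each thread allocates Count integers from the device heap. A null result
// is counted in *FailedCount and the thread returns without touching it.
__global__ void AllocCheckKernel(unsigned int Count, int* FailedCount)
{
	Integer* TID = new Integer[Count];
	if(TID == NULL)
	{
		atomicAdd(FailedCount, 1);
		return;
	}
	if(Count >= 2)
	{
		TID[1] = 3; // only valid when at least two elements were requested
	}
	delete [] TID;
}

// Host-side helper: launches the kernel and returns how many threads
// saw a failed allocation (-1 if the result could not be read back).
int CountFailedAllocations(unsigned int Blocks, unsigned int Threads, unsigned int Count)
{
	int* dFailed = NULL;
	int Failed = -1;
	cudaMalloc(&dFailed, sizeof(int));
	cudaMemset(dFailed, 0, sizeof(int));
	AllocCheckKernel<<<Blocks, Threads>>>(Count, dFailed);
	cudaMemcpy(&Failed, dFailed, sizeof(int), cudaMemcpyDeviceToHost); // also waits for the kernel
	cudaFree(dFailed);
	return Failed;
}
```

If the count comes back nonzero for your launch configuration and TriangleNum values, the heap limit is the likely culprit.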

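The device heap limit can be changed with the CUDA runtime function cudaDeviceSetLimit, using the limit cudaLimitMallocHeapSize. It has to be set before launching any kernel that uses device-side new or malloc.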

Refer to the documentation:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
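
For illustration, raising the limit could look roughly like this. SetDeviceHeapSize is a helper name invented for this sketch, and the 64 MB figure in the usage note is an arbitrary example, not a recommendation.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Requests a device heap of Bytes and returns the size the runtime reports
// afterwards, or 0 if either call fails.
size_t SetDeviceHeapSize(size_t Bytes)
{
	if(cudaDeviceSetLimit(cudaLimitMallocHeapSize, Bytes) != cudaSuccess)
	{
		return 0;
	}
	size_t Actual = 0;
	if(cudaDeviceGetLimit(&Actual, cudaLimitMallocHeapSize) != cudaSuccess)
	{
		return 0;
	}
	return Actual;
}
```

Calling SetDeviceHeapSize(64 * 1024 * 1024) once at startup, before the first kernel launch, would request a 64 MB heap. A rough lower bound for the size you need is the number of threads with an outstanding allocation times TriangleNum times sizeof(Integer), plus some allocator overhead.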

Are you sure that TriangleNum is never smaller than 2?