No typo. I tried cudaGetLastError and found the problem.

I got: too many resources requested for launch.

I know it happens when I fetch values from textures inside two nested loops, but I have no idea how to fix it…

Below is the snippet from the kernel where the problem arises. Num = {9, 9, 2} and SearchSize = {9, 9, 3}. texImage1 and texImage2 are each 512*512*65*sizeof(unsigned short int), about 65 MB for the two together. The grid size is 16*16, the block size is 4*4*32, and I'm using a Quadro FX 3800. I also have a structure array that occupies 1.5 MB of global memory, a float array that occupies 0.5 MB of global memory, and some local variables that occupy only 180 B.

```
for(int k = 0; k < Num.z; k++)
    for(int j = 0; j < Num.y; j++)
        for(int i = 0; i < Num.x; i++)
        {
            int Total = 0;
            // Sum of absolute differences over the search window
            for(int iZ = 0; iZ < SearchSize.z; iZ++)
                for(int iY = 0; iY < SearchSize.y; iY++)
                    for(int iX = 0; iX < SearchSize.x; iX++)
                    {
                        PixCoord.x = Start.x + iX;
                        PixCoord.y = Start.y + iY;
                        PixCoord.z = Start.z + iZ;
                        prevalue = tex3D(texImage1, (float)PixCoord.x, (float)PixCoord.y, (float)PixCoord.z);
                        postvalue = tex3D(texImage2, (float)(TemStart.x + iX), (float)(TemStart.y + iY), (float)(TemStart.z + iZ));
                        Total = Total + abs(postvalue - prevalue);
                    }
            // Keep the smallest measure seen so far
            if((float)Total < MinMeasure[Idx])
                MinMeasure[Idx] = (float)Total;
        }
```
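For reference, here is the arithmetic behind the launch configuration and loop counts above, as a small host-only sketch. The names (`Dim3`, `threadsPerBlock`, and so on) are made up for illustration. The constant `kAssumedRegsPerBlock` is an assumption on my part for a compute capability 1.x board of this generation, not something I measured; the real figure would be `regsPerBlock` as reported by `cudaGetDeviceProperties`.

```cpp
#include <cstdio>

// Values quoted in the post.
struct Dim3 { int x, y, z; };
constexpr Dim3 kNum{9, 9, 2};
constexpr Dim3 kSearchSize{9, 9, 3};
constexpr Dim3 kBlock{4, 4, 32};
constexpr Dim3 kGrid{16, 16, 1};

// ASSUMPTION (not from the post): registers available to one block.
// Check regsPerBlock from cudaGetDeviceProperties for the real value.
constexpr int kAssumedRegsPerBlock = 16384;

constexpr int volume(Dim3 d) { return d.x * d.y * d.z; }

// 4 * 4 * 32 threads in every block.
constexpr int threadsPerBlock() { return volume(kBlock); }

// 16 * 16 blocks in the grid.
constexpr int blocksInGrid() { return volume(kGrid); }

// Outer loops (Num) times inner loops (SearchSize), per thread.
constexpr int innerIterationsPerThread() {
    return volume(kNum) * volume(kSearchSize);
}

// Two tex3D fetches per inner iteration.
constexpr int textureFetchesPerThread() {
    return 2 * innerIterationsPerThread();
}

// Registers each thread could use before the block no longer fits,
// under the assumed register file size above.
constexpr int assumedRegisterBudgetPerThread() {
    return kAssumedRegsPerBlock / threadsPerBlock();
}

// Prints the figures; call it from any host code.
void printLaunchSummary() {
    std::printf("threads per block:          %d\n", threadsPerBlock());
    std::printf("blocks in grid:             %d\n", blocksInGrid());
    std::printf("inner iterations / thread:  %d\n", innerIterationsPerThread());
    std::printf("texture fetches / thread:   %d\n", textureFetchesPerThread());
    std::printf("assumed register budget:    %d per thread\n",
                assumedRegisterBudgetPerThread());
}
```

With 512 threads per block, the per-thread register budget under that assumption is small, so it may be worth comparing it with the register count the compiler reports for the kernel (for example via `nvcc --ptxas-options=-v`).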

The two textures take up most of the memory, but that should be fine: the board has 1 GB of global memory, and the textures and variables together come nowhere near filling it. When I fetched from the two textures and wrote the values to two 256 kB arrays without any loop, the kernel ran fine. So I am wondering whether fetching from textures inside a loop consumes more resources depending on the number of iterations?
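To back up the memory estimate, here is a minimal host-only sketch that adds up the sizes I quoted. The function and constant names are made up for illustration, and I am assuming that "MB" and "GB" mean binary units (1 MB = 1024 * 1024 bytes) and that `unsigned short int` is 2 bytes.

```cpp
#include <cstdint>
#include <cstdio>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// ASSUMPTION: sizes from the post interpreted in binary units.
constexpr std::uint64_t kStructArrayBytes = 3ull * kMiB / 2ull; // 1.5 MB
constexpr std::uint64_t kFloatArrayBytes  = kMiB / 2ull;        // 0.5 MB
constexpr std::uint64_t kBoardBytes       = 1024ull * kMiB;     // 1 GB

// One 512 x 512 x 65 volume of 16-bit samples.
constexpr std::uint64_t textureBytes() {
    return 512ull * 512ull * 65ull * sizeof(std::uint16_t);
}

// Both textures plus the two global arrays.
constexpr std::uint64_t totalBytes() {
    return 2ull * textureBytes() + kStructArrayBytes + kFloatArrayBytes;
}

// Prints the footprint; call it from any host code.
void printMemorySummary() {
    std::printf("one texture:  %.1f MB\n",
                static_cast<double>(textureBytes()) / kMiB);
    std::printf("total in use: %.1f MB of %.0f MB\n",
                static_cast<double>(totalBytes()) / kMiB,
                static_cast<double>(kBoardBytes) / kMiB);
}
```

Under those assumptions the total is 67 MB out of 1024 MB, which is why I doubt that global memory is the resource being exhausted.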

Thanks,

Yuping