size limitation of tex1Dfetch


I’m doing a test to see how big the index of tex1Dfetch() could be.

Here is my test kernel:

texture<float, cudaTextureType1D, cudaReadModeElementType> tex1DLimitRef;

extern "C" __global__ void tex1DLimitKernel(float* input, float* output, int length)
{
	for (int i = threadIdx.x; i < length; i += blockDim.x)
	{
		output[i] = tex1Dfetch(tex1DLimitRef, i) + 0.5f;

		//output[i] = input[i] + 0.5f;
	}
}

Before I give the result, I have to describe my test environment:

  1. Windows 7

  2. GTX 260 with 1G video memory

So, first, I set the length to 20480000; it works, and the result passes verification.

Then I set the length to 25600000; the program crashes, the Windows screen blinks several times, and I get a CUDA_ERROR_LAUNCH_TIMEOUT error.

Then I went into the Windows Control Panel and, in the performance settings, chose “best performance” instead of “best appearance”, so all my Windows 7 UI effects were gone. Then I ran it with length=25600000 again; it sometimes works, but most of the time it still crashes.

Then I changed the code to:

		//output[i] = tex1Dfetch(tex1DLimitRef, i) + 0.5f;
		output[i] = input[i] + 0.5f;

Then it works fine, which proves that the device memory access works but tex1Dfetch() does not.
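For completeness, the host side of my test looks roughly like this (a sketch with error checking omitted; the buffer names and launch configuration here are illustrative, and hInput is assumed to be an already-filled host array):

```cuda
int length = 20480000;
size_t bytes = length * sizeof(float);

float *dInput, *dOutput;
cudaMalloc(&dInput, bytes);
cudaMalloc(&dOutput, bytes);
cudaMemcpy(dInput, hInput, bytes, cudaMemcpyHostToDevice);

// Bind the linear device memory to the texture reference
// before launching the kernel.
cudaBindTexture(NULL, tex1DLimitRef, dInput, bytes);

// One block only, since the kernel loop strides by blockDim.x
// starting from threadIdx.x.
tex1DLimitKernel<<<1, 512>>>(dInput, dOutput, length);
cudaDeviceSynchronize();

cudaUnbindTexture(tex1DLimitRef);
```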

Here is the device query information:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.0\C\bin\win32\Release\deviceQuery.exe Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "GeForce GTX 260"

  CUDA Driver Version / Runtime Version          4.0 / 4.0

  CUDA Capability Major/Minor version number:    1.3

  Total amount of global memory:                 879 MBytes (922091520 bytes)

  (27) Multiprocessors x ( 8) CUDA Cores/MP:     216 CUDA Cores

  GPU Clock Speed:                               1.44 GHz

  Memory Clock rate:                             1150.00 Mhz

  Memory Bus Width:                              448-bit

  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)

  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 16384

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Concurrent copy and execution:                 Yes with 1 copy engine(s)

  Run time limit on kernels:                     Yes

  Integrated GPU sharing Host Memory:            No

  Support host page-locked memory mapping:       Yes

  Concurrent kernel execution:                   No

  Alignment requirement for Surfaces:            Yes

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

  Device supports Unified Addressing (UVA):      No

  Device PCI Bus ID / PCI location ID:           2 / 0

  Compute Mode:

     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

So here are my questions:

  1. Why does tex1Dfetch() fail at some large indices while direct device memory access works? Does it use more resources, so that turning off some UI effects makes it sometimes work?

  2. The device query result says “Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)”, but my index is obviously much larger than 8192. Why does it still work? Does this limit only apply to tex1D(), not tex1Dfetch()?

Hope to hear from you guys



Your kernel (sometimes) runs for too long, so the watchdog timer triggers; the watchdog is supposed to keep the computer usable by keeping the user interface responsive in case of a runaway kernel.

Divide the work into multiple, smaller kernel launches. On Windows you also need to call cudaDeviceSynchronize() between kernel launches in order to prevent the driver from batching them, which would undo the effect of splitting the work into smaller packages.
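For example, a sketch of the chunked approach (the names are made up to match your posted kernel; rebinding the texture per chunk keeps the kernel's zero-based fetch index valid):

```cuda
// Split one long-running launch into smaller chunks so each launch
// stays well below the watchdog limit. The chunk size is a guess;
// tune it so each launch takes a fraction of a second.
const int chunk = 1 << 20;

for (int offset = 0; offset < length; offset += chunk) {
    int n = (length - offset < chunk) ? (length - offset) : chunk;

    // Rebind the texture to the start of this chunk; the offset is a
    // multiple of 4 MB, so it satisfies the 256-byte texture alignment.
    cudaBindTexture(NULL, tex1DLimitRef, dInput + offset,
                    n * sizeof(float));

    tex1DLimitKernel<<<1, 512>>>(dInput + offset, dOutput + offset, n);

    // On Windows (WDDM), synchronize so the driver does not batch
    // several launches into one long submission.
    cudaDeviceSynchronize();
}
```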

The limit on textures accessed via tex1Dfetch() is 2**27 elements (corresponding to 512 MB for element type of int/float; 1 GB for element type of double/int2/float2; 2 GB for element type of double2/int4/float4). This is documented in appendix F of the Programming Guide.

Thanks for your reply, but I tested it, and my situation is not caused by the kernel running too long; see my answer below.

Thanks very much for your reply, which is very helpful. I looked at the manual again, appendix F, which says:

Maximum width for a 1D texture reference bound to linear memory is 2**27 for compute capability from 1.0 to 2.x

And I did a test based on this limit; it still failed. Then I found that my test was compiled with debugging enabled in the nvcc kernel compile. So I changed it to compile the kernel with release settings, turning off debugging and turning on optimization with -O3. Then it seems to work.

Because I need two blocks of memory, dInput and dOutput, setting the test data to 1024*1024*128 floats (which is 512 MB) still fails: that requires a 512 MB dInput and a 512 MB dOutput, which exceeds my total video memory. So I had to reduce the test data to length=1024*1024*90, which gives a 360 MB dInput and a 360 MB dOutput, and with the release-version kernel it works! So this fits the documentation quite well. But I think the documentation could use some words about the difference between debug and release builds. I think the debug version requires more resources, so sometimes it cannot reach the limit stated in the documentation.

Thanks again for your help.



The limitations on texture size are imposed by hardware restrictions and are independent of how the software is compiled. As tera already explained, when a kernel fails with CUDA_ERROR_LAUNCH_TIMEOUT, the kernel ran too long and was terminated by the watchdog timer. This is orthogonal to the use of textures in the code. If memory serves, timeout limits are between 5 and 10 seconds depending on the OS platform.

Code built with debug settings usually runs slower (sometimes much slower) than code built with release settings, so one would be more likely to hit the timeout limit with a debug build. The watchdog timers are an operating system feature that prevents the display from freezing indefinitely while a CUDA kernel is running. CUDA kernels running on GPUs to which no display is attached are not subject to the watchdog timer, so that would be one solution to problems with long-running kernels hitting the timeout.

I think you are right: when I run the kernel built as a debug version, the screen blinks and then a notification pops up from the tray saying that the driver has recovered from a crash. I think the OS detected that the video card was not responsive because of the long-running kernel, considered it a system failure, and restarted the driver. Thanks, this makes sense.