Hi,
I’m running a test to see how large the index passed to tex1Dfetch() can be.
Here is my test kernel:
texture<float, cudaTextureType1D, cudaReadModeElementType> tex1DLimitRef;

extern "C" __global__ void tex1DLimitKernel(float* input, float* output, int length)
{
    for (int i = threadIdx.x; i < length; i += blockDim.x)
    {
        output[i] = tex1Dfetch(tex1DLimitRef, i) + 0.5f;
        //output[i] = input[i] + 0.5f;
    }
}
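The host side isn’t shown above; roughly, it looks like the sketch below (a minimal sketch only, not my exact code — the names d_in/d_out/runTest and the CHECK macro are placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder error-check macro, not from my real code.
#define CHECK(call) do {                                        \
    cudaError_t e = (call);                                     \
    if (e != cudaSuccess)                                       \
        printf("CUDA error: %s\n", cudaGetErrorString(e));      \
} while (0)

void runTest(int length)
{
    float *d_in = 0, *d_out = 0;
    size_t bytes = (size_t)length * sizeof(float);
    CHECK(cudaMalloc((void**)&d_in,  bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));

    // tex1Dfetch() reads textures bound to LINEAR memory via cudaBindTexture();
    // cudaBindTextureToArray()/tex1D() is the separate CUDA-array path.
    size_t offset = 0;
    CHECK(cudaBindTexture(&offset, tex1DLimitRef, d_in, bytes));

    // Single block; the kernel strides over threadIdx.x only.
    tex1DLimitKernel<<<1, 512>>>(d_in, d_out, length);
    CHECK(cudaGetLastError());          // catches launch-configuration errors
    CHECK(cudaThreadSynchronize());     // the launch-timeout error shows up here

    CHECK(cudaUnbindTexture(tex1DLimitRef));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
}
```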
Before I give the results, let me describe my test environment:
- Windows 7
- GTX 260 with 1 GB of video memory
First, I set the length to 20480000; it works and the result passes verification.
Then I set the length to 25600000, and the program crashes: the Windows screen blinks several times and I get a CUDA_ERROR_LAUNCH_TIMEOUT error.
Next I went into the Windows control panel and, under the performance settings, chose “best performance” instead of “best appearance”, which disabled all the Windows 7 UI effects. Running with length=25600000 again, it sometimes works, but most of the time it still crashes.
Then I changed the code to:
    //output[i] = tex1Dfetch(tex1DLimitRef, i) + 0.5f;
    output[i] = input[i] + 0.5f;
and it works fine. So the device memory path works, but the tex1Dfetch() path does not.
Here is the deviceQuery output:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.0\C\bin\win32\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "GeForce GTX 260"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 879 MBytes (922091520 bytes)
(27) Multiprocessors x ( 8) CUDA Cores/MP: 216 CUDA Cores
GPU Clock Speed: 1.44 GHz
Memory Clock rate: 1150.00 Mhz
Memory Bus Width: 448-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
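Incidentally, deviceQuery only prints the CUDA-array texture limits. The runtime API also exposes a separate field for the linear-memory limit; I’m taking the field name from the current cudaDeviceProp documentation, so whether it is available in CUDA 4.0 is an assumption on my part:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // maxTexture1D is the CUDA-array (tex1D) limit that deviceQuery prints;
    // maxTexture1DLinear is the separate limit for textures bound to
    // linear memory, i.e. the tex1Dfetch() path.
    printf("maxTexture1D       = %d\n", prop.maxTexture1D);
    printf("maxTexture1DLinear = %d\n", prop.maxTexture1DLinear);
    return 0;
}
```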
So, here are my questions:
- Why does tex1Dfetch() fail at some large index while plain device memory access works? Does it use more resources, so that turning off some UI effects makes it sometimes work?
- The device query says “Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)”, but my index is obviously far larger than 8192, yet it still works. Does this limit apply only to tex1D(), not tex1Dfetch()?
Hope to hear from you guys.
Regards,
Xiang.