Incorrect Result after large loop in kernel

Hello all,

I am just learning, and have found a situation where an incorrect result comes back from a kernel call. I have simplified the code down to a single small file that demonstrates the error, but can’t figure out what I could be doing wrong. The basic problem results from running too many consecutive memory accesses inside a loop. I have tried preventing/forcing loop unrolling (#pragma unroll), intentionally splitting the loop into two (with different index variables), and a few other simple ideas, but nothing helps. Any help is very appreciated, whether pointing out a stupid error on my part or trying the code on other architectures/OSes/cards for reproducibility!

I have attached the code that causes the error. It should generate:

Begin Program:
Initial Value: 0.000000
Final Value: 16777216.000000
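
For reference, the core of the attached file is roughly the following (a simplified sketch rather than the exact attached code; the kernel and variable names are just for illustration). A single thread repeatedly adds 1.0f into a float:

#include <cstdio>

// Hypothetical reconstruction of the attached cudaBug.cu (names illustrative).
__global__ void bigLoopKernel(float *out, long n)
{
    float acc = *out;                   // starts at 0.0f
    for (long i = 0; i < n; ++i)
        acc += 1.0f;                    // many consecutive accumulations
    *out = acc;
}

int main()
{
    const long N = 1L << 24;            // 16777216; anything larger misbehaves
    float h = 0.0f, *d;
    cudaMalloc((void**)&d, sizeof(float));
    cudaMemcpy(d, &h, sizeof(float), cudaMemcpyHostToDevice);
    printf("Begin Program:\nInitial Value: %f\n", h);
    bigLoopKernel<<<1, 1>>>(d, N);      // 1 block, 1 thread
    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Final Value: %f\n", h);
    cudaFree(d);
    return 0;
}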

Interestingly, I only see the problem for loop sizes > 16777216 (that’s 2^24); I don’t know what to make of this, but I don’t believe it’s a coincidence! Below is the output of deviceQuery:

deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: “GeForce 8600M GT”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 268238848 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 0.94 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 8600M GT

cudaBug.cu (2.8 KB)

Indeed that’s not a coincidence. 2^-24 is the relative precision of float variables. So once you reach 2^24, adding 1 will no longer change the variable.
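
You can see the same behaviour on the host, with no GPU involved (a minimal standalone snippet):

#include <cstdio>

int main()
{
    float x = 16777216.0f;              // 2^24: integers up to here are
                                        // exactly representable in a float
    float y = x + 1.0f;                 // 16777217 is not representable;
                                        // rounds back to 16777216.0f
    printf("%.1f + 1 = %.1f\n", x, y);  // prints 16777216.0 + 1 = 16777216.0

    float a = 16777215.0f;              // 2^24 - 1 still increments fine
    printf("%.1f + 1 = %.1f\n", a, a + 1.0f);   // prints 16777216.0
    return 0;
}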

Of course! Do you know if it will simply no longer change the variable, or cause an overflow? In a more interesting version of this I get screen artifacts as a consequence. It never happens with only 1 thread/1 block like in my simple example, but when things get more complicated… I get a nice blurry screen as a result.

Thanks for your help! I should have thought of that :rolleyes:

It will just not change the variable. On IEEE-754 compliant hardware you might be able to get an ‘inexact’ exception, but not on current Nvidia GPUs. So I don’t think your blurry screen is related to that.
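
If you need exact counts past 2^24, one common workaround (a sketch I’m suggesting here, not something taken from your attached file) is to accumulate in an integer and convert once at the end. double on the device would also work, but that needs compute capability 1.3+, and the 8600M GT is 1.1:

// Sketch: count in an integer, so nothing is lost along the way.
__global__ void bigLoopKernelFixed(float *out, long n)
{
    unsigned long long count = 0;   // exact for any feasible loop size
    for (long i = 0; i < n; ++i)
        count += 1;
    *out = (float)count;            // a single correctly-rounded conversion
                                    // at the end, instead of silently losing
                                    // every increment past 2^24
}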

Maybe it’s related to the watchdog timer kicking in? I don’t know what your operating system is, and I don’t have a lot of experience with the watchdog on different operating systems, since I mostly work with a dedicated GPU.
