Consider the following code, which I reduced to the minimum that still reproduces the bug (see the attachment; the forum seems to have a problem with code boxes).
For block sizes of 256 and lower the result is correct, but for a block size of 512 it is not. If I comment out initialize(), or remove either the sine or the cosine from calculate(), the bug disappears and the results are the same for all three block sizes.
Has anyone encountered such a problem? And could anyone please try to build and run this code?
I am using CUDA driver 2.3.1a, CUDA toolkit 2.3a, and SDK 2.3a
on a MacBook Pro with OS X 10.6.1. The bug reproduces on both the 9400 and the 9600 video cards. wigner.cu (1.14 KB)
Do you do any error checking? It might be that you use too many registers for a block size of 512. That would also explain why it works on a GTX 260 (it has double the number of registers compared to your hardware).
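A minimal sketch of the kind of error checking meant here, in case it helps. The kernel body is only a hypothetical stand-in (the real one is in the attached wigner.cu), and `launchChecked` is an illustrative name, not something from the attachment. It uses `cudaDeviceSynchronize`, the current runtime call; toolkits of the 2.3 era spelled it `cudaThreadSynchronize`.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for calculate() from wigner.cu; the real body differs.
// The caller must allocate at least blocks * threadsPerBlock floats.
__global__ void calculate(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = sinf((float)i) + cosf((float)i);
}

// Launches the kernel and returns the first error seen, either at launch
// (invalid configuration, too many resources requested) or during execution.
cudaError_t launchChecked(float *d_data, int blocks, int threadsPerBlock)
{
    assert(blocks > 0 && threadsPerBlock > 0);

    calculate<<<blocks, threadsPerBlock>>>(d_data);

    // A launch that cannot start reports its error here.
    cudaError_t err = cudaGetLastError();

    // Errors raised while the kernel runs only show up after a sync.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    if (err != cudaSuccess)
        fprintf(stderr, "calculate<<<%d, %d>>> failed: %s\n",
                blocks, threadsPerBlock, cudaGetErrorString(err));
    return err;
}
```

If the block of 512 really asked for more registers than the multiprocessor has, the launch would be expected to fail at the `cudaGetLastError` check rather than silently produce wrong numbers.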
1>ptxas info : Compiling entry function '_Z9calculatePf'
1>ptxas info : Used 15 registers, 56+0 bytes lmem, 8+16 bytes smem, 24 bytes cmem(0), 72 bytes cmem(1)
1>ptxas info : Compiling entry function '_Z10initializePf'
1>ptxas info : Used 2 registers, 8+16 bytes smem, 24 bytes cmem(0)
15 registers * 512 threads = 7680. That shouldn’t be a problem here, should it? …
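The arithmetic above can be written out as a small host-side check. This is a sketch under assumptions: the register file sizes are the per-multiprocessor figures I recall from the programming guide for compute capability 1.0/1.1 and 1.2/1.3, the helper names are made up for illustration, and allocation granularity is ignored, so the result is a lower bound on what the hardware actually reserves.

```cuda
#include <cassert>

// Assumed register file sizes per multiprocessor (32-bit registers).
const int kRegsPerSM_cc11 = 8192;   // compute capability 1.0 / 1.1
const int kRegsPerSM_cc13 = 16384;  // compute capability 1.2 / 1.3

// Registers one block needs, ignoring any rounding done by the allocator.
inline int regsPerBlock(int regsPerThread, int threadsPerBlock)
{
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    return regsPerThread * threadsPerBlock;
}

// True if a single block fits into the given register file.
inline bool blockFits(int regsPerThread, int threadsPerBlock, int regsPerSM)
{
    return regsPerBlock(regsPerThread, threadsPerBlock) <= regsPerSM;
}
```

With the ptxas figure of 15 registers, `blockFits(15, 512, kRegsPerSM_cc11)` holds, since 7680 is below 8192. At 17 registers per thread the same block would need 8704 and would only fit on the larger register file.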
P.S. There seems to be a problem on the forum. I tried posting this, but with [0] instead of (0) and 3 times I ended up on the main page without my post being added…
Tried twice to post to the forum but bounced back to main page each time. As they were my first two posts I wondered if they had been placed in a moderation queue or something. Let’s see if this one appears.
With the compiler limiting registers to 8 or 32, both give the same working result. I’m running Vista x64. It might have something to do with the 64-bit driver, but it shouldn’t. I’m also running a 9800 GTX, so no sm_13. :S