Incorrect calculation results for thread block size equal to 512

Consider the following code, reduced to a minimal example that reproduces the bug (see the attachment; the forum seems to have a problem with code boxes).

Execution results:
$ make clean; make; release/wigner
grid 4 block 128
0.321925
grid 2 block 256
0.321925
grid 1 block 512
1.000000

So for block sizes of 256 and lower the result is correct, but for a block size of 512 it is not. If I comment out initialize(), or remove the sine or cosine from calculate(), the bug disappears and the results are the same for all three block sizes.

Has anyone encountered such a problem? Could anyone please try to build and run this code?

I am using CUDA driver 2.3.1a, CUDA toolkit 2.3a, and SDK 2.3a
on a MacBook Pro with OS X 10.6.1. It reproduces on both the 9400 and 9600 video cards.
wigner.cu (1.14 KB)

I tried it on my machine and got the correct output:

grid 4 block 128
0.321925
grid 2 block 256
0.321925
grid 1 block 512
0.321925
Press any key to continue . . .

Tested on Windows 7 64-bit (compiled as a 64-bit executable, -arch sm_10), GPU: GTX 260

Do you do any error checking? It might be that the kernel uses too many registers for a block size of 512, in which case the launch itself fails. That would also explain why it works on the GTX 260 (it has twice as many registers as your hardware).
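For reference, here is a minimal sketch of the kind of error checking meant above. The kernel body is a hypothetical stand-in (the real calculate() lives in the attached wigner.cu), and launchChecked is a helper name invented for this example. It uses cudaDeviceSynchronize, which is the current runtime call; toolkit 2.3 would have used cudaThreadSynchronize instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for calculate() from wigner.cu (assumption, not the original kernel).
__global__ void calculate(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = sinf((float)i) * cosf((float)i);
}

// Hypothetical helper: launches the kernel with the given configuration and
// returns the first CUDA error encountered (cudaSuccess if there was none).
cudaError_t launchChecked(int gridSize, int blockSize)
{
    float *d_data = NULL;
    cudaError_t err = cudaMalloc((void **)&d_data,
                                 (size_t)gridSize * blockSize * sizeof(float));
    if (err != cudaSuccess)
        return err;

    calculate<<<gridSize, blockSize>>>(d_data);

    // Launch-time failures (invalid configuration, too many resources
    // requested for the block) are reported here, not by the launch itself.
    err = cudaGetLastError();

    // Failures during execution only surface once the kernel has finished.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    cudaFree(d_data);

    if (err != cudaSuccess)
        fprintf(stderr, "grid %d block %d failed: %s\n",
                gridSize, blockSize, cudaGetErrorString(err));
    return err;
}
```

Calling launchChecked(4, 128), launchChecked(2, 256) and launchChecked(1, 512) in turn would show whether the 512-thread launch is rejected rather than silently producing a wrong number.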

Thanks, that seems to be the case. That’s what happens when you are lazy and try to estimate the card’s capabilities without consulting the profiler.

1>ptxas info	: Compiling entry function '_Z9calculatePf'
1>ptxas info	: Used 15 registers, 56+0 bytes lmem, 8+16 bytes smem, 24 bytes cmem(0), 72 bytes cmem(1)
1>ptxas info	: Compiling entry function '_Z10initializePf'
1>ptxas info	: Used 2 registers, 8+16 bytes smem, 24 bytes cmem(0)

15 registers * 512 threads = 7680 registers. That should not be a problem here, should it? …

P.S. There seems to be a problem with the forum. I tried posting this with [0] instead of (0), and three times I ended up on the main page without my post being added…
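The arithmetic above can also be checked at run time instead of by hand. The sketch below is only an illustration: the kernel is a hypothetical stand-in for calculate() in wigner.cu, and fitsInRegisterFile and reportRegisterBudget are names made up for this example. The check is deliberately naive, since the hardware allocates registers in chunks and the real requirement can be somewhat higher than registers * threads. Register usage also depends on the compiler build, so the count on another machine need not be 15.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for calculate() from wigner.cu (assumption, not the original kernel).
__global__ void calculate(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = sinf((float)i) * cosf((float)i);
}

// Naive budget check: ignores the allocation granularity of the hardware.
bool fitsInRegisterFile(int regsPerThread, int threadsPerBlock, int regsPerBlock)
{
    return regsPerThread * threadsPerBlock <= regsPerBlock;
}

// Hypothetical helper: asks the runtime how many registers the compiled kernel
// uses and how many registers device 0 offers per block, then compares them.
bool reportRegisterBudget(int threadsPerBlock)
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, calculate) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query kernel or device attributes\n");
        return false;
    }

    bool fits = fitsInRegisterFile(attr.numRegs, threadsPerBlock, prop.regsPerBlock);
    printf("%d regs/thread * %d threads = %d, device has %d per block: %s\n",
           attr.numRegs, threadsPerBlock, attr.numRegs * threadsPerBlock,
           prop.regsPerBlock, fits ? "fits" : "does not fit");
    printf("largest block the runtime reports for this kernel: %d threads\n",
           attr.maxThreadsPerBlock);
    return fits;
}
```

With the numbers from the ptxas output, fitsInRegisterFile(15, 512, 8192) is true, assuming the 8192 registers per multiprocessor of compute capability 1.0/1.1 devices; at 17 registers per thread the same block would no longer fit.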

[aol]Me too![/aol]

Tried twice to post to the forum but bounced back to main page each time. As they were my first two posts I wondered if they had been placed in a moderation queue or something. Let’s see if this one appears.

Paul

P.S. It did. Hmmm…

With the compiler limiting registers to 8 or to 32, I get the same correct result in both cases. I’m running Vista x64. It might have something to do with the 64-bit driver, but it shouldn’t. I’m also running a 9800 GTX, so no sm_13. :S

I:\Nvidia\CUDA\NVIDIA GPU Computing SDK\C\bin\win64\Release>test
grid 4 block 128
0.321925
grid 2 block 256
0.321925
grid 1 block 512
0.321925