I have been benchmarking my program on different video cards with different driver versions and I’ve come across a strange issue.
I have tested 169.21, 175.19 and the new 177.35 beta on the GeForce 8800 GTS 640 and the GeForce 8800 GT 512. There is no difference between the 169.21 and 175.19 versions on either card. The 177.35 beta is generally a fair bit slower on both cards (I guess there might still be some debugging features in there or something).
The strange issue happens only on the GeForce 8800 GT 512, and only with 169.21 and 175.19. My program writes a 2D array of SizeX x SizeY floats, allocated with cudaMallocPitch(). My block size is 16 x 32 threads, and my grid size is (SizeX + 15) / 16 x (SizeY + 31) / 32. Each thread writes to the float at (blockIdx.x * 16 + threadIdx.x, blockIdx.y * 32 + threadIdx.y) as long as that position is within (0, 0) to (SizeX - 1, SizeY - 1); otherwise it does nothing.

I've noticed that if SizeX = 1024 and SizeY = 1024, my kernel takes about 3 times as long as it should. If I change SizeY to 1022 it takes the time I expect, and if I change SizeY to 1023 it takes somewhere in between. Note that this doesn't affect the block size, the grid size (still 64 x 32 blocks in all three cases), or the pitch of the array, only the outcome of the if statement inside the kernel! A stripped-down sketch of the setup is below.
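Here is a simplified sketch of what the code does (not my actual program; the kernel name, the value written, and the sync call are placeholders, but the pitched allocation, block/grid sizes, and bounds check are as described above):

#include <cuda_runtime.h>

// Simplified stand-in for the real kernel: each thread writes one float,
// guarded by the bounds check so edge threads outside SizeX x SizeY do nothing.
__global__ void writeKernel(float *data, size_t pitch, int sizeX, int sizeY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // blockDim.x == 16
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // blockDim.y == 32
    if (x < sizeX && y < sizeY) {
        float *row = (float *)((char *)data + y * pitch);
        row[x] = 1.0f;   // placeholder; the real kernel computes a value here
    }
}

int main()
{
    const int sizeX = 1024;
    const int sizeY = 1024;   // the slow case; 1022 is fast, 1023 is in between

    float *data = 0;
    size_t pitch = 0;
    cudaMallocPitch((void **)&data, &pitch, sizeX * sizeof(float), sizeY);

    dim3 block(16, 32);
    dim3 grid((sizeX + 15) / 16, (sizeY + 31) / 32);   // 64 x 32 blocks for all three SizeY values
    writeKernel<<<grid, block>>>(data, pitch, sizeX, sizeY);
    cudaThreadSynchronize();   // placeholder sync point for timing

    cudaFree(data);
    return 0;
}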
Can anyone shed any light on this?