Strange performance issue on 8800 GT: possibly a driver bug? Fixed in 177.35?

I have been benchmarking my program on different video cards with different driver versions and I’ve come across a strange issue.

I have tested 169.21, 175.19, and the new 177.35 beta on the GeForce 8800 GTS 640 and the GeForce 8800 GT 512. There is no difference between 169.21 and 175.19 on either card. 177.35 is in general a fair bit slower on both cards (I guess there might be some debugging features still enabled in it or something).

The strange issue happens only on the GeForce 8800 GT 512, and only with 169.21 and 175.19. My program writes a 2D array of SizeX x SizeY floats. The array is allocated using cudaMallocPitch(). My block size is 16 x 32 threads and my grid size is (SizeX + 15) / 16 x (SizeY + 31) / 32. Each thread writes to the float at (blockIdx.x * 16 + threadIdx.x, blockIdx.y * 32 + threadIdx.y), as long as that is within (0, 0) to (SizeX - 1, SizeY - 1); otherwise it does nothing. I've noticed that if SizeX = 1024 and SizeY = 1024, my kernel takes about 3 times as long as it should. If I change SizeY to 1022 it takes the time I expect, and if I change it to 1023 it takes somewhere in between. Note that this doesn't affect the block size or the pitch of the array, only the outcome of an if statement inside the kernel!
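For concreteness, here is a minimal sketch of that setup. The kernel name and the dummy write are placeholders of mine; the real kernel does more work per thread, but the launch configuration and bounds check are as described:

#include <cuda_runtime.h>

// Minimal sketch of the setup described above. The kernel name and the
// value written are placeholders; the real kernel does more work.
__global__ void writeKernel(float *out, size_t pitch, int sizeX, int sizeY)
{
    int x = blockIdx.x * 16 + threadIdx.x;   // blockDim.x == 16
    int y = blockIdx.y * 32 + threadIdx.y;   // blockDim.y == 32

    // Out-of-range threads do nothing. Changing SizeY from 1024 to 1022
    // only changes the outcome of this test, yet the runtime triples.
    if (x < sizeX && y < sizeY) {
        float *row = (float *)((char *)out + y * pitch);
        row[x] = 1.0f;  // placeholder write
    }
}

int main()
{
    int sizeX = 1024, sizeY = 1024;  // slow case; sizeY = 1022 is fast
    float *d_out;
    size_t pitch;
    cudaMallocPitch((void **)&d_out, &pitch, sizeX * sizeof(float), sizeY);

    dim3 block(16, 32);
    dim3 grid((sizeX + 15) / 16, (sizeY + 31) / 32);
    writeKernel<<<grid, block>>>(d_out, pitch, sizeX, sizeY);
    cudaThreadSynchronize();  // wait for the kernel to finish

    cudaFree(d_out);
    return 0;
}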

Can anyone shed any light on this?

This is a known bug and will be fixed in the next driver release.

I appreciate the reply, and I'm not trying to be a pain, but would it be possible to get

a) some more information about the cause of the bug, or the circumstances under which I would see it (so that I can work around it),

and/or

b) an expected release date for the fixed driver?

Thanks.

The cause of the bug was a driver change that impacted performance. It cannot be worked around.

We plan to have a new driver release within the next few weeks.

I did some further investigation and determined that the weird slowdowns were due to my memory access patterns (even though they were still fully coalesced). I was able to eliminate them by changing the layout of my data in memory, and I now get excellent performance with 175.19.
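In case it helps anyone hitting the same thing, the kind of layout change I mean looks something like the sketch below: allocating with a manually padded row stride instead of letting cudaMallocPitch() pick it, so consecutive rows start at different address offsets. The 256-byte pad is an assumption you would need to tune per card, not necessarily the exact change I made:

#include <cuda_runtime.h>

int main()
{
    int sizeX = 1024, sizeY = 1024;
    size_t rowBytes = sizeX * sizeof(float);

    // Manual pad instead of the cudaMallocPitch() default. Keep the pad a
    // multiple of 64 bytes so each row stays aligned for coalescing. The
    // 256 here is an assumed starting point, not a magic number.
    size_t pitch = rowBytes + 256;
    float *d_out;
    cudaMalloc((void **)&d_out, pitch * sizeY);

    // ... launch the same kernel as before, passing the padded pitch ...

    cudaFree(d_out);
    return 0;
}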

So I assume the bug you are talking about is the one that causes 177.35 to be significantly slower than 175.19? I've now tested the new 177.79 release, though, and it still seems to have the problem, yet this driver is set to be the next official WHQL release?