Strange interaction between __shared__ memory and cuPrintf

[s]Hi all,

I’m debugging an application (a time-iterative PDE grid solver) that I ported to CUDA, because after about 100 iterations the last timestep differs from the original CPU version by either .000020 or .000021 (which of the two I get appears to be random). cuda-memcheck finds no problems with the program. It runs a few kernels each iteration, but I’ve narrowed the problem down to just one. At one point, each thread in this kernel (which runs more than one block) needs to access the updated data elements before and after its own. This could cause problems if the blocks do not all run at the same time, so I gave each block a shared-memory array like so



__shared__ dataStruct CelebrityPhoneNumbers[THREADSPERBLOCK+2];



Each thread then updates its own element, with the two outer threads updating the extra two elements. Then I pass this to some inlined device code and go to work (the shared memory is not used after this device code). But I noticed a strange thing. Suppose I add

if (i == 65 && blockIdx.x == 0)

   cuPrintf("CelebrityPhoneNumbers[i=65].areacode = %.15f\n", CelebrityPhoneNumbers[i].areacode);

right before I pass this to the device code (adding it after the device code has no effect)… Then the accumulated error leads to a final dt that differs from the CPU version by -.000025 or -.000024. If I add more cuPrintf statements, one right after the other, then at some cycle before the program would normally complete the output turns sour, like this:

[0, 65]: CelebrityPhoneNumbers[i=65].areacode = 0.433244496071099

[0, 65]: CelebrityPhoneNumbers[i=65].areacode = -nan

[0, 65]: CelebrityPhoneNumbers[i=65].areacode = -nan

where before all three outputs would be identical, and the program limps along for a bit until dt goes to infinity and the program ends. Does shared memory “wear out”? The kernel closes and is relaunched each cycle so CelebrityPhoneNumbers should be rebuilt every time, no?

If anyone has any insight I’d appreciate it very much; I’m not sure where to go from here.



EDIT: Clarity

EDIT: Update: Apparently this error doesn’t depend on CelebrityPhoneNumbers being involved in the cuPrintf() statement… must have something to do with thread alignment, but __syncthreads() right after this doesn’t have any effect.[/s]

EDIT: The problem was a missing __syncthreads(). The program now works correctly.
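
For anyone who lands here with the same symptoms: the post does not say exactly where the barrier was missing, but a common spot is between the threads writing the shared array and the threads reading their neighbours' elements. Below is a minimal sketch of that pattern, not the original solver. The kernel name stencilKernel, the helper runStencil, the use of plain float instead of dataStruct, and the 3-point sum are all assumptions made for illustration; the halo values are read from the input array here. Error checking on the CUDA calls is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADSPERBLOCK 128

// Hypothetical 3-point stencil: out[g] = in[g-1] + in[g] + in[g+1],
// with out-of-range neighbours treated as 0. Assumes blockDim.x == THREADSPERBLOCK.
__global__ void stencilKernel(const float *in, float *out, int n)
{
    // One array per block: THREADSPERBLOCK interior elements plus 2 halo elements.
    __shared__ float tile[THREADSPERBLOCK + 2];

    int i = threadIdx.x;
    int g = blockIdx.x * blockDim.x + i;

    // Each thread writes its own element (shifted by 1 to leave room for the left halo).
    tile[i + 1] = (g < n) ? in[g] : 0.0f;

    // The two outer threads also write the two extra halo elements.
    if (i == 0)
        tile[0] = (g > 0 && g - 1 < n) ? in[g - 1] : 0.0f;
    if (i == blockDim.x - 1)
        tile[i + 2] = (g + 1 < n) ? in[g + 1] : 0.0f;

    // Without this barrier a thread may read tile[i] or tile[i + 2] before the
    // neighbouring thread has written it, giving stale or garbage values.
    // Every thread in the block reaches it (no early return above).
    __syncthreads();

    if (g < n)
        out[g] = tile[i] + tile[i + 1] + tile[i + 2];
}

// Hypothetical host wrapper: copies the input to the device, runs the kernel, copies back.
std::vector<float> runStencil(const std::vector<float> &h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n, 0.0f);
    if (n == 0)
        return h_out;

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + THREADSPERBLOCK - 1) / THREADSPERBLOCK;
    stencilKernel<<<blocks, THREADSPERBLOCK>>>(d_in, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

A race like this would also fit the behaviour described above: an extra cuPrintf changes the timing between warps, so the error shifts or turns into NaN without the printed variable being involved at all. Shared memory does not "wear out"; it is simply uninitialized at each launch, so any read that happens before the matching write sees whatever was left there.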