I have a problem in CUDA that has kept me busy for two days now. It's all about this line of code, executed on the device:
s_result[(tx+ty*bw)/8] |= 1<<tx%8;
Eight threads in parallel each shift a 1 to the left by (thread ID modulo 8) bits and then OR the result into the same char.
It works fine in device-emulation mode, but I get incorrect values with true parallel execution. Any ideas why? Is it even possible for several threads to access one char simultaneously?
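
For illustration, here is a minimal sketch of the pattern described above together with one possible workaround. It rests on an assumption rather than a confirmed diagnosis: `|=` on a shared char is a non-atomic read-modify-write, so threads that update the same byte at the same time can overwrite each other's bits, whereas emulation mode runs the threads one after another. The kernel name `packBits`, the helper `runPackBits`, the array `s_words`, and the 8x8 launch configuration are hypothetical; only `tx`, `ty`, `bw` and `s_result` come from the post. The sketch packs bits into 32-bit words instead of chars so that `atomicOr` can be used, which for shared memory needs a device of compute capability 1.2 or higher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread sets one bit in a shared bitmask.
// Assumes blockDim.x * blockDim.y is a multiple of 32 and at most 1024.
__global__ void packBits(unsigned int *out)
{
    __shared__ unsigned int s_words[32];   // 32 words * 32 bits = 1024 bits

    const unsigned int tx  = threadIdx.x;
    const unsigned int ty  = threadIdx.y;
    const unsigned int bw  = blockDim.x;
    const unsigned int idx = tx + ty * bw;                     // linear thread index
    const unsigned int nWords = (blockDim.x * blockDim.y) / 32;

    // Clear the shared words before any bit is set.
    if (idx < nWords) s_words[idx] = 0u;
    __syncthreads();

    // Form from the post (non-atomic read-modify-write on one byte):
    //   s_result[(tx + ty * bw) / 8] |= 1 << (tx % 8);
    // Atomic alternative on a 32-bit word:
    atomicOr(&s_words[idx / 32], 1u << (idx % 32));
    __syncthreads();

    // Copy the packed words back to global memory.
    if (idx < nWords) out[idx] = s_words[idx];
}

// Hypothetical host helper: launches one 8x8 block (64 threads, 2 words).
void runPackBits(unsigned int h_out[2])
{
    unsigned int *d_out = NULL;
    cudaMalloc((void **)&d_out, 2 * sizeof(unsigned int));
    packBits<<<1, dim3(8, 8)>>>(d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main()
{
    unsigned int h_out[2];
    runPackBits(h_out);
    printf("%08x %08x\n", h_out[0], h_out[1]);
    return 0;
}
```

Since every one of the 64 threads sets its own bit, both words should come back with all bits set if no update is lost. Another option under the same assumption would be to let a single thread per byte assemble the eight bits, which avoids atomics entirely.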