Hi Cudouer. The biggest architectural difference between the GTX 260 and GTX 480 is the cache structure, and that’s exactly where I’d start looking for your problem. On the GTX 260, global memory accesses go straight out to main memory. The GTX 480, however, has an L1/L2 cache hierarchy, and the thing that trips up a lot of people is that the L1 cache is not coherent across SMs; only L2 is. Based on that, my best guess is that threads from different blocks are reading/writing the same global memory location. On a GTX 260, one thread can write to global memory and the result eventually becomes visible to threads in other blocks. On a GTX 480, though, a thread’s write may sit in the L1 cache of its particular SM until the data is evicted to L2. That can cause real problems: if threads on another SM write to the same location in the meantime, one of the updates can be lost!
You mentioned you have a reduction kernel and that you use the volatile keyword, which is good, but volatile alone isn’t going to fix cross-block communication. As Lev already said, there is likely some sort of synchronization issue or race condition in your code. For cross-block data communication, you definitely need atomics.
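To illustrate what I mean, here’s a minimal sketch (kernel and variable names are mine, not from your code) of the kind of cross-block accumulation that breaks on Fermi without atomics, and the atomic version that works:

```
// UNSAFE sketch: many blocks do a plain read-modify-write on the same
// global location. Each SM can be working on its own L1 copy of *out,
// so updates from other blocks can be silently lost.
__global__ void sumUnsafe(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        *out += in[i];          // race between blocks
}

// SAFE sketch: atomicAdd performs the read-modify-write at a level that
// is coherent across all SMs, so no update is lost. Note that float
// atomicAdd requires compute capability 2.0+ (so it works on the GTX 480
// but not the GTX 260, where you’d need an integer or fixed-point trick).
__global__ void sumSafe(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, in[i]);
}
```

The usual pattern for a reduction is still to reduce within each block in shared memory first, and only do one atomicAdd per block at the end, so the atomic traffic stays small.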
Take a look at section G.4.2 in the CUDA programming guide for instructions on how to disable L1 caching of global memory. See if that solves your problem, then you can debug from there.
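For reference, that setting is a compile-time flag passed through to ptxas (assuming you build with nvcc; `my_kernel.cu` below is just a placeholder for your source file):

```
# Compile with L1 caching of global memory loads disabled
# ("cache global" only, i.e. loads are cached in L2 but not L1):
nvcc -arch=sm_20 -Xptxas -dlcm=cg my_kernel.cu -o my_kernel
```

If the bug disappears with `-dlcm=cg`, that’s strong evidence you have an L1 coherence problem between blocks, and you can go hunting for the shared location.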