I have code that I’m fairly sure is working correctly, except for that CUDA 4/GTX bug where consistent reads from memory cause random errors (http://forums.nvidia.com/index.php?showtopic=199969), and that occasionally one of my main kernel calls fails. This always happens when I increase my grid size considerably (which means more blocks, though still far below the hardware limits), but if I massage the program (recompiling a few times, just commenting out some error-checking functions) it will eventually run normally and correctly. (The kernel error only breaks the program later, when cudaMemcpy tries to run.)
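For context, the way I detect these failures is by checking the launch result right away instead of waiting for the next cudaMemcpy to report the error. A minimal sketch of that pattern (the CHECK_CUDA macro name is my own, not from any library):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CHECK_CUDA is just an illustrative name; wrap runtime calls with it.
#define CHECK_CUDA(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// After a kernel launch, this catches both launch-configuration errors
// and errors during kernel execution immediately, rather than letting
// them surface at the next cudaMemcpy:
//
//   myKernel<<<grid, block>>>(/* args */);
//   CHECK_CUDA(cudaGetLastError());        // launch errors
//   CHECK_CUDA(cudaDeviceSynchronize());   // execution errors
```

The cudaDeviceSynchronize call does cost performance, so I only enable it in debug builds.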
Has anyone encountered a problem like this? Local memory limits should only apply to individual blocks, I think, and the program runs fine with small numbers of blocks at 128 threads each (the ratio I use). Cold reboots don’t seem to affect the problem.