I have run into a strange error in my program.
I have two GTX295 cards (4 GPUs in total) and run 4 host threads, one per GPU. The 4 threads run independently and never write to the same host memory; they only read from the same mapped page-locked memory.
If I run the 4 threads at the same time, the results are wrong. If I run them one after another, the results are correct.
Because I have more than one thread, I can't use cuda-gdb. I also could not run the program in -deviceemu mode, because allocating the page-locked memory fails there.
My guess is that there is an invalid memory access somewhere in global memory.
Is there any way to track down errors like this?
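
To make the setup above concrete, here is a minimal sketch of one host thread per GPU reading the same mapped page-locked buffer, with every CUDA call checked so that a failure is reported per device. It is not the actual program: the names `scale_kernel`, `worker`, `cuda_ok` and `count_mismatching_devices` are hypothetical, `std::thread` stands in for whatever threading library is really used, and allocating the buffer with `cudaHostAllocPortable | cudaHostAllocMapped` is an assumption about how the shared memory should be set up.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: print a failed CUDA call together with the GPU it belongs to.
static bool cuda_ok(cudaError_t err, const char* what, int dev) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "GPU %d: %s failed: %s\n", dev, what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical kernel: reads the shared input, writes only to this GPU's own output.
__global__ void scale_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// One host thread per GPU. Sets *mismatch to 0 only if every step succeeded
// and the result copied back matches the expected values.
static void worker(int dev, const float* host_in, int n, int* mismatch) {
    *mismatch = 1;
    if (!cuda_ok(cudaSetDevice(dev), "cudaSetDevice", dev)) return;
    // Note: old toolkits also need cudaSetDeviceFlags(cudaDeviceMapHost) here,
    // before the context is created; it is left out of this sketch.

    // Each device must ask for its own device pointer to the mapped host buffer.
    float* dev_in = nullptr;
    if (!cuda_ok(cudaHostGetDevicePointer((void**)&dev_in, const_cast<float*>(host_in), 0),
                 "cudaHostGetDevicePointer", dev)) return;

    float* dev_out = nullptr;
    if (!cuda_ok(cudaMalloc((void**)&dev_out, n * sizeof(float)), "cudaMalloc", dev)) return;

    scale_kernel<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
    // Check the launch and the execution separately: bad accesses show up at the sync.
    bool ok = cuda_ok(cudaGetLastError(), "kernel launch", dev) &&
              cuda_ok(cudaDeviceSynchronize(), "kernel execution", dev);

    std::vector<float> out(n);
    ok = ok && cuda_ok(cudaMemcpy(out.data(), dev_out, n * sizeof(float),
                                  cudaMemcpyDeviceToHost), "cudaMemcpy", dev);
    cudaFree(dev_out);
    if (!ok) return;

    for (int i = 0; i < n; ++i)
        if (out[i] != 2.0f * host_in[i]) return;
    *mismatch = 0;
}

// Runs all GPUs concurrently and returns how many produced an error or a wrong result.
// Returns 0 if no CUDA device is available.
int count_mismatching_devices(int n) {
    int num_devices = 0;
    if (cudaGetDeviceCount(&num_devices) != cudaSuccess || num_devices == 0) return 0;

    // Assumption: "portable" makes the pinned buffer usable from every context,
    // "mapped" makes it directly readable by kernels.
    float* host_in = nullptr;
    if (!cuda_ok(cudaHostAlloc((void**)&host_in, n * sizeof(float),
                               cudaHostAllocPortable | cudaHostAllocMapped),
                 "cudaHostAlloc", -1)) return num_devices;
    for (int i = 0; i < n; ++i) host_in[i] = static_cast<float>(i);

    std::vector<int> mismatch(num_devices, 1);
    std::vector<std::thread> threads;
    for (int d = 0; d < num_devices; ++d)
        threads.emplace_back(worker, d, host_in, n, &mismatch[d]);
    for (auto& t : threads) t.join();

    cudaFreeHost(host_in);
    int bad = 0;
    for (int m : mismatch) bad += m;
    return bad;
}
```

Calling `count_mismatching_devices(1 << 16)` from `main` exercises all GPUs at once. If a reduced case like this passes while the real program still fails, the difference between the two (allocation flags, where each device pointer comes from, missing synchronization) is probably the place to look.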