It seems that nvprof alters the execution of a program!
The program runs without error when I don’t use nvprof. However, when I run it under nvprof, I get an internal error from the program itself, not an NVIDIA error.
The program is GROMACS, and it uses MPI threads. I think the profiling overhead changes the thread scheduling, so one thread reaches a point where it needs input from another thread while that other thread is still being held up by nvprof. I tried some options, including --profile-child-processes, but they had no effect.
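To make the kind of failure I suspect concrete, here is a toy sketch in Python. It is not GROMACS code, and the names (`run`, `producer_delay`, `consumer_wait`) are made up for illustration. It assumes the consumer waits a fixed time instead of synchronizing properly, so any extra delay on the producer side (standing in for profiler overhead) makes it read too early.

```python
import threading
import time


def run(producer_delay, consumer_wait=0.2):
    """Return what the consumer sees after waiting a fixed time for the producer."""
    box = {}

    def producer():
        # Stands in for computation plus any (hypothetical) profiler overhead.
        time.sleep(producer_delay)
        box["value"] = 42

    t = threading.Thread(target=producer)
    t.start()
    # The consumer wrongly assumes the producer has finished by now.
    time.sleep(consumer_wait)
    seen = box.get("value")  # None if the producer was too slow
    t.join()
    return seen


fast = run(producer_delay=0.0)  # "no profiler": producer finishes in time
slow = run(producer_delay=1.0)  # "with overhead": consumer reads too early
print(fast, slow)
```

With these delays, `fast` should come back as 42 and `slow` as None. If something like this is happening, the bug would be in the program's synchronization and nvprof would only be exposing it, but that is a guess on my part.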
Something like that… Has anyone faced such a problem? I hope txbob has an idea.