I am working on an algorithm for research and will publish it if the results are good. I developed the initial code on my laptop (it has a 950M, compute capability 5.0); however, the laptop has little device memory, so the CPU version of the code is faster than the GPU version there. I then tried running the code on a K40c and a K40m (compute capability 3.5). In all three cases I used the same dataset, yet the computed results differ. Is this common? Is my algorithm behaving differently depending on which GPU device I am using? I don’t have a whole lot of experience programming CUDA, but I assume this should not be the case.
Please suggest how I can go about identifying and fixing the problem.
If you use exactly the same CUDA version on both devices, use identical compilation switches, and your code contains neither bugs (e.g. race conditions, out-of-bounds accesses) nor non-deterministic operations (e.g. atomic floating-point operations), then you should get matching results for the same input data set passed to the GPU.
Use cuda-memcheck to check for race conditions and out-of-bounds accesses. Note that this tool will find many, but not all, instances of such problems. Your host computation (if any) may have similar issues; use valgrind or a similar tool to find some of them. Your host code may also pass different input data to the GPU due to different compilers, compiler versions, or libraries. When in doubt, dump the entire GPU input in raw binary form to double-check.
Avoid JIT compilation, as it makes enforcing the “same CUDA version” provision trickier: the JIT compiler is part of the CUDA driver and may be updated on a different schedule than the CUDA toolkit (and therefore the offline compiler) itself.
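One way to keep the JIT out of the picture is to embed machine code for every target architecture at compile time. An illustrative nvcc invocation for the two GPU families mentioned above (file names are placeholders):

```shell
# Embed SASS for sm_35 (Tesla K40) and sm_50 (GeForce 950M) so the
# driver never has to JIT-compile PTX at application load time.
nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -o myalgo myalgo.cu
```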