I have finished writing my CUDA kernel and confirmed that it runs as expected when compiled directly with nvcc, by:
- Validating with test data over 100 runs (just in case)
- Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether this difference is a cause for concern or the expected behavior.
Note: The application also gives correct and consistent results while being profiled by nvprof.