I can see that “region” = “total” - “init”, but “kernels” + “data” comes nowhere near “region”. Where is the rest of the time going, if not to “kernels” or “data”?
If the host you are running on has multiple CPUs, I would use your preferred method to bind the process to one particular core, to prevent the program from jumping around.
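On Linux, for example, “taskset” can do this (the executable name here is just a placeholder):

taskset -c 0 ./yourprogram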
There might be a granularity issue: the region time is measured on the host, while the kernel and data times come from the device. I would try making the test run longer to see whether the difference becomes smaller.
Thanks. What I am trying to do is show how fast the code would run if I could cut out the data transfer (i.e. make the data resident on the accelerator between subroutine calls). What would be the best number to quote: the summed “kernels” times, the total program time minus the summed “data” times, or something else?
In response to your suggestions, I used “taskset” on the Linux host to bind the process to a single core. I also made sure the code was not using its OpenMP directives. The discrepancy was still there.
Regarding granularity, another kernel in the same code had these stats:
1227: region entered 4 times
time(us): total=363636 init=6 region=363630
kernels=13323 data=182556
These times work out to tens of thousands of microseconds per call, so if granularity were the explanation it would have to be very poor granularity indeed.
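To put numbers on it: “kernels” + “data” = 13323 + 182556 = 195879 us, which leaves 363630 - 195879 = 167751 us, roughly 46% of the region time, unaccounted for.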
I find this odd as well. I’ve tried my best to recreate the mismatched times here, but so far no luck. The profiling information is collected via calls to the CUDA driver, so there could be a problem in how the information is gathered, or it could be something specific to your system. If you’re able to share the code, I’d be interested in seeing whether I can recreate the problem here.
Another tack would be to drop “-ta=time” and instead use the CUDA profiler. Full documentation can be found HERE, but the simplest method is to set the following environment variables and then run your program.
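For example, assuming a bash-style shell (the log file name here is just illustrative):

export CUDA_PROFILE=1                        # turn the profiler on
export CUDA_PROFILE_LOG=cuda_profile.%d.log  # where to write the profile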
You can change CUDA_PROFILE_LOG to whatever file name you wish. The “%d” refers to the device used; if multiple devices are used, multiple profiles will be created.
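With the example name above, a run on two devices would produce cuda_profile.0.log and cuda_profile.1.log.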