This slowdown is probably not caused by a deadlock. Deep learning applications launch a very large number of kernels in rapid succession, and each of these kernels is usually small and lightweight. These applications also rely heavily on concurrency, launching kernels from several streams at once.
When you profile a metric or event with nvprof, all the concurrent kernels in the application are serialized, i.e. launched one after the other. This serialization is what causes the tremendous slowdown.
Furthermore, metrics like flop_sp_efficiency cannot be collected in a single pass; nvprof has to replay each kernel to gather all the required counters, which increases profiling time even further.
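For example, a metric-collection run of this shape (using the metric mentioned above; the Python command stands in for whatever your application actually is) is what triggers the serialization and kernel replay:

```
# Each kernel is serialized and replayed until all hardware counters
# needed for the metric have been collected, which compounds the overhead.
nvprof --metrics flop_sp_efficiency python my_tensorflow_app
```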
The good news is that deep learning applications launch the same kernels over and over again, and their performance doesn't vary much across runs. So you can get a meaningful picture of the performance profile using the following steps:
- Use the Visual Profiler to get a trace of the application, without doing any profiling. You can run the application with default settings from within the Visual Profiler. Alternatively, you can run the command "nvprof -o foo.nvprof python my_tensorflow_app" and load the resulting foo.nvprof file into the Visual Profiler.
Viewing the trace in the Visual Profiler will give you a good idea of how the application launches kernels. Note that a pure tracing run like this, without profiling, does not serialize kernels and hence won't cause the slowdown; a minimal invocation is sketched below.
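For instance, a trace-only run could look like this:

```
# Trace only: no --metrics or --events requested, so kernels stay
# concurrent and the overhead stays small.
nvprof -o foo.nvprof python my_tensorflow_app
```

You can then import foo.nvprof into the Visual Profiler (File > Import) to inspect the timeline.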
- Now run the application with your previous profiling command. Just as before, you will experience a slowdown. After a few minutes, kill the application early with Ctrl+C; nvprof will report the performance metrics for the kernels that finished up to that point. You should get metrics for nearly all of the kernels, and since the same kernels repeat again and again, this data should be meaningful and representative of the rest of your application.
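If you prefer not to interrupt the run by hand, and your nvprof version supports the --timeout option, you can also let nvprof kill the application automatically after a fixed interval (the 300-second value here is just an example):

```
# Terminate the application after roughly 300 seconds; metrics for the
# kernels that completed within that window are still reported.
nvprof --timeout 300 --metrics flop_sp_efficiency python my_tensorflow_app
```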
I hope this helps.