Profiling deadloop (replay kernel) with nvprof on deep neural network

When I use nvprof to profile a deep neural network built with Keras, the profiling process eventually gets trapped in a dead loop that keeps printing “Replaying kernel cgemm_sm35_ldg_tn_64x8x64x16x16” and never stops (for days, and most probably forever).

Some details on configuration:
Ubuntu 16.04
CUDA 8.0/ cuDNN 5.1
Tesla K40 GPU
Tensorflow 1.1.0
Keras 0.5

Hi, talksmalltao

First of all, sorry for the trouble.

Which nvprof command are you using?

Basically, the more metrics/events you ask nvprof to collect, the more times it has to replay each kernel to get the results.

Could you please try collecting just one, or only a few, metrics/events?

Hi veraj,

I’m using the nvprof that ships with CUDA 8.

As for the metrics/events to collect, my observation is that nvprof works well when profiling a TensorFlow deep neural network with events and with some ‘simple’ metrics (e.g. l1_cache_global_hit_rate). However, if the metric depends on GPU time (for example throughput-like metrics such as flop_sp_efficiency), nvprof gets trapped in the “Replaying kernel cgemm_sm35_ldg_tn_64x8x64x16x16” dead loop, even if it is the only metric being collected.

Thanks talksmalltao.

I’ll find a Tesla and try it with TensorFlow.
I’ll get back to you once I have finished.

Hi, talksmalltao

I have found a Tesla and installed TensorFlow and Keras.
Here are some details I need you to confirm:


  1. Which neural network are you using?
  2. Do I need to download anything else, such as a training dataset? If so, how and where can I get it?
  3. Can you give me the exact command or steps that reproduce the issue?


Hi veraj,

  1. I used VGG16 from Keras
  2. I guess any training dataset is OK as long as Keras can feed it to the NN model
  3. nvprof --metrics flop_sp_efficiency python trainDNN***.py

Some clarifications:

  1. Keras is not necessary; I guess using TensorFlow without Keras would also reproduce the result
  2. I guess the problem is not limited to model training; the result can also be reproduced with inference
  3. The problem is not limited to flop_sp_efficiency; the result can be reproduced with any throughput-like metric
  4. It may not be limited to VGG either, since nothing here is specific to the VGG architecture

Hi, talksmalltao

I have already reproduced the issue using the TensorFlow Inception v3 model and have filed a bug with the dev team.
I will update this thread once I hear anything.

Thanks for raising this.

Hi veraj,

I appreciate your effort and look forward to a quick fix from the dev team.


This slowdown is probably not caused by a deadlock. Deep learning applications launch a very large number of kernels in rapid succession, and each of these kernels is usually small and lightweight. These apps also rely heavily on concurrency, which means that multiple kernels are launched concurrently from several streams.

When you attempt to profile a metric or event with nvprof, all the concurrent kernels in the application are serialized - i.e. they are launched one after the other. This is what causes the tremendous slowdown.

Furthermore, metrics like flop_sp_efficiency cannot be profiled in a single pass, and the kernel needs to be replayed to measure them. This further increases profiling time.
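To get a feel for how serialization and replay add up, here is a rough back-of-the-envelope sketch. The function name and all the numbers are hypothetical and chosen only to illustrate the scaling; they are not measurements from nvprof or from any real network.

```python
def estimated_profiling_seconds(kernel_launches, passes_per_kernel, seconds_per_pass):
    """Rough cost when every launch is serialized and replayed once per pass.

    This is a simplified model (an assumption for illustration): it ignores
    tracing overhead and treats every kernel as equally expensive.
    """
    return kernel_launches * passes_per_kernel * seconds_per_pass


# Hypothetical numbers, not measured values.
launches = 1_000_000   # kernel launches over a whole training run
passes = 10            # replays needed per kernel for the requested metric
per_pass = 0.01        # seconds per replay, including replay overhead

total = estimated_profiling_seconds(launches, passes, per_pass)
print(f"{total / 3600:.1f} hours")
```

Even with modest per-kernel costs, the product grows quickly, which is consistent with a run that appears to never finish.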

The good news is that deep learning apps launch the same kernels over and over again, and their performance does not vary much from one launch to the next. So you can get a meaningful picture of the performance profile using the following steps:

  1. Use the Visual Profiler to get a trace of the application, without doing any profiling. You can run the application with default settings using the Visual Profiler. Alternatively, you can use the command "nvprof -o foo.nvprof python my_tensorflow_app" and load the resulting foo.nvprof file into the Visual Profiler.

    Viewing the trace in the Visual Profiler will give you a good idea of how the application is launching kernels. Note that a pure tracing run like this, without profiling, will not serialize kernels and hence won’t cause the slowdown.

  2. Now run the application with your previous profiling command. Just as before, you will experience a slowdown. After a few minutes, kill the application early using Ctrl+C. nvprof will report the performance metrics of the kernels that finished up to that point. You should get metrics for nearly all kernels. This data should be meaningful and representative of the rest of your application, since the same kernels repeat again and again.
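The two steps above could be scripted roughly as follows. This is only a sketch: the helper names and the five-minute limit are my own choices, and my_tensorflow_app is the placeholder from step 1. It also assumes a POSIX system and that sending SIGINT to the whole process group behaves like pressing Ctrl+C in a terminal, which I have not verified for nvprof.

```python
import os
import signal
import subprocess


def trace_command(app_args, output="foo.nvprof"):
    """Step 1: trace only (no metrics), writing a file for the Visual Profiler."""
    return ["nvprof", "-o", output] + list(app_args)


def metric_command(app_args, metric="flop_sp_efficiency"):
    """Step 2: collect a single metric, which triggers kernel replay."""
    return ["nvprof", "--metrics", metric] + list(app_args)


def run_with_interrupt(cmd, seconds):
    """Run cmd and, if it is still running after `seconds`, interrupt it.

    The child gets its own process group so that SIGINT reaches nvprof and
    the profiled application together, as Ctrl+C would (assumption).
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGINT)
        return proc.wait()


# Example usage (commented out because it needs nvprof and a GPU):
# app = ["python", "my_tensorflow_app"]
# subprocess.run(trace_command(app))
# run_with_interrupt(metric_command(app), seconds=300)
```

Collecting one metric per run keeps the number of replay passes as low as possible; the limit passed to run_with_interrupt is a trade-off between waiting time and how many distinct kernels have been seen.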

I hope this helps.