This slowdown is probably not caused by a deadlock. Deep learning applications launch a very large number of kernels in rapid succession, and each of these kernels is usually small and lightweight. These applications also rely heavily on concurrency, launching kernels from several streams at once.
When you profile a metric or event with nvprof, all the concurrent kernels in the application are serialized, i.e. launched one after the other. This serialization is what causes the tremendous slowdown.
Furthermore, metrics like flop_sp_efficiency cannot be collected in a single pass; nvprof has to replay each kernel to gather all the required counters, which increases profiling time even further.
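For example, a metric-collection run of this shape (using the metric mentioned above; the Python command stands in for whatever your application actually is) is what triggers the serialization and kernel replay:

```
# Each kernel is serialized and replayed until all hardware counters
# needed for the metric have been collected, which compounds the overhead.
nvprof --metrics flop_sp_efficiency python my_tensorflow_app
```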
The good news is that deep learning applications launch the same kernels over and over again, and their performance doesn't vary much across runs. So you can get a meaningful picture of the performance profile using the following steps:
- Use the Visual Profiler to get a trace of the application, without doing any profiling. You can run the application with default settings from within the Visual Profiler. Alternatively, you can run the command "nvprof -o foo.nvprof python my_tensorflow_app" and load the resulting foo.nvprof file into the Visual Profiler.
Viewing the trace in the Visual Profiler will give you a good idea of how the application launches kernels. Note that a pure tracing run like this, without profiling, does not serialize kernels and hence won't cause the slowdown; a minimal invocation is sketched below.
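For instance, a trace-only run could look like this:

```
# Trace only: no --metrics or --events requested, so kernels stay
# concurrent and the overhead stays small.
nvprof -o foo.nvprof python my_tensorflow_app
```

You can then import foo.nvprof into the Visual Profiler (File > Import) to inspect the timeline.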
- Now run the application with your previous profiling command. Just as before, you will experience a slowdown. After a few minutes, kill the application early with Ctrl+C; nvprof will report the performance metrics for the kernels that finished up to that point. You should get metrics for nearly all of the kernels, and since the same kernels repeat again and again, this data should be meaningful and representative of the rest of your application.
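If you prefer not to interrupt the run by hand, and your nvprof version supports the --timeout option, you can also let nvprof kill the application automatically after a fixed interval (the 300-second value here is just an example):

```
# Terminate the application after roughly 300 seconds; metrics for the
# kernels that completed within that window are still reported.
nvprof --timeout 300 --metrics flop_sp_efficiency python my_tensorflow_app
```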
I hope this helps.