nvprof error Application received signal 11

Hi,

I am trying to profile caffe with nvprof.

The problem is when caffe reaches around 9000 iterations it breaks.
However, If I run the same application without nvprof it finishes successfully.

The command that I am running is:
/usr/local/cuda-8.0/bin/nvprof --profile-child-processes --print-gpu-summary ./build/tools/caffe train …

The output of the error:
[…]
I0110 11:33:31.030156 119795 sgd_solver.cpp:106] Iteration 9200, lr = 0.001
*** Aborted at 1484066012 (unix time) try “date -d @1484066012” if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 119795 (TID 0x3fffb7fff190) from PID 0; stack trace: ***
@ 0x3fffb7f90478 ([vdso]+0x477)javascript:void(0);
@ 0x3fff9aa2b908 (unknown)
@ 0x3fff9a9ec6ec (unknown)
@ 0x3fff9a9e7178 (unknown)
@ 0x3fff9b2702ac (unknown)
@ 0x3fff9b0ad858 (unknown)
@ 0x3fff9b1452ac (unknown)
@ 0x3fff9b20b288 (unknown)
@ 0x3fff9b19dcc8 (unknown)
@ 0x3fff9b0b6094 (unknown)
@ 0x3fff9b0b7134 (unknown)
@ 0x3fff9afabe9c (unknown)
@ 0x3fff9afabfdc (unknown)
@ 0x3fff9b13e770 cuMemcpy
@ 0x3fffb700c578 (unknown)
@ 0x3fffb6fe1870 (unknown)
@ 0x3fffb7027658 cudaMemcpy
@ 0x3fffb78cc738 caffe::caffe_copy<>()
@ 0x3fffb78e5f9c caffe::BasePrefetchingDataLayer<>::Forward_gpu()
@ 0x3fffb786d8d0 caffe::Net<>::ForwardFromTo()
@ 0x3fffb786dca8 caffe::Net<>::Forward()
@ 0x3fffb788c690 caffe::Solver<>::Step()
@ 0x3fffb788ce68 caffe::Solver<>::Solve()
@ 0x3fffb78e4f08 caffe::P2PSync<>::Run()
@ 0x10013668 train()
@ 0x100109a0 main
@ 0x3fffb6dd4700 generic_start_main.isra.0
@ 0x3fffb6dd48f4 __libc_start_main
@ 0x0 (unknown)
==119795== Profiling application: ./build/tools/caffe train --solver=examples/cifar10/cifar10_full_solver.prototxt -gpu all
==119795== Profiling result:
Time(%) Time Calls Avg Min Max Name
30.43% 28.9407s 15176 1.9070ms 620.35us 3.4906ms void cudnn::detail::divNorm_bw_45d_kernel<float, float, bool=0>(cudnnTensorStruct, float const , float const , cudnnTensorStruct, float const , float const , float const , float const , float, cudnnTensorStruct, cudnnLRNStruct)
17.24% 16.3984s 95880 171.03us 43.648us 362.82us void cudnn::detail::implicit_convolve_sgemm<float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0>(int, int, int, float const , int, cudnn::detail::implicit_convolve_sgemm<float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0>, float const , kernel_conv_params, int, float, float)
16.65% 15.8349s 23970 660.61us 273.98us 1.2552ms void cudnn::detail::precomputed_convolve_sgemm<float, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1>(int, int, int, float const , int, cudnn::detail::precomputed_convolve_sgemm<float, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1>, float const , kernel_conv_params, int, float, float, int)
7.82% 7.43760s 22764 326.73us 203.55us 749.47us void cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>(int, int, int, float const , int, cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>, float const , kernel_grad_params, int, float, int, int)
7.47% 7.10733s 15176 468.33us 342.02us 763.68us void cudnn::detail::dgrad_alg1_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, bool=0>(int, int, int, float const , int, float const , int, cudnn::detail::dgrad_alg1_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, bool=0>, kernel_grad_params, int, int, float, int)
7.42% 7.05360s 15980 441.40us 75.264us 843.94us void cudnn::detail::divNorm_fw_45d_kernel<float, float, bool=0>(cudnnTensorStruct, float const , float const , cudnnTensorStruct, float const , float const , float, cudnnTensorStruct, cudnnLRNStruct)
5.53% 5.26066s 22764 231.10us 27.296us 444.83us void calc_bias_diff<int=2, float, float, int=128, int=0>(cudnnTensorStruct, float const , cudnnTensorStruct, float, float, float, int)
2.14% 2.03573s 7588 268.28us 266.91us 298.85us void caffe::MaxPoolBackward(int, float const , int const , float const , int, int, int, int, int, int, int, int, int, int, int, int, caffe::MaxPoolBackward)
0.68% 646.09ms 15178 42.567us 30.272us 60.128us void cudnn::detail::pooling_bw_kernel_avg<float, float, cudnn::detail::averpooling_func, int=1>(cudnnTensorStruct, float const , float const , cudnn::detail::pooling_bw_kernel_avg<float, float, cudnn::detail::averpooling_func, int=1>, float const , cudnn::detail::pooling_bw_kernel_avg<float, float, cudnn::detail::averpooling_func, int=1>, cudnnTensorStruct, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
0.53% 499.49ms 7990 62.514us 61.376us 70.240us void caffe::MaxPoolForward(int, float const , int, int, int, int, int, int, int, int, int, int, int, int, caffe::MaxPoolForward, int
, float const )
0.45% 428.62ms 7990 53.644us 52.448us 56.192us void add_tensor_kernel_v3<int=2, float, float, int=64, int=1, int=2, int=4, int=2>(cudnnTensorStruct, float
, cudnnTensorStruct, float const , float, float)
0.45% 424.04ms 15176 27.941us 11.808us 128.22us void setTensor4d_kernel<float, float, int=16, int=16>(cudnnTensor4dStruct, float
, float)
0.42% 397.82ms 22765 17.474us 7.7120us 25.088us void cudnn::detail::activation_bw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>(cudnnTensorStruct, float const , float const , cudnn::detail::activation_bw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>, float const , cudnnTensorStruct, float, cudnnTensorStruct
, int, double)
0.39% 366.97ms 32068 11.443us 543ns 49.920us [CUDA memcpy HtoD]
0.38% 362.24ms 7990 45.336us 43.744us 51.520us void gemm_kernel1x1_core<float, bool=0, bool=0, bool=0, bool=1, bool=0>(float
, float const , float const , int, int, int, int, int, int, float, float, float, float, int)
0.28% 269.51ms 15980 16.865us 12.064us 23.872us void cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::averpooling_func, int=1>(cudnnTensorStruct, float const , cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::averpooling_func, int=1>, cudnnTensorStruct, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
0.23% 223.45ms 23970 9.3220us 4.8640us 16.672us void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>(cudnnTensorStruct, float const , cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>, cudnnTensorStruct, float, cudnnTensorStruct
, int, double)
0.21% 204.43ms 15980 12.792us 10.879us 17.408us void add_tensor_kernel_v3<int=2, float, float, int=32, int=1, int=4, int=2, int=2>(cudnnTensorStruct, float
, cudnnTensorStruct, float const , float, float)
0.14% 132.39ms 7590 17.442us 16.704us 18.848us maxwell_sgemm_128x64_raggedMn_nn
0.14% 132.38ms 22764 5.8150us 2.8160us 9.4400us void cudnn::detail::scale_filter_kernel<int=16, int=16>(cudnnFilter4dStruct, float
, float)
0.13% 126.93ms 7590 16.722us 16.288us 21.439us gen_kmul4_sgemmNT2_core(float const , int, float const , int, float, int, int, int, int, float const , float const , float, float, int)
0.09% 83.902ms 7589 11.055us 10.720us 12.160us [CUDA memcpy PtoP]
0.09% 83.588ms 56910 1.4680us 1.0870us 3.0720us void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>)
0.08% 78.356ms 46734 1.6760us 1.5360us 3.3920us caffe::sync_conv_groups(void)
0.08% 72.626ms 23970 3.0290us 2.4960us 5.6640us void kern_precompute_indices<bool=0>(int
, int, int, int, int, int, int)
0.07% 68.371ms 23570 2.9000us 1.4720us 6.7520us [CUDA memcpy DtoD]
0.05% 48.440ms 7590 6.3820us 5.6640us 7.1360us void gemv2N_kernel_val<float, float, float, int=128, int=32, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
0.05% 47.178ms 30352 1.5540us 1.1200us 2.2720us void caffe::SGDUpdate(int, float
, float, caffe::SGDUpdate, caffe::SGDUpdate)
0.05% 43.337ms 15980 2.7110us 2.1750us 3.6160us void asum_kernel<float, float, int=0>(cublasAsumParams<float, float>)
0.04% 42.392ms 7990 5.3050us 4.9600us 6.0160us void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const , cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct, int, float, cudnnTensorStruct
, int, int)
0.04% 41.438ms 15980 2.5930us 2.0160us 3.6800us void dot_kernel<float, float, float, int=128, int=0, int=0>(cublasDotParams<float, float>)
0.04% 36.933ms 68376 540ns 511ns 6.3680us [CUDA memset]
0.04% 33.534ms 15980 2.0980us 1.4400us 3.1360us void reduce_1Block_kernel<float, float, float, int=128, int=7>(float
, int, float*)
0.03% 30.348ms 7990 3.7980us 3.2000us 4.6080us void caffe::SoftmaxLossForwardGPU(int, float const , float const , caffe::SoftmaxLossForwardGPU, int, int, int, bool, int, float const *)
0.03% 28.461ms 7590 3.7490us 3.4880us 4.2880us void caffe::SoftmaxLossBackwardGPU(int, float const , float const , caffe::SoftmaxLossBackwardGPU, int, int, int, bool, int, float const *)
0.03% 23.908ms 7990 2.9920us 2.6560us 3.6160us void gemmk1_kernel<float, int=256, int=5, bool=0, bool=0, bool=0, bool=0>(cublasGemmk1Params, float const *, float const , float)
0.02% 22.398ms 11384 1.9670us 1.4720us 2.7520us void scal_kernel_val<float, float, int=0>(cublasScalParamsVal<float, float>)
0.02% 14.462ms 24796 583ns 543ns 42.912us [CUDA memcpy DtoH]
0.01% 11.037ms 3794 2.9090us 2.2720us 3.7760us void caffe::add_kernel(int, float const , float const , caffe::add_kernel)
==119795== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 11

This is probably asking too much of a profiler.

I can get the profiler to break on way fewer than 9000 iterations of some codes.

I suggest you think about a way of reducing the profiling scope. It should not be necessary to run 9000 iterations of anything to get a good handle on the profiling data you want.

Hi,

I observed a similar error when profiling caffe with nvprof. And this happens as early as the 2nd iteration.
However, when running the without nvprof it completes without any error.

Here is the command I used:

/usr/local/cuda-8.0/bin/nvprof -o caffe_10Iter.nvprof caffe train …

The output where the error occurs :


I0223 19:16:51.121872 33665 solver.cpp:244] Train net output #0: loss = 6.90776 (* 1 = 6.90776 loss)
I0223 19:16:51.146852 33665 sgd_solver.cpp:106] Iteration 2, lr = 0.001
*** Aborted at 1487895411 (unix time) try “date -d @1487895411” if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 33665 (TID 0x3fff9e933b40) from PID 0; stack trace: ***
    @     0x3fffb48604d8 ([vdso]+0x4d7)
    @     0x3ffd1d42b838 (unknown)
    @     0x3ffd1d3ec61c (unknown)
    @     0x3ffd1d3e70a8 (unknown)
    @     0x3ffd1dc6f3ec (unknown)
    @     0x3ffd1daacfd8 (unknown)
    @     0x3ffd1db445ec (unknown)
    @     0x3ffd1dc0a458 (unknown)
    @     0x3ffd1db9cfe8 (unknown)
    @     0x3ffd1dab57a4 (unknown)
    @     0x3ffd1dab6844 (unknown)
    @     0x3ffd1d9abe5c (unknown)
    @     0x3ffd1d9abf9c (unknown)
@ 0x3ffd1d9abe5c (unknown)
@ 0x3ffd1d9abf9c (unknown)
@ 0x3ffd1db3dad0 cuMemcpy
@ 0x3fffb3dfc578 (unknown)
@ 0x3fffb3dd1870 (unknown)
@ 0x3fffb3e17658 cudaMemcpy
@ 0x3fffb407ace0 caffe::caffe_copy<>()
@ 0x3fffb422d430 caffe::BatchNormLayer<>::Forward_gpu()
@ 0x3fffb4030d4c caffe::Net<>::ForwardFromTo()
@ 0x3fffb4030ea8 caffe::Net<>::Forward()
@ 0x3fffb40a2b34 caffe::Solver<>::Step()
@ 0x3fffb40a35d0 caffe::Solver<>::Solve()
@ 0x3fffb4025860 caffe::P2PSync<>::Run()
@ 0x1000aea0 (unknown)
@ 0x100075c4 (unknown)
@ 0x3fffb343309c (unknown)
@ 0x3fffb3433298 __libc_start_main
@ 0x0 (unknown)
==33665== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 11

In what sense? If the SIGSEGV is caused by the profiler (which is not at all clear from the data presented so far!), that would be a bug that should be reported. Triggering a segfault is never an acceptable response by a tool, no matter what input it was given. If the tool is running out of dynamically allocated memory, for example, it should print an appropriate error message and stop.

I’m having quite a similar error.
The program runs perfectly without nvprof, but when I try to make the profile I get this error:

iSCC Total electronic Diff electronic SCC error
1 -0.17076504E+05 0.00000000E+00 0.77261213E+00
2 -0.17078820E+05 -0.23164648E+01 0.61310061E+00
3 -0.17081041E+05 -0.22213999E+01 0.10049590E+00
4 -0.17081044E+05 -0.27207681E-02 0.69814173E-01

[b]Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7fa4aba4eccd in ???
#1 0x7fa4aba4df2d in ???
#2 0x7fa49e659eff in ???
#3 0x7fa4b1260f55 in ???
#4 0x7fa4b125fffe in ???
#5 0x7fa4b126025b in ???
#6 0x7fa4b1177fb2 in ???
#7 0x7fa4b11780ee in ???
#8 0x7fa4abd6af4f in ???
#9 0x7fa1b782e606 in ???
==8043== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
==8043== Generated result file: /home/ccevallos/17/_build/prog/dftb+/datanew4096.nvprof
======== Error: Application received signal 11[/b]

I have no idea why the profiler is causing this segmentation fault.

As njuffa has already pointed out, this should be considered a bug in the profiler.

You should gather together all necessary data so that an independent investigator could reproduce your exact case, and file that information in a bug report with NVIDIA at developer.nvidia.com

After you have done all that, and while you are waiting for a fix from NVIDIA, if you still want to pursue the work you are trying to accomplish, you might want to see if you can somehow reduce the profiling scope to avoid running into this particular profiler defect.

Which flags do I have to put to JUST get the CPU profile? I know I must put “cpu-profiling on” flag, but I still get the GPU one.
And on the other hand, when using nvvp I cannot generate the PDF report. How can I do that?

I don’t think its possible to get only CPU profile information. If you want that, you might want to explore another profiler such as gprof. You might be able to minimize the amount of GPU information printed using --print-api-summary

To generate the PDF report in the guided analysis section, it is necessary to move the guided analysis forward to the point where you have requested an individual kernel analysis.

So after starting a new session, when the timeline is displayed, go to the Analysis tab and:

  1. select “Examine Individual Kernels”
  2. In the results tab, highlight one of the kernels by selecting it
  3. select “Perform Kernel Analysis”
    The “Export PDF Report” button is now selectable to generate the report. It will be specific to the kernel you just selected for analysis in step 2.

Thanks!

And a last question. Which flags would you use to reduce the GPU profiling scope? To avoid getting the segmentation fault.

You can search the profiler doc for the word “scope” and you will find some information such as here:

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#profiling-scope

For example there are switches to identify a specific kernel to profile by name.

However reducing scope might be easiest and best done with modifications to your program.

Does your program call the same CUDA activity in a loop?

reduce the number of loops.

Does your program have kernels that process a large data set?

reduce the size of your data set.

etc.

I wouldn’t be able to give you an exact recipe to reduce the scope so as to avoid the seg fault in your specific case. Some trial and error may be involved. And its possible that the seg fault has nothing to do with profiling scope. However I have run into problems like that when the profiling scope is large.