nvprof makes cudaMemset fail

Hi

I’m trying to profile my GPU application with nvprof.
I use L4T 32.2 and CUDA 10.0.
My application works without nvprof, but it fails when run under nvprof.
My command is:

nvprof -fo prof.nvp ./my_application

I call cudaDeviceSynchronize() after each CUDA call and check the returned error.
With nvprof, the error is reported after cudaMemset.

My code looks like this:

cudaMalloc((void**)&p, buff_size);
err = cudaDeviceSynchronize();
checkError(err);

cudaMemset((void*)p, 0, buff_size);
err = cudaDeviceSynchronize();
checkError(err); // error occurred

The error code and message are:

cudaErrorLaunchFailure
unspecified launch failure

Does anybody have an idea?
Can nvprof affect the GPU’s behaviour?

Here’s the dmesg output.

[ 6784.212525] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:121  [ERR]  ringmaster intr status0: 0x00000100,status1: 0x00000001
[ 6784.212753] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:149  [ERR]  SYS write error. ADR 0x00408044 WRDAT 0x000007fe INFO 0x1d408213 (subid 0x0000001d priv level 0), CODE 0xbadf1301
[ 6784.213015] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout
[ 6784.213175] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:175  [ERR]  GPC0 write error. ADR 0x00418cf4 WRDAT 0xfffffffe INFO 0x1e40822c (subid 0x0000001e priv level 0), CODE 0xbadf1201
[ 6784.213436] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout
[ 6784.221247] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:121  [ERR]  ringmaster intr status0: 0x00000100,status1: 0x00000001
[ 6784.221465] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:149  [ERR]  SYS write error. ADR 0x00408044 WRDAT 0x000007fe INFO 0x19408213 (subid 0x00000019 priv level 0), CODE 0xbadf1301
[ 6784.221764] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout
[ 6784.223331] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:175  [ERR]  GPC0 write error. ADR 0x00419a44 WRDAT 0x0000000e INFO 0x1940820d (subid 0x00000019 priv level 0), CODE 0xbadf1201
[ 6784.240976] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout
[ 6784.495330] nvgpu: 17000000.gv11b        gk20a_gr_handle_fecs_error:5273 [INFO]  ctxsw intr0 set by ucode, timestamp buffer full
[ 6785.217598] nvgpu: 17000000.gv11b      gr_gk20a_handle_sm_exception:5692 [ERR]  could not pre-process sm error!
[ 6785.217842] nvgpu: 17000000.gv11b                      gk20a_gr_isr:6195 [ERR]  set gr exception notifier
[ 6785.217996] nvgpu: 17000000.gv11b   nvgpu_set_error_notifier_locked:135  [ERR]  error notifier set to 13 for ch 505
[ 6785.218162] ---- mlocks ----

[ 6785.218216] ---- syncpts ----
[ 6785.218223] id 2 (disp_a) min 13635 max 13635 refs 1 (previous client : )
[ 6785.218226] id 3 (disp_b) min 3 max 3 refs 1 (previous client : )
[ 6785.218233] id 8 (vblank0) min 405545 max -6 refs 1 (previous client : )
[ 6785.218243] id 20 (gv11b_511) min 665986 max 665986 refs 1 (previous client : )
[ 6785.218266] id 21 (gv11b_510) min 28 max 28 refs 1 (previous client : )
[ 6785.218270] id 22 (gv11b_509) min 35254 max 35254 refs 1 (previous client : gv11b_509)
[ 6785.218274] id 23 (progress) min 175 max 0 refs 1 (previous client : progress)
[ 6785.218281] id 29 (gv11b_505_user) min 5511198 max 5511198 refs 1 (previous client : gv11b_508_user)
[ 6785.218285] id 31 (gv11b_506_user) min 5509036 max 5508751 refs 1 (previous client : gv11b_507_user)
[ 6785.218290] id 33 (gv11b_507_user) min 5517549 max 5517549 refs 1 (previous client : gv11b_506_user)
[ 6785.218294] id 35 (gv11b_508_user) min 5442658 max 5442658 refs 1 (previous client : gv11b_505_user)
[ 6785.218299] id 37 (gv11b_504_user) min 4587520 max 4587520 refs 1 (previous client : gv11b_504_user)

[ 6785.218673] ---- channels ----
[ 6785.218707] 
               channel 2 - 15820000.se

[ 6785.218710] NvHost basic channel registers:
[ 6785.218716] CMDFIFO_STAT_0:  00002040
[ 6785.218720] CMDFIFO_RDATA_0: 0a28cc02
[ 6785.218725] CMDP_OFFSET_0:   00000000
[ 6785.218729] CMDP_CLASS_0:    00000000
[ 6785.218733] CHANNELSTAT_0:   00000000
[ 6785.218736] The CDMA sync queue is empty.

[ 6785.218742] 
               channel 3 - 15830000.se

[ 6785.218745] NvHost basic channel registers:
[ 6785.218749] CMDFIFO_STAT_0:  00002040
[ 6785.218752] CMDFIFO_RDATA_0: 43806a4c
[ 6785.218757] CMDP_OFFSET_0:   00000000
[ 6785.218760] CMDP_CLASS_0:    00000000
[ 6785.218763] CHANNELSTAT_0:   00000000
[ 6785.218767] The CDMA sync queue is empty.

[ 6785.218773] 
               channel 4 - 15840000.se

[ 6785.218775] NvHost basic channel registers:
[ 6785.218779] CMDFIFO_STAT_0:  00002040
[ 6785.218800] CMDFIFO_RDATA_0: 58068089
[ 6785.218804] CMDP_OFFSET_0:   00000000
[ 6785.218807] CMDP_CLASS_0:    00000000
[ 6785.218811] CHANNELSTAT_0:   00000000
[ 6785.218813] The CDMA sync queue is empty.

[ 6785.218821] 
               ---- host general irq ----

[ 6785.218825] sync_intc0mask = 0x00000001
[ 6785.218829] sync_intmask = 0x50000003
[ 6785.218831] 
               ---- host syncpt irq mask ----

[ 6785.218834] 
               ---- host syncpt irq status ----

[ 6785.218838] syncpt_thresh_cpu0_int_status(0) = 0x00000000
[ 6785.218842] syncpt_thresh_cpu0_int_status(1) = 0x00000000
[ 6785.218845] syncpt_thresh_cpu0_int_status(2) = 0x00000000
[ 6785.218849] syncpt_thresh_cpu0_int_status(3) = 0x00000000
[ 6785.218852] syncpt_thresh_cpu0_int_status(4) = 0x00000000
[ 6785.218869] syncpt_thresh_cpu0_int_status(5) = 0x00000000
[ 6785.218873] syncpt_thresh_cpu0_int_status(6) = 0x00000000
[ 6785.218876] syncpt_thresh_cpu0_int_status(7) = 0x00000000
[ 6785.218880] syncpt_thresh_cpu0_int_status(8) = 0x00000000
[ 6785.218883] syncpt_thresh_cpu0_int_status(9) = 0x00000000
[ 6785.218887] syncpt_thresh_cpu0_int_status(10) = 0x00000000
[ 6785.218890] syncpt_thresh_cpu0_int_status(11) = 0x00000000
[ 6785.218894] syncpt_thresh_cpu0_int_status(12) = 0x00000000
[ 6785.218897] syncpt_thresh_cpu0_int_status(13) = 0x00000000
[ 6785.218900] syncpt_thresh_cpu0_int_status(14) = 0x00000000
[ 6785.218904] syncpt_thresh_cpu0_int_status(15) = 0x00000000
[ 6785.218907] syncpt_thresh_cpu0_int_status(16) = 0x00000000
[ 6785.218911] syncpt_thresh_cpu0_int_status(17) = 0x00000000
[ 6785.218914] syncpt_thresh_cpu0_int_status(18) = 0x00000000
[ 6785.218917] syncpt_thresh_cpu0_int_status(19) = 0x00000000
[ 6785.218921] syncpt_thresh_cpu0_int_status(20) = 0x00000000
[ 6785.218924] syncpt_thresh_cpu0_int_status(21) = 0x00000000
[ 6785.218930] gv11b pbdma 0: 
[ 6785.218936] id: 2 (tsg), next_id: 2 (tsg) chan status: valid
[ 6785.218950] PBDMA_PUT: 0000001f004155a0 PBDMA_GET: 0000001f004155a0 GP_PUT: 00000f0a GP_GET: 00000f0a FETCH: 00000f0a HEADER: 60400000
               HDR: 00000000 SHADOW0: 00415578 SHADOW1: 0000281f
[ 6785.218954] gv11b pbdma 1: 
[ 6785.218957] id: 3 (tsg), next_id: 3 (tsg) chan status: valid
[ 6785.218970] PBDMA_PUT: 00000002008402e8 PBDMA_GET: 00000002007447b4 GP_PUT: 00000002 GP_GET: 0000083f FETCH: 00000127 HEADER: 20111b08
               HDR: 20022060 SHADOW0: 00744784 SHADOW1: 00034e02
[ 6785.218974] gv11b pbdma 2: 
[ 6785.218977] id: 4 (tsg), next_id: 4 (tsg) chan status: valid
[ 6785.218989] PBDMA_PUT: 0000000200a6e2c4 PBDMA_GET: 0000000200a6e208 GP_PUT: 000005b2 GP_GET: 000005b2 FETCH: 000005b2 HEADER: 20100068
               HDR: 20050017 SHADOW0: 00a6e1d8 SHADOW1: 0000ee02

[ 6785.218997] gv11b eng 0: 
[ 6785.219000] id: 3 (tsg), next_id: 3 (tsg), ctx status: valid 
[ 6785.219003] busy 

[ 6785.219009] gv11b eng 1: 
[ 6785.219012] id: 4 (tsg), next_id: 4 (tsg), ctx status: valid 

[ 6785.219031] gv11b eng 2: 
[ 6785.219034] id: 0 (tsg), next_id: 0 (tsg), ctx status: invalid 

[ 6785.219040] gv11b eng 3: 
[ 6785.219043] id: 3 (tsg), next_id: 3 (tsg), ctx status: invalid 


[ 6785.219112] 504-gv11b, pid 18664, refs: 2: 
[ 6785.219116] channel status:  in use on_pbdma busy
[ 6785.219121] RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
               HEADER: 20400000 COUNT: 00000000
               SEMAPHORE: addr hi: 00000000 addr lo: 00000000
               payload 00000000 execute 00000000

[ 6785.219129] 505-gv11b, pid 18664, refs: 2: 
[ 6785.219132] channel status:  in use pending busy
[ 6785.219137] RAMFC : TOP: 0000000000000000 PUT: 00000002006a3798 GET: 00000002006a3150 FETCH: 00000202006a3798
               HEADER: 20111b08 COUNT: 05550002
               SEMAPHORE: addr hi: 00000002 addr lo: 00027dc4
               payload 00000006 execute 00080003

[ 6785.219143] 506-gv11b, pid 18664, refs: 4: 
[ 6785.219146] channel status:  in use pending busy
[ 6785.219150] RAMFC : TOP: 0000000000000000 PUT: 0000000200744ad0 GET: 00000002007447b4 FETCH: 0000020200744ad0
               HEADER: 20110180 COUNT: 04550002
               SEMAPHORE: addr hi: 00000002 addr lo: 00027dec
               payload 00000006 execute 00080003

[ 6785.219157] 507-gv11b, pid 18664, refs: 2: 
[ 6785.219160] channel status:  in use on_pbdma busy
[ 6785.219164] RAMFC : TOP: 0000000000000000 PUT: 00000002008402e8 GET: 00000002008402d4 FETCH: 00000202008402e8
               HEADER: 20111b08 COUNT: 05550002
               SEMAPHORE: addr hi: 00000002 addr lo: 0001ffb0
               payload 000005b2 execute 00081003

[ 6785.219171] 508-gv11b, pid 18664, refs: 2: 
[ 6785.219173] channel status:  in use idle not busy
[ 6785.219178] RAMFC : TOP: 0000000000000000 PUT: 00000002009402a4 GET: 00000002009402a4 FETCH: 00000202009402a4
               HEADER: 60400000 COUNT: 84000000
               SEMAPHORE: addr hi: 00000000 addr lo: 00000000
               payload 00000000 execute 00000000

[ 6785.219184] 509-gv11b, pid 8556, refs: 2: 
[ 6785.219187] channel status:  in use idle not busy
[ 6785.219192] RAMFC : TOP: 8000001ef754855c PUT: 0000001ef7548560 GET: 0000001ef754855c FETCH: 00000c1ef7548560
               HEADER: 20060028 COUNT: 01110004
               SEMAPHORE: addr hi: 0000001f addr lo: 00002000
               payload 00003543 execute 00081002

[ 6785.219198] 510-gv11b, pid 7139, refs: 2: 
[ 6785.219201] channel status:  in use idle not busy
[ 6785.219205] RAMFC : TOP: 8000001f00414460 PUT: 0000001f00414460 GET: 0000001f00414460 FETCH: 0000001f00414460
               HEADER: 60400000 COUNT: 80000000
               SEMAPHORE: addr hi: 0000001f addr lo: 00407000
               payload 00000000 execute 00000001

[ 6785.219226] 511-gv11b, pid 7139, refs: 2: 
[ 6785.219229] channel status:  in use idle not busy
[ 6785.219233] RAMFC : TOP: 8000001f004094f0 PUT: 0000001f004094f0 GET: 0000001f004094f0 FETCH: 0000001f004094f0
               HEADER: 60400000 COUNT: 80000000
               SEMAPHORE: addr hi: 0000001f addr lo: 00404000
               payload 00000000 execute 00100001

[ 6785.231475] nvgpu: 17000000.gv11b         gv11b_fb_print_fault_info:680  [ERR]  [MMU FAULT] mmu engine id:  65, ch id:  506, fault addr: 0x0, fault addr aperture: 0, fault type: invalid pde, access type: virt write, 
[ 6785.231797] nvgpu: 17000000.gv11b         gv11b_fb_print_fault_info:689  [ERR]  [MMU FAULT] protected mode: 0, client type: gpc, client id:  t1 6, gpc id if client type is gpc: 0, 
[ 6785.358400] nvgpu: 17000000.gv11b gk20a_fifo_tsg_unbind_channel_verify_status:2200 [ERR]  Channel 504 to be removed from TSG 4 has NEXT set!
[ 6785.358634] nvgpu: 17000000.gv11b          gk20a_tsg_unbind_channel:164  [ERR]  Channel 504 unbind failed, tearing down TSG 4

Hi,

I also tried profiling with Nsight Systems, but it failed too.
I chose “Attach by PID” mode on the host machine and executed the following command on the Jetson:

sudo sh -c "LD_PRELOAD=/opt/nvidia/nsight_systems/libToolsInjectionProxy64.so QUADD_INJECTION_PROXY=CUDA ./my_application"

The same behaviour was reproduced: cudaMemset or cudaMemcpy fails.

Hi,

The CUDA profiler needs root privileges.
Could you try running the command like this:

sudo /usr/local/cuda-10.0/bin/nvprof -fo prof.nvp ./my_application

Thanks.

Hi,

Thank you for your suggestion.
Sorry, I forgot to include ‘sudo’ in the first post.
I had already tried with ‘sudo’, and it failed as well.

Hi,

Thanks for your feedback.
Do you get the same error when running as root?

Thanks.

Hi,

Yes, the same error appeared.

Hi,

We tried profiling a similar CUDA sample, matrixMul (which uses cudaMalloc and cudaDeviceSynchronize), and the profiler worked correctly.

nvidia@nvidia-desktop:~/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul$ sudo /usr/local/cuda-10.0/bin/nvprof -fo prof.nvp ./matrixMul
[sudo] password for nvidia: 
[Matrix Multiply Using CUDA] - Starting...
==1903== NVPROF is profiling process 1903, command: ./matrixMul
==1903== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 21.21 GFlop/s, Time= 6.180 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.
==1903== Generated result file: /home/nvidia/NVIDIA_CUDA-10.0_Samples/0_Simple/matrixMul/prof.nvp

Would you mind profiling the sample to see whether it also works in your environment?
If it does, please share the source of your application with us so that we can investigate further.

Thanks.

Hi,

I tried the sample CUDA program, and profiling succeeded.
So the cause is probably specific to my program.
Unfortunately, I cannot share my source code, so I’ll try to solve this myself.

Thank you anyway for your support.