Application returned non-zero code during profiling with nvprof

Hello, guys!

I have a problem profiling with nvprof. I’ve installed CUDA Toolkit 11.5.1 and successfully built the “vectorAdd” sample from the preinstalled samples:

Build started...
1>------ Build started: Project: vectorAdd, Configuration: Debug x64 ------
1>Build started 07.12.2021 15:05:07.
1>Target InitializeBuildStatus:
1>  Touching "x64/Debug/vectorAdd.tlog\unsuccessfulbuild".
1>Target AddCudaCompileDeps:
1>  Skipping target "AddCudaCompileDeps" because all output files are up-to-date with respect to the input files.
1>Target WriteCudaCompileTlogs:
1>  Skipping target "WriteCudaCompileTlogs" because all output files are up-to-date with respect to the input files.
1>Target CudaBuild:
1>  Target CudaBuildCore:
1>    Compiling CUDA source file vectorAdd.cu...
1>
1>    D:\Tools\CUDA\samples\v11.5\0_Simple\vectorAdd>"D:\Tools\CUDA\v11.5\bin\nvcc.exe" -gencode=arch=compute_35,code=\"sm_35,compute_35\" -gencode=arch=compute_37,code=\"sm_37,compute_37\" -gencode=arch=compute_50,code=\"sm_50,compute_50\" -gencode=arch=compute_52,code=\"sm_52,compute_52\" -gencode=arch=compute_60,code=\"sm_60,compute_60\" -gencode=arch=compute_61,code=\"sm_61,compute_61\" -gencode=arch=compute_70,code=\"sm_70,compute_70\" -gencode=arch=compute_75,code=\"sm_75,compute_75\" -gencode=arch=compute_80,code=\"sm_80,compute_80\" -gencode=arch=compute_86,code=\"sm_86,compute_86\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu   -I./ -I../../common/inc -I./ -ID:\Tools\CUDA\v11.5\/include -I../../common/inc -ID:\Tools\CUDA\v11.5\include  -G   --keep-dir x64\Debug  -maxrregcount=0  --machine 64 --compile -cudart static --threads 0 -g  -DWIN32 -DWIN32 -D_MBCS -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64/Debug/vc142.pdb /FS /Zi /RTC1 /MTd " -o x64/Debug/vectorAdd.cu.obj "D:\Tools\CUDA\samples\v11.5\0_Simple\vectorAdd\vectorAdd.cu"
1>    CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    vectorAdd.cu
1>    tmpxft_0000360c_00000000-7_vectorAdd.compute_86.cudafe1.cpp
1>  Done building target "CudaBuildCore" in project "vectorAdd_vs2019.vcxproj".
1>
1>  Done building project "vectorAdd_vs2019.vcxproj".
1>Target Link:
1>  vectorAdd_vs2019.vcxproj -> D:\Tools\CUDA\samples\v11.5\bin\win64\Debug\vectorAdd.exe
1>Target FinalizeBuildStatus:
1>  Deleting file "x64/Debug/vectorAdd.tlog\unsuccessfulbuild".
1>  Touching "x64/Debug/vectorAdd.tlog\vectorAdd.lastbuildstate".
1>
1>Build succeeded.
1>
1>CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
1>    1 Warning(s)
1>    0 Error(s)
1>
1>Time Elapsed 00:00:09.70
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

The application runs successfully from the console:

D:\Tools\CUDA\samples\v11.5\bin\win64\Debug>vectorAdd.exe
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

But when I start profiling with nvprof, I get this:

D:\Tools\CUDA\samples\v11.5\bin\win64\Debug>nvprof --metrics all vectorAdd.exe
[Vector addition of 50000 elements]
==18508== NVPROF is profiling process 18508, command: vectorAdd.exe
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==18508== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==18508== Replaying kernel "vectorAdd(float const *, float const *, float*, int)" (54 of 54)...
        1 internal events
==18508== Profiling application: vectorAdd.exe
==18508== Profiling result:
No events/metrics were profiled.
======== Error: Application returned non-zero code -1073741676

I don’t understand why the application returns a non-zero code. How can I resolve this problem? Please help.

P.S.:
My GPU: NVIDIA GeForce GTX 1050
All tests (bandwidthTest and deviceQuery from the samples) pass.

And:

D:\Tools\CUDA\samples\v11.5\bin\win64\Debug>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:52:33_Pacific_Standard_Time_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

D:\Tools\CUDA\samples\v11.5\bin\win64\Debug>nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2021 NVIDIA Corporation
Release version 11.5.114 (21)

D:\Tools\CUDA\samples\v11.5\bin\win64\Debug>nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1050 (UUID: GPU-21a54983-7230-f355-5c03-1f4785f8b6e8)

D:\Tools\CUDA\samples\v11.5\bin\win64\Debug>nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Tue Dec  7 15:28:07 2021
Driver Version                            : 497.09
CUDA Version                              : 11.5

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce GTX 1050
    Product Brand                         : GeForce
    Product Architecture                  : Pascal
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : N/A
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : WDDM
        Pending                           : WDDM
    Serial Number                         : N/A
    GPU UUID                              : GPU-21a54983-7230-f355-5c03-1f4785f8b6e8
    Minor Number                          : N/A
    VBIOS Version                         : 86.07.93.00.1f
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1C9110DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x86D4103C
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 7000 KB/s
        Rx Throughput                     : 121000 KB/s
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 3072 MiB
        Used                              : 80 MiB
        Free                              : 2992 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 48 C
        GPU Shutdown Temp                 : 102 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 94 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : N/A
        Power Draw                        : N/A
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1 MHz
        SM                                : 1 MHz
        Memory                            : 405 MHz
        Video                             : 544 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1911 MHz
        SM                                : 1911 MHz
        Memory                            : 3504 MHz
        Video                             : 1708 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 13540
            Type                          : C+G
            Name                          : D:\Games\Epic Games\Launcher\Engine\Binaries\Win64\EpicWebHelper.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 14868
            Type                          : C+G
            Name                          : D:\Games\Epic Games\Launcher\Portal\Binaries\Win64\EpicGamesLauncher.exe
            Used GPU Memory               : Not available in WDDM driver model

Hi,

Thanks for reporting the issue. Unfortunately, we are not able to reproduce it locally using the same CUDA Toolkit version (11.5.1) and driver version (497.09) on a GeForce GTX 1050 Ti card.

To isolate the issue, can you please try profiling other CUDA samples?
Sample command:

nvprof --metrics all bandwidthTest.exe

Also, please check whether profiling a single event or metric makes a difference.
Sample commands:

nvprof -e active_cycles <app>
nvprof -m achieved_occupancy <app>
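
If trying events and metrics one at a time by hand gets tedious, the single-item runs above could be scripted. The following is only a rough sketch: the helper names `build_commands` and `run_all` are made up for illustration, the event and metric names are the two from the sample commands, and it assumes `nvprof` is on the PATH. Whether nvprof's own return code mirrors the application's is not something this sketch relies on, so it also looks for the "Error:" text in the captured output.

```python
import shutil
import subprocess


def build_commands(app, events=(), metrics=()):
    """Build one nvprof command line per event (-e) and per metric (-m)."""
    cmds = [["nvprof", "-e", event, app] for event in events]
    cmds += [["nvprof", "-m", metric, app] for metric in metrics]
    return cmds


def run_all(cmds):
    """Run each command; record its return code and whether 'Error:' was printed."""
    results = []
    for cmd in cmds:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        output = proc.stdout + proc.stderr
        results.append((" ".join(cmd), proc.returncode, "Error:" in output))
    return results


if __name__ == "__main__":
    # App name, event and metric are taken from this thread; extend the lists as needed.
    commands = build_commands(
        "vectorAdd.exe",
        events=["active_cycles"],
        metrics=["achieved_occupancy"],
    )
    if shutil.which("nvprof") is None:
        # nvprof not found: just show what would have been run.
        for cmd in commands:
            print(" ".join(cmd))
    else:
        for line, code, saw_error in run_all(commands):
            print(f"{line}: return code {code}, error text seen: {saw_error}")
```

A run that fails for only some events or metrics would help narrow down which collection triggers the crash.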

Another option is to check whether the issue also occurs with an older CUDA Toolkit, say CUDA 11.4.
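
As a side note on the error itself: the exit code in the nvprof output is printed as a signed 32-bit number, while Windows status codes are normally written as unsigned hex. A minimal sketch for converting it is below. The function names and the small lookup table are illustrative only, and the table lists just a few well-known NTSTATUS values.

```python
# A few well-known NTSTATUS values (a small subset, for illustration only).
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code):
    """Format the exit code as hex, with a name if it is in the table above."""
    value = to_unsigned32(exit_code)
    name = KNOWN_STATUS.get(value, "unknown")
    return f"0x{value:08X} ({name})"


# Exit code reported by nvprof in this thread.
print(describe(-1073741676))  # prints "0x0xC0000094"-style hex plus a name
```

For -1073741676 this gives 0xC0000094, which is STATUS_INTEGER_DIVIDE_BY_ZERO. That only says how the process terminated under the profiler, not whether the fault lies in the sample, in nvprof, or in the driver.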