My system has a V100 GPU with the following configuration:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
NVIDIA Nsight Systems version 2021.5.2.53-28d0e6e
sudo sh -c "echo 2 >/proc/sys/kernel/perf_event_paranoid"
/bin/bash: /proc/sys/kernel/perf_event_paranoid: Read-only file system
nsys status -e
Timestamp counter supported: No
Sampling Environment Check
Linux Kernel Paranoid Level = -1: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 5.0.0-1032-azure: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
Sampling Environment: OK
The trace-fork-before-exec switch is not entirely safe and does not guarantee traces. If you are using the Python multiprocessing module, could you switch to the spawn start method and avoid the trace-fork-before-exec switch?
from multiprocessing import set_start_method
# ...
if __name__ == '__main__':
    set_start_method('spawn')
    # ...
Also, several CUDA tracing bugs have been fixed in nsys since the 2021.5 release. Would it be possible for you to upgrade to the 2022.4 nsys release and try your collection again?
One more suggestion would be to simplify your nsys command line to help us narrow down the issue with CUDA tracing. Could you run this command to see if you get a clean CUDA trace?
nsys profile --capture-range=cudaProfilerApi --force-overwrite true -s none -t cuda -o moeBaseline python ...
Please note that I used NVIDIA Nsight Systems 2021.5.1 to view the shared profile, but I still cannot see the kernel profile data.
I ran the suggested simplified command after adding the profiler start and stop calls to the Python script. Here is the output of nsys stats: profileOut.txt (6.7 KB)
Here is the torch code I used to start the profiler:
d0 = torch.device("cuda")
with torch.cuda.device(d0):
    torch.cuda.profiler.cudart().cudaProfilerStart()
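For reference, when nsys runs with --capture-range=cudaProfilerApi, the region of interest is bracketed by a start call and a matching stop call. The following is only a sketch of one way to pair them, not the code from the actual script. The helper name cuda_profiler_range and the hooks parameter are hypothetical. It assumes PyTorch's torch.cuda.profiler.start() and torch.cuda.profiler.stop() wrappers, and it synchronizes before stopping so that queued kernels finish inside the range.

```python
from contextlib import contextmanager


def _torch_hooks():
    # Imported lazily so this helper can be imported on a machine without PyTorch.
    import torch
    return (torch.cuda.profiler.start,
            torch.cuda.profiler.stop,
            torch.cuda.synchronize)


@contextmanager
def cuda_profiler_range(hooks=None):
    """Bracket a code region with cudaProfilerStart / cudaProfilerStop.

    `hooks` is a hypothetical (start, stop, sync) tuple of callables; when it
    is omitted, the PyTorch wrappers are used.
    """
    start, stop, sync = hooks if hooks is not None else _torch_hooks()
    start()
    try:
        yield
    finally:
        sync()  # wait for queued kernels before closing the capture range
        stop()


# Intended use inside the training script (requires a CUDA device):
#
#     with cuda_profiler_range():
#         run_one_iteration()   # hypothetical workload function
```

The context manager form guarantees that the stop call runs even if the profiled region raises an exception.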
The perf_event_paranoid level does not matter in this case. It only affects CPU profiling operations.
Can you run the nvidia-smi command and post the results?
Can you also collect an injection log? To collect an injection log, run the following command:
/opt/nvidia/nsight-systems/2022.4.2/target-linux-x64/nsys profile --force-overwrite true -s none -t cuda -e NVLOG_CONFIG_FILE=/opt/nvidia/nsight-systems/2022.4.2/host-linux-x64/nvlog.config.template vectorAdd
The injection log will be named nsys-ui.log and will be found in your working directory. Please share the injection log with us.
Can you do another experiment that can help us narrow down the issue?
Please follow the README.md instructions to build and use a cuda injection library in the attached cuda-injection-library-linux.tar.gz file. When you run your application following the Use instructions, please capture an INJECTION_LOG_FILE file and upload it to this Forum discussion. cuda-injection-library-linux.tar.gz (5.9 KB)
I found CUPTI in /usr/local/cuda/lib64, added it to LD_LIBRARY_PATH, and the library compiled without errors.
This is how I run vectorAdd now:
CUDA_INJECTION64_PATH=/home/hossamamer/young/cuda-injection-library-linux INJECTION_LOG_FILE=/home/hossamamer/young/cuda-samples/Samples/0_Introduction/vectorAdd/log.txt ./test
It does not seem to produce the intended log file. Is this correct?
I am not sure if this is relevant, but when I now run make for vectorAdd, this is what I get:
The CUDA_INJECTION64_PATH environment variable should point to the library file itself, not to the directory that contains it:
CUDA_INJECTION64_PATH=/home/hossamamer/young/libToolsInjectionCuda.so
assuming you didn’t change the name of the resulting library created with the make command.
The vectorAdd application does not need to be recompiled to do this test.
The command would be:
CUDA_INJECTION64_PATH=/home/hossamamer/young/libToolsInjectionCuda.so INJECTION_LOG_FILE=/home/hossamamer/young/cuda-samples/Samples/0_Introduction/vectorAdd/log.txt ./test
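If it helps to script this test, here is a minimal sketch that sets both environment variables and launches the application from Python. The function names injection_env and run_with_injection are hypothetical. The only assumption beyond the instructions above is the up-front check that CUDA_INJECTION64_PATH refers to a regular file, which catches the case where a directory is passed by mistake.

```python
import os
import subprocess


def injection_env(library_path, log_path, base_env=None):
    """Return a copy of the environment with the two injection variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_INJECTION64_PATH"] = library_path  # must be the .so file itself
    env["INJECTION_LOG_FILE"] = log_path         # where the injection log is written
    return env


def run_with_injection(command, library_path, log_path):
    """Run `command` (a list of arguments) with the injection library configured."""
    if not os.path.isfile(library_path):
        # A directory here is the mistake this check is meant to catch.
        raise FileNotFoundError(
            f"CUDA_INJECTION64_PATH must name the library file: {library_path}"
        )
    env = injection_env(library_path, log_path)
    return subprocess.run(command, env=env, check=False)


# Example call, using the paths from this thread:
#
#     run_with_injection(
#         ["./test"],
#         "/home/hossamamer/young/libToolsInjectionCuda.so",
#         "/home/hossamamer/young/cuda-samples/Samples/0_Introduction/vectorAdd/log.txt",
#     )
```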
That was the command used:
CUDA_INJECTION64_PATH=/home/hossamamer/young/cuda-injection-library-linux/libToolsInjectionCuda.so INJECTION_LOG_FILE=/home/hossamamer/young/cuda-samples/Samples/0_Introduction/vectorAdd/log.txt ./test
That was the output:
00:24:23.586.954|19432|Lib.cpp:566[InitializeInjection]: Initializing CUDA tracing
00:24:23.601.866|19432|Lib.cpp:348[EnableCollection]: Starting collection
00:24:23.601.932|19432|Lib.cpp:580[InitializeInjection]: CUDA tracing initialized
00:24:23.803.869|19432|Lib.cpp:221[BufferRequested]: Buffer requested
00:24:23.809.004|19432|Lib.cpp:339[EnableUvmActivity]: Initialized UVM
00:24:24.214.633|19432|Lib.cpp:514[AtExitHandler]: Flushing CUPTI buffers on exit
00:24:24.217.414|19432|Lib.cpp:230[BufferCompleted]: Buffer completed
00:24:24.217.474|19432|Lib.cpp:116[ProcessActivityRecord]: Device record received
00:24:24.217.485|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetCount’
00:24:24.217.493|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGet’
00:24:24.217.500|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetName’
00:24:24.217.508|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceTotalMem_v2’
00:24:24.217.516|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.532|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.540|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.546|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.554|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.561|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.568|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.574|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.581|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.588|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.595|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.602|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.609|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.616|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.623|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.630|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.645|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.653|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.660|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.666|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.673|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.681|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.688|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.694|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.701|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.708|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.715|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.721|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.748|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.758|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.764|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.771|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.778|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.785|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.792|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.799|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.806|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.813|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.820|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.827|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.833|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.841|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.848|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.855|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.862|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.869|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.875|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.883|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.889|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.896|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.903|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.910|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.917|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.924|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.931|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.938|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.945|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.951|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.958|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.965|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.972|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.979|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.986|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.992|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.217.999|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.006|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.016|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.023|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.030|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.036|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.043|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.050|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.057|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.064|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.071|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.078|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.085|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.092|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.099|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.106|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.113|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.120|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.127|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.134|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.141|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.148|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.155|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.162|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.177|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.184|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.191|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.197|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.204|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.211|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.218|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.225|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetUuid’
00:24:24.218.232|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.239|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.246|19432|Lib.cpp:160[ProcessActivityRecord]: Driver API record received: ‘cuDeviceGetAttribute’
00:24:24.218.253|19432|Lib.cpp:124[ProcessActivityRecord]: Device context record received
00:24:24.218.260|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaMalloc_v3020’
00:24:24.218.266|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaMalloc_v3020’
00:24:24.218.273|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaMalloc_v3020’
00:24:24.218.283|19432|Lib.cpp:136[ProcessActivityRecord]: Memory copy record received
00:24:24.218.291|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaMemcpy_v3020’
00:24:24.218.298|19432|Lib.cpp:136[ProcessActivityRecord]: Memory copy record received
00:24:24.218.305|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaMemcpy_v3020’
00:24:24.218.371|19432|Lib.cpp:185[ProcessActivityRecord]: Kernel launch record received: ‘vectorAdd(float const*, float const*, float*, int)’
00:24:24.218.379|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaLaunchKernel_v7000’
00:24:24.218.386|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaGetLastError_v3020’
00:24:24.218.394|19432|Lib.cpp:136[ProcessActivityRecord]: Memory copy record received
00:24:24.218.401|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaMemcpy_v3020’
00:24:24.218.408|19432|Lib.cpp:214[ProcessActivityRecord]: Processing CUPTI record kind 45
00:24:24.218.415|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaFree_v3020’
00:24:24.218.422|19432|Lib.cpp:214[ProcessActivityRecord]: Processing CUPTI record kind 45
00:24:24.218.429|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaFree_v3020’
00:24:24.218.436|19432|Lib.cpp:214[ProcessActivityRecord]: Processing CUPTI record kind 45
00:24:24.218.443|19432|Lib.cpp:169[ProcessActivityRecord]: Runtime API record received: ‘cudaFree_v3020’
00:24:24.218.450|19432|Lib.cpp:244[BufferCompleted]: All records were processed
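The log above includes a "Kernel launch record received" line for vectorAdd, which suggests that CUPTI delivered kernel activity records in this run. Since most of the log is repeated driver API records, a small script can make that easier to see. The following is a sketch only: the function name summarize_injection_log is hypothetical, and the line format it expects (timestamp|pid|source[function]: message) is inferred from the output above rather than from any documented specification.

```python
import re
from collections import Counter

# timestamp|pid|source-file:line[function]: message
LINE_RE = re.compile(
    r"^[\d:.]+\|(?P<pid>\d+)\|(?P<src>[^\[]+)\[(?P<func>\w+)\]: (?P<msg>.*)$"
)


def summarize_injection_log(lines):
    """Count message kinds and collect the names of launched kernels."""
    counts = Counter()
    kernels = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a log record
        kind, _, detail = match.group("msg").partition(": ")
        counts[kind] += 1
        if kind == "Kernel launch record received":
            # Strip straight or typographic quotes around the kernel name.
            kernels.append(detail.strip("‘’'\""))
    return counts, kernels


# Example use on the file written via INJECTION_LOG_FILE:
#
#     with open("log.txt", encoding="utf-8") as handle:
#         counts, kernels = summarize_injection_log(handle)
#     print(counts.most_common())
#     print(kernels)
```

An empty kernel list for a run that is known to launch kernels would point at the CUPTI side rather than at the viewer.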