I tried to collect metrics for CUDA kernels launched from Python code on a remote machine equipped with eight A800 cards. The command I used for the metric collection was "ncu --device 0 --target-processes all --section SpeedOfLight -o paged_attention -k paged_attention_v1_kernel python benchmark_paged_attention.py".
However, the output reported that no kernel was profiled, as shown below.
$ ncu --device 0 --target-processes all --section SpeedOfLight -o paged_attention -k paged_attention_v1_kernel python benchmark_paged_attention.py
Namespace(batch_size=8, block_size=16, dtype='half', head_size=128, kv_cache_dtype='auto', num_kv_heads=8, num_query_heads=64, profile=False, seed=0, seq_len=4096, use_alibi=False, version='v2')
==PROF== Connected to process 30384 (/home/xwentian/.conda/envs/vllm_a800/bin/python3.8)
Warming up…
Kernel running time: 306.206 us
==PROF== Disconnected from process 30384
==WARNING== No kernels were profiled.
This issue is likely related to how I configured the ncu command, so I am asking for suggestions here.
You should check what name ncu uses for these kernels by default, and whether the -k argument you passed matches that name. To find out, you can remove the -k option and get the unfiltered list of kernels.
Note that both the printed and matched name variants can be configured using the --kernel-name-base and --print-kernel-base options.
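To make the suggestion above concrete, here is a minimal Python sketch that assembles the ncu command line with or without a kernel filter. It only uses flags that already appear in this thread, plus the `regex:` prefix that `-k` accepts for pattern matching. The helper name `build_ncu_command` and its parameters are hypothetical. Note also that the Namespace line in the first log shows version='v2', so a filter on paged_attention_v1_kernel may simply never match in that run.

```python
import shlex


def build_ncu_command(script_args, kernel_filter=None, output="paged_attention"):
    """Assemble an ncu argv list; the -k filter is omitted when kernel_filter is None."""
    cmd = [
        "ncu", "--device", "0", "--target-processes", "all",
        "--section", "SpeedOfLight", "-o", output,
    ]
    if kernel_filter is not None:
        # An exact name, or "regex:<pattern>" for a pattern match.
        cmd += ["-k", kernel_filter]
    cmd += ["python"] + list(script_args)
    return cmd


script = ["benchmark_paged_attention.py", "--version", "v1"]

# First run without a filter to see which kernel names ncu reports.
unfiltered = build_ncu_command(script)
# Then narrow down, here with a loose regex rather than an exact name.
filtered = build_ncu_command(script, kernel_filter="regex:paged_attention")

print(shlex.join(unfiltered))
print(shlex.join(filtered))
```

Running the unfiltered command first shows the names ncu actually matches against, which can then be copied into the filter.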
After receiving your recommendations, I ran a test with the same command line but without the -k option, so that all kernels would be captured. The result still showed that no kernels had been profiled by ncu.
I have listed the output printed by the ncu command below. I hope it helps in working out why no kernel information is collected.
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$ ncu --device 0 --target-processes all --section SpeedOfLight -o paged_attention python benchmark_paged_attention.py --version v1
Namespace(batch_size=8, block_size=16, dtype='half', head_size=128, kv_cache_dtype='auto', num_kv_heads=8, num_query_heads=64, profile=False, seed=0, seq_len=4096, use_alibi=False, version='v1')
==PROF== Connected to process 2026979 (/home/xwentian/.conda/envs/vllm_a800/bin/python3.8)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see NVIDIA Development Tools Solutions - | NVIDIA Developer
Warming up…
Kernel running time: 347.515 us
==PROF== Disconnected from process 2026979
==WARNING== No kernels were profiled.
Hi, @xwentian
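Thanks for the detailed output. ERR_NVGPUCTRPERM indicates that your user account currently does not have permission to access the GPU performance counters on this machine, so profiling is blocked.
You need to follow the instructions on the NVIDIA Development Tools Solutions page on NVIDIA Developer (the one referenced in the error message) to grant the permission.
The quickest way is to run the command directly with "sudo".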
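One practical detail when using sudo: it commonly resets PATH, so neither ncu nor the conda environment's Python interpreter may be found by name. The following is a minimal sketch, assuming it is run from inside the activated conda environment, that resolves both to absolute paths before prefixing the command with sudo. The helper name `build_sudo_ncu_command` is hypothetical.

```python
import shutil
import sys


def build_sudo_ncu_command(ncu_path, python_path, script_args):
    """Prefix an ncu invocation with sudo, using absolute paths for both binaries."""
    return [
        "sudo", ncu_path,
        "--device", "0", "--target-processes", "all",
        "--section", "SpeedOfLight", "-o", "paged_attention",
        python_path,
    ] + list(script_args)


# shutil.which returns None when ncu is not on PATH; fall back to the bare name.
ncu_path = shutil.which("ncu") or "ncu"
# sys.executable is the interpreter running this script (assumed: the conda env's Python).
cmd = build_sudo_ncu_command(
    ncu_path, sys.executable, ["benchmark_paged_attention.py", "--version", "v1"]
)
print(" ".join(cmd))
```

Passing absolute paths avoids having to edit the sudo configuration just so that the binaries can be located.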
I used the following commands to add my account to sudoers and then ran the ncu command again, but the output still showed that the collection of kernel metrics failed as before. For convenience, I have pasted my commands below.
su root
visudo (in /etc/sudoers added one line under root ALL=(ALL) ALL)
su xwentian
exec bash
cd /home/xwentian
source activate vllm_a800
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$ cd /data/xwentian/vllm/benchmarks/kernels
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$ sudo apt-get update
Hit:1 https://download.docker.com/linux/ubuntu focal InRelease
Hit:3 Index of /ubuntu focal InRelease
Hit:2 Index of /compute/cuda/repos/ubuntu2004/x86_64 InRelease
Hit:4 Index of /ubuntu focal InRelease
Hit:5 Index of /ubuntu-toolchain-r/test/ubuntu focal InRelease
Hit:6 Index of /ubuntu focal-updates InRelease
Hit:7 Index of /ubuntu focal-security InRelease
Hit:8 Index of /ubuntu focal-backports InRelease
Reading package lists… Done
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$ apt-get update
Reading package lists… Done
E: Could not open lock file /var/lib/apt/lists/lock - open (13: Permission denied)
E: Unable to lock directory /var/lib/apt/lists/
W: Problem unlinking the file /var/cache/apt/pkgcache.bin - RemoveCaches (13: Permission denied)
W: Problem unlinking the file /var/cache/apt/srcpkgcache.bin - RemoveCaches (13: Permission denied)
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$ ncu --device 0 --target-processes all --section SpeedOfLight -o paged_attention python benchmark_paged_attention.py --version v1
Namespace(batch_size=8, block_size=16, dtype='half', head_size=128, kv_cache_dtype='auto', num_kv_heads=8, num_query_heads=64, profile=False, seed=0, seq_len=4096, use_alibi=False, version='v1')
==PROF== Connected to process 2315375 (/home/xwentian/.conda/envs/vllm_a800/bin/python3.8)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see NVIDIA Development Tools Solutions - | NVIDIA Developer
Warming up…
Kernel running time: 347.802 us
==PROF== Disconnected from process 2315375
==WARNING== No kernels were profiled.
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$
Apparently, you still did not get the permission needed to run the profiler with this method. In the log above, the ncu command itself was run without sudo, so the sudoers change had no effect on it.
If you need to profile with your own account, then please follow this instruction from the same page: to allow access for any user, create a file with the .conf extension containing options nvidia NVreg_RestrictProfilingToAdminUsers=0 in /etc/modprobe.d and reboot for it to take effect.
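After the reboot, it can help to confirm that the setting actually took effect before running ncu again. Below is a minimal Python sketch for that check. It assumes the Linux driver exposes its parameters in /proc/driver/nvidia/params with an entry named RmProfilingAdminOnly (1 meaning restricted to admin users, 0 meaning open to all users). The function name `profiling_restricted` is hypothetical.

```python
from pathlib import Path

# Assumed location of the NVIDIA kernel module parameters on Linux.
PARAMS_FILE = Path("/proc/driver/nvidia/params")


def profiling_restricted(params_text):
    """Return True/False from the RmProfilingAdminOnly entry, or None if it is absent."""
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "RmProfilingAdminOnly":
            return value.strip() != "0"
    return None


if PARAMS_FILE.exists():
    state = profiling_restricted(PARAMS_FILE.read_text())
    print("profiling restricted to admin users:", state)
else:
    print("NVIDIA params file not found; is the driver loaded?")
```

If the value is still reported as restricted after the reboot, the .conf file was probably not picked up, for example because the module is loaded from an initramfs that was not regenerated.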
Before receiving your reply to my previous question, I edited /etc/sudoers and added the path /usr/local/cuda-12.1/bin to the "Defaults secure_path" line so that ncu can be run with sudo. Running ncu that way produced a new error:
==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See 2. Kernel Profiling Guide — NsightCompute 12.5 documentation for more details.
==ERROR== Failed to profile “distribution_elementwise_grid…” in process 2807896
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
(vllm_a800) xwentian@g0015:/data/xwentian/vllm/benchmarks/kernels$
I have to wait for some time before the machine can be rebooted safely.
By the way, if the new error above is related to DCGM, my understanding is that dcgmi profile --pause should be run before profiling the CUDA kernels with ncu, and dcgmi profile --resume afterwards to restart DCGM profiling. Are these the appropriate steps for running my kernel profiling alongside DCGM without having to reboot the entire machine?
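The pause/resume sequence described in the question can be sketched in Python as a context manager, so that DCGM profiling is resumed even if the ncu run fails. This is only an illustration under the assumption that DCGM is the tool holding the profiling resource, which the log does not confirm. The two dcgmi commands are the ones quoted above, and the helper names are hypothetical.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def dcgm_profiling_paused(run=subprocess.run):
    """Pause DCGM profiling for the duration of the block, then resume it."""
    run(["dcgmi", "profile", "--pause"], check=True)
    try:
        yield
    finally:
        # Resume even if the profiled command raised or was interrupted.
        run(["dcgmi", "profile", "--resume"], check=True)


def profile_with_dcgm_paused(ncu_cmd, run=subprocess.run):
    """Run the given ncu command line while DCGM profiling is paused."""
    with dcgm_profiling_paused(run):
        return run(ncu_cmd, check=False)
```

A call would look like `profile_with_dcgm_paused(["ncu", "--section", "SpeedOfLight", "python", "benchmark_paged_attention.py"])`. The `run` parameter exists only so the sequence can be exercised without dcgmi or ncu installed.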