How to get all kernel names?

Is there any way to see all the kernels used by my application?
When I use Nsight Compute to profile my application without selecting a kernel, Nsight Compute gets stuck, so I'm thinking of getting all the kernels and profiling them one by one.
Here comes the question: how do I get all the kernels used by my application?

You can use Nsight Systems to trace your application. This not only allows you to find all launched CUDA kernels, but also to identify which ones are valuable optimization targets.
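
For example (a rough sketch only; flag names and report formats vary between Nsight Systems versions, so please check nsys --help on your installation), a trace plus summary could be collected like this, with python3 your_app.py standing in for your actual application command line:

nsys profile --stats=true -o myreport python3 your_app.py

With --stats=true, a set of summary tables is printed at the end of the run, including a CUDA kernel summary that lists every launched kernel by name. The same information is available in the GUI timeline, and newer releases can also regenerate it from the saved report with nsys stats.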

Nsight Compute gets stuck

Can you elaborate on the exact behavior you are seeing? Is the tool completely hung at a specific kernel, is it hung in between profiling kernels, or is it still making progress, just very slowly? Can you share the command line you used to launch it, as well as the relevant command line output?

Which version of the tool are you using, and on which platform? If you haven’t yet, I recommend trying the latest version 2020.1 which has many bug fixes and improvements.

Finally, Nsight Compute uses a method called Kernel Replay if not all requested metrics can be collected in a single pass. This can prevent certain applications from being profiled, e.g. if the accessed GPU memory is very large or if your kernels have certain CPU/GPU interactions. In this case, you can try to profile a single metric at a time using the --metrics option, so that replaying kernels is not necessary.
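
Schematically, that looks like the following (placeholders only; substitute a real metric name and your actual application command line, and note that --query-metrics prints the metrics available on your chip):

nv-nsight-cu-cli --query-metrics
nv-nsight-cu-cli --metrics <single_metric_name> <your application command line>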

Thank you for your answer.
I am using version 2020.1, and the command is:
/usr/local/NVIDIA-Nsight-Compute/nv-nsight-cu-cli --metrics sm__inst_executed_pipe_tensor_op_hmma.sum --kernel-id ":::" /keras_flops_parallel_mix.py
When it gets stuck, the stuck kernel is not the same every time, and with the top command I can see that the python process uses more than 800% CPU.

I aim to find all the tensor core FLOPs, but the Nsight Systems tool does not tell me which kernels use Tensor Cores. Is there any way to get all the tensor core FLOPs?

By the way, I want to ask another question:
why is it that each time I run this command:

/usr/local/NVIDIA-Nsight-Compute/nv-nsight-cu-cli --metrics sm__inst_executed_pipe_tensor_op_hmma.sum --kernel-id "::regex:^EigenMetaKernel.*$:" /usr/bin/python3 keras_flops_parallel_mix.py

and accumulate all the non-zero data together, the answer is different?



sm__inst_executed_pipe_tensor_op_hmma.sum should already be a single-pass metric, so kernel replay can likely be ruled out as the cause. What GPU are you trying to collect this on, and on which driver (can you provide the output of nvidia-smi, for example)?
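
For example, something like this should be enough to report the GPU model and driver version:

nvidia-smi --query-gpu=name,driver_version --format=csv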

is there any way to get all the tensor core FLOPs

Nsight Compute would be the right tool for this, yes. If I understand correctly, collecting the data works for all kernels but the ones named EigenMetaKernel, is that correct?

I am using a Tesla V100-SXM2-32GB with CUDA 10.1.
Part of my ncu output looks like this:

==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100% - 1 pass
==PROF== Profiling "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn": 0%...50%...100%
1/1 [==============================] - 14s 14s/step - loss: 7.2212 - accuracy: 0.0000e+00 - val_loss: 13.2001 - val_accuracy: 0.0000e+00
trialid: lDbOu, train successfully!!

- 1 pass
==PROF== Disconnected from process 8690
[8690] python3.5@127.0.0.1
volta_fp16_s884gemm_fp16_128x256_ldg8_f2f_nn, 2020-Jul-16 18:09:58, Context 7, Stream 166
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__inst_executed_pipe_tensor_op_hmma.sum inst 159744
---------------------------------------------------------------------- --------------- ------------------------------

volta_fp16_s884gemm_fp16_128x256_ldg8_f2f_nn, 2020-Jul-16 18:09:58, Context 7, Stream 166
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__inst_executed_pipe_tensor_op_hmma.sum inst 319488
---------------------------------------------------------------------- --------------- ------------------------------

Volta_hmma_implicit_gemm_fprop_fp32_nhwc_64x32x64x1_1x3x3x0x1, 2020-Jul-16 18:09:58, Context 7, Stream 166
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__inst_executed_pipe_tensor_op_hmma.sum inst 451584
---------------------------------------------------------------------- --------------- ------------------------------

volta_fp16_s884gemm_fp16_128x256_ldg8_f2f_nn, 2020-Jul-16 18:09:59, Context 7, Stream 166
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__inst_executed_pipe_tensor_op_hmma.sum inst 319488
---------------------------------------------------------------------- --------------- ------------------------------

collecting the data works for all kernels but the ones named EigenMetaKernel

That's right, but each time I run that command, it gets stuck randomly after 10 or more minutes.

So my plan to get all the tensor core FLOPs (since running ncu without specifying a kernel gets stuck) is:

  1. Get all the kernel names, but I don't know how to get them. It is not suitable to use the Nsight GUI to look at the kernels; that is too complex, since I want to do this automatically with a script;

  2. Use sm__inst_executed_pipe_tensor_op_hmma.sum to get each selected kernel's tensor core FLOPs;

  3. Sum up all the kernels' tensor core FLOPs.

Is the above method correct? If not, do you have any way to collect all the kernels' tensor core FLOPs without getting stuck?

thank you very much!!

The method you are describing sounds reasonable to me.
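
If you want to script it, a rough sketch of steps 2 and 3 in Python could look like the one below. This is only an illustration, not an official workflow: the ncu install path and application command line are taken from this thread, while the CSV column names ("Kernel Name", "Metric Name", "Metric Value") and the assumption that the CSV report is written to stdout should be verified against your own 2020.1 output before relying on the numbers.

import csv
import io
import re
import subprocess

NCU = "/usr/local/NVIDIA-Nsight-Compute/nv-nsight-cu-cli"  # your install path
METRIC = "sm__inst_executed_pipe_tensor_op_hmma.sum"
APP = ["/usr/bin/python3", "keras_flops_parallel_mix.py"]  # your application

# Step 1: kernel names gathered beforehand, e.g. from an Nsight Systems trace.
KERNEL_NAMES = [
    "volta_fp16_s884gemm_fp16_64x64_ldg8_f2f_nn",
    "volta_fp16_s884gemm_fp16_128x256_ldg8_f2f_nn",
    # ...
]

def metric_sum_for_kernel(name):
    """Step 2: profile only the launches of one kernel and sum the metric over them."""
    kernel_id = "::regex:^{}$:".format(re.escape(name))
    result = subprocess.run(
        [NCU, "--metrics", METRIC, "--kernel-id", kernel_id, "--csv"] + APP,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
    lines = result.stdout.splitlines()
    # Skip the ==PROF== progress messages and the application's own output; the
    # CSV table is assumed to start at a header row containing these column names.
    start = next((i for i, line in enumerate(lines)
                  if "Metric Name" in line and "Metric Value" in line), None)
    if start is None:
        return 0
    total = 0
    for row in csv.DictReader(io.StringIO("\n".join(lines[start:]))):
        if row.get("Metric Name") == METRIC:
            total += int(float(row["Metric Value"].replace(",", "")))
    return total

# Step 3: sum the per-kernel totals.
grand_total = 0
for kernel in KERNEL_NAMES:
    value = metric_sum_for_kernel(kernel)
    print(kernel, value)
    grand_total += value
print("total", METRIC, "=", grand_total)

Two notes on top of that: if a single run that profiles all kernels with just this one metric does not hang for you, one invocation with --csv plus a sum over all rows is much cheaper than re-running the application once per kernel name. Also keep in mind that the metric's unit is instructions (the inst column in your output above), so turning it into actual FLOP counts still requires the per-instruction factor for your data type and GPU.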

and accumulate all the non-zero data together, the answer is different

I checked with my team on this question, but I don't have a complete answer yet. I will update here once I have more information. Basically, given the current information, I would expect the values to be consistent across runs, as long as the application is fully deterministic. If the app you are running does some form of dynamic load balancing, randomized execution or input data, or similar, the number of executed instructions can vary.

It would also help us if you could provide the application and input data you are using, so that we can check on the hang internally and see if we can reproduce and fix the issue.

Thanks.