How to profile all attention layers in BERT inference?

Hi everyone,
I am trying to profile the attention layers in BERT inference. About 8,000 kernels are profiled, and unfortunately I can't load them all at once in the Nsight Compute application, and I don't know how to load only a subset. Another problem is that the kernel names are so cryptic that I can't be sure which kernels actually belong to the attention layers.
My main goal is to profile the cache-memory behavior of the attention layers at each token-generation step and compare the steps with each other.
Can you help me figure out what to do?
Thank you in advance.

Hi,

It is indeed not possible to load only some of the profiling results contained in an NCU report file in the GUI. However, you can filter them on the command line; see this section of the documentation and the ones following it. With this, you can either display particular kernel results directly on the command line, or use import and export to create a new report file that you can then open in the NCU UI. Here is also a guide to some commonly used filtering options. You can also use these filter options directly when collecting profiling data.
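The filter-then-export flow described above can be sketched as two `ncu` invocations; here they are assembled as argument lists in Python for clarity. The report file names and the kernel-name regex are placeholders, and the options (`--import`, `--kernel-name` with a `regex:` prefix, `--export`) are taken from the Nsight Compute CLI documentation:

```python
import shlex

# Hypothetical file name for the full report collected earlier.
full_report = "bert_profile.ncu-rep"

# Show only kernels whose demangled name matches the regex,
# directly on the command line. The regex itself is a placeholder;
# adjust it to the kernel names you see in your report.
show = ["ncu", "--import", full_report,
        "--kernel-name", "regex:attention|softmax"]

# Same filter, plus --export to write a smaller report file
# that the NCU UI can then open. ncu appends the .ncu-rep suffix.
export = show + ["--export", "attention_only"]

print(shlex.join(show))
print(shlex.join(export))
```

Running these filtered imports first on the command line lets you check that the regex catches the kernels you care about before exporting a reduced report for the GUI.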

In particular, if you already know which kernels you are interested in, you can filter them by kernel name (with a regex). If not, you can try defining NVTX ranges around the Python function calls you are most interested in and use range replay to profile only those. You will have to specify these NVTX ranges for profiling as described here; also note that certain limitations apply. Here you can find the NVTX package for Python.
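The NVTX approach can be sketched as follows. This is a minimal sketch assuming the `nvtx` PyPI package; `model`, `generate`, and the range names are hypothetical placeholders for your actual generation loop, and a no-op fallback is included so the script also runs where `nvtx` is not installed:

```python
from contextlib import contextmanager

try:
    import nvtx  # pip install nvtx
    annotate = nvtx.annotate
except ImportError:
    # No-op fallback so the script still runs without the nvtx package.
    @contextmanager
    def annotate(message=None, **kwargs):
        yield


def generate(model, tokens, n_steps):
    # Hypothetical token-generation loop: wrap each step, and the part
    # you consider "attention", in named NVTX ranges so that ncu's
    # range replay can profile exactly those regions per step.
    for step in range(n_steps):
        with annotate(f"gen_step_{step}"):
            with annotate("attention"):
                tokens = model(tokens)  # placeholder for the real forward call
    return tokens
```

You could then restrict collection to those ranges with something like `ncu --nvtx --nvtx-include "attention" --replay-mode range python script.py`; check the exact flag spellings and the range-replay limitations in the documentation linked above. Because each step carries its own `gen_step_N` range, you can compare the cache metrics of the attention region across token-generation steps, which matches the original question.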

Another problem is that the kernel names are so cryptic and I can't be sure which kernels actually belong to the attention layers.

Independently of the replay mode you use, you should be able to see where a particular kernel (or range) was invoked from by looking at the Python call stack visible on the Context page.
