How does the nsys tool profile CUDA libraries such as cuBLAS, cuDNN, etc.?

We can see that CUDA-related libraries can be traced on the timeline, and CUPTI supports OpenACC profiling. So how does nsys trace other libraries, like cuDNN, cuBLAS, etc.?

If you are asking specifically about cuDNN and the like, they are CUDA libraries, and most of our support for those also comes through CUPTI.

For other NVIDIA libraries, like NCCL or NVSHMEM, we have the library authors instrument them with the same NVTX annotation libraries that we suggest users use.

If you are asking about other API traces, like OpenMP or MPI, there are two basic methodologies: either on-the-fly annotation or interposers.

Thanks for the quick reply.

You mean CUPTI is used for tracing the cuDNN library, right? Is cuBLAS traced using the same method as well?

From the CUPTI guide (CUPTI :: CUPTI Documentation), it looks like cuDNN and cuBLAS are not mentioned as supported?


For MPI, I see the guide above says to use NVTX; for OpenMP, the OMPT interface, right?

I misspoke. The cu* libraries are written by NVIDIA, and we have embedded NVTX in them.

For MPI specifically, see User Guide :: Nsight Systems Documentation (it is a direct link, I just can’t change the text).

In particular, we only support OpenMPI- and MPICH-based MPI variants, but if you want to use another one, take a look at https://github.com/NVIDIA/cuda-profiler/tree/master/nvtx_pmpi_wrappers to see in general how we do automatic NVTX annotations for MPI.

Ok, thanks for your clarification.

Next, let me narrow down my question and focus only on tracing the cu* libraries. As I understand it, NVTX instruments the cu* libraries at runtime? Does it support dynamic library instrumentation only, or both dynamic and static libraries? My understanding is that the instrumentation can only be performed on dynamic libraries; is that right?

To annotate cu* libraries, are there special configurations or other steps that need to be done in these libraries?

@afroger can you go into more detail here?

NVTX is an annotation API. It’s a header-only library that lets developers instrument their code.

Does it support dynamic library instrumentation only, or both dynamic and static libraries?

There is no dynamic instrumentation involved. Developers use the NVTX API to annotate their code. Those calls become tracepoints a tool can trace at runtime, and they are no-ops when no tool is attached to the process.

The following CUDA libraries are directly instrumented with NVTX:

  • cuBLAS
  • cuCIM
  • cuDF
  • cuML
  • cuSOLVER
  • cuSPARSE
  • NCCL
  • NVSHMEM
  • OptiX
  • TensorRT

For most of those, you can simply use nsys profile --trace nvtx to trace the annotations. For others, additional environment variables need to be set for the NVTX annotations to be emitted at runtime (i.e., they have runtime checks that control whether the NVTX APIs are called or not). We have individual --trace switches for those cases (e.g., --trace cublas, --trace cusparse, etc.).

Very clear, thanks.

I have some other questions below; here I use cuBLAS as an example, with the library linked into a program:

  1. Does NVTX start to instrument the cuBLAS library only when nsys is used to profile a program linked against cuBLAS? In other words, if no tool is involved, do the API calls in the library execute normally?
  2. I see many cuBLAS API calls presented on the timeline. Does this mean NVTX instruments all functions called in the program, and is this implemented at runtime by nsys?
  3. Some pin tools provide instrumentation APIs to insert into the source code. Do we need to do these insertions in the library in advance, or is the library instrumented at runtime by nsys?
  4. If cuBLAS is a static library linked into the program, does NVTX still support instrumenting it? I mean, does NVTX care how the library is linked?
  5. Is cuDNN instrumented with NVTX?
  6. We also see CUDA API calls presented on the timeline. Does this mean the CUDA API is also instrumented with NVTX?
  7. I see there are some debug APIs in cuBLAS. Are these APIs related to NVTX instrumentation?

I recommend you go look at the NVTX source code. It’s open-source and can be found in the NVTX GitHub repository.

Basically, on the first call to an NVTX API, the NVTX header-only implementation looks at the NVTX_INJECTION{32,64}_PATH environment variable and tries to load the library specified at that path. When successful, it queries the InitializeInjectionNvtx2 symbol and calls it if found. At that point, a tool library has been loaded and has a chance to execute code within the process. A table of function pointers is passed as a parameter to the InitializeInjectionNvtx2 function, which allows the tool to entirely replace the NVTX API implementation with its own. If any of those steps fails, the default NVTX API implementation from the header-only library is called, and that implementation is basically a no-op for each NVTX API.

As you can see, there is no dynamic instrumentation involved. You can think of an NVTX API call within a process as a static tracepoint that can be traced regardless of whether it lives in a static library, a shared library, or an executable.

To answer your questions.

Does NVTX start to instrument the cuBLAS library only when nsys is used to profile a program linked against cuBLAS?

No, NVTX doesn’t instrument anything. The annotations are defined at compile time; NVTX API calls are similar to static tracepoints.

In other words, if no tool is involved, do the API calls in the library execute normally?

If no tool is involved, the default NVTX API implementation embedded in the header is called. It’s mostly equivalent to a no-op; only a couple of CPU instructions end up being executed, so the library’s API calls execute normally. You can verify this by looking at the disassembly of a program instrumented with NVTX.

I see many cuBLAS API calls presented on the timeline. Does this mean NVTX instruments all functions called in the program, and is this implemented at runtime by nsys?

Nsight Systems replaces the default implementation of the NVTX API with its own. Observability is limited by how many NVTX annotations are present in the cuBLAS library. cuBLAS actually provides additional control through environment variables to limit the number of NVTX API calls made by the library; this mostly provides control over the level of detail. Typically, using --trace cublas-verbose, you’ll see more cuBLAS trace than with --trace cublas. The former additionally sets an environment variable that cuBLAS checks; based on it, cuBLAS calls the NVTX API in additional places.

Some pin tools provide instrumentation APIs to insert into the source code. Do we need to do these insertions in the library in advance, or is the library instrumented at runtime by nsys?

NVTX is an instrumentation/annotation API. You add NVTX API calls into your library/executable and that allows a tool to trace it. Again, I recommend looking at the NVTX GitHub page. There is no runtime instrumentation done by Nsight Systems to trace NVTX.

If cuBLAS is a static library linked into the program, does NVTX still support instrumenting it? I mean, does NVTX care how the library is linked?

As explained above, it doesn’t matter what is being instrumented (static library, shared library, executable, JITed code). It’s just going to work.

Is cuDNN instrumented with NVTX?

cuDNN is not instrumented with NVTX. Nsight Systems does provide the ability to trace cuDNN, but that is done through runtime instrumentation; it has nothing to do with NVTX.

We also see CUDA API calls presented on the timeline. Does this mean the CUDA API is also instrumented with NVTX?

CUDA is not instrumented with NVTX. Nsight Systems traces CUDA using the CUPTI SDK. CUPTI itself relies on internal mechanisms to trace CUDA; it does not rely on NVTX.

I see there are some debug APIs in cuBLAS. Are these APIs related to NVTX instrumentation?

I’m not sure what debug APIs you’re referring to. And the answer is likely “no”.

Many thanks to you.

From your answers, here is my summary:

  1. NVTX is similar to static tracepoints, which are pre-defined in the cu* libraries by default. When using nsys, all the annotations are replaced by its own implementation, depending on options like --trace cublas, etc.
  2. I see NVTX also supports the TensorFlow and PyTorch frameworks. Is this implemented like the cu* library tracing you mentioned above?
  3. As you mentioned, cuDNN is instrumented at runtime; is that like what pin tools do?

NVTX is similar to static tracepoints, which are pre-defined in the cu* libraries by default. When using nsys, all the annotations are replaced by its own implementation, depending on options like --trace cublas, etc.

Mostly correct. Saying “annotations are replaced by its own” is a bit odd, though. It would be more accurate to say that Nsight Systems’ tool library gets loaded by the NVTX shim and replaces the default implementation of the NVTX API.

Just FYI, the only thing --trace cublas does is:

  1. Set NVTX_INJECTION64_PATH to let the NVTX shim load the Nsight Systems tool library which replaces the default NVTX API implementation
  2. Set CUBLASXT_NVTX_LEVEL=1 and CUBLAS_NVTX_LEVEL=1 to make the cuBLAS library call into the NVTX API (note that cublas-verbose would set those to 2 to enable more verbosity)

I see NVTX also supports the TensorFlow and PyTorch frameworks. Is this implemented like the cu* library tracing you mentioned above?

Again, I wouldn’t say NVTX supports TensorFlow and PyTorch. The TensorFlow and PyTorch frameworks are instrumented with NVTX (i.e., they added nvtxMark*-style calls into their libraries), and that allows NVIDIA tools to trace them.

Thanks again.

Could you describe the runtime trace for cuDNN? Why don’t we use NVTX for tracing that library? Thanks.

Hi,
Have all the cu* libraries that support NVTX been pre-instrumented in their functions, with the NVTX calls replaced by the tool when needed?

For example, in the cuBLAS function cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb); an NVTX call has been pre-instrumented, but it is a no-op during regular execution until the tool is launched, at which point the NVTX calls are redirected to the functions in the tool? For the Nsight Systems tool, are the NVTX calls implemented in /usr/local/cuda-12.0/include/nvtx3? Thanks.