Why does nsys not output the cudaSetDevice API in cuda_api_sum?

I tried to analyze CUDA runtime API calls with nsys, but I found that some runtime API calls do not appear in the cuda_api_sum report.

My code is as follows:

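#include <cstdlib>
#include <cuda_runtime.h>

using mt = float;

int main(){
  cudaSetDevice(0);  // the call asked about below; its exact position in the original is not shown

  size_t sz = 4096;
  size_t msz = sz*sz;
  dim3 grid = dim3(sz/16/2, sz/16);
  dim3 block = dim3(16,16);
  mt *d_MatA, *d_MatB;
  float * host_a = (float *)malloc(sizeof(float)*msz);

  cudaMalloc(&d_MatA, sizeof(float)*msz);
  cudaMalloc(&d_MatB, sizeof(float)*msz);

  cudaMemcpy(d_MatA, host_a, sizeof(float)*msz, cudaMemcpyDefault );

  // The posted code stops here; the cleanup below matches the two cudaFree calls in the report.
  cudaFree(d_MatA);
  cudaFree(d_MatB);
  free(host_a);
  return 0;
}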

I compiled and profiled it with the following command:

nvcc  -arch=sm_80 res.cu ; nsys profile  --stats=true  a.out

The output is as follows:

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)           Name
 --------  ---------------  ---------  ----------  ----------  --------  --------  -----------  ----------------------
     97.4         14039502          1  14039502.0  14039502.0  14039502  14039502          0.0  cudaMemcpy
      1.5           215939          2    107969.5    107969.5     67436    148503      57323.0  cudaFree
      1.1           153313          2     76656.5     76656.5     50035    103278      37648.5  cudaMalloc
      0.0             2004          1      2004.0      2004.0      2004      2004          0.0  cuModuleGetLoadingMode

I call cudaSetDevice(0) in my code, so why doesn't it show up in the nsys output?

You can find a list of the CUDA calls that are traced by default in the Nsight Systems 2024.2 User Guide (it is a direct link; the forum software just munges it).

We only trace calls that can take a long time and are capable of disrupting the flow of work.

@hwilper, thanks for your answer, but I still have some questions.

  1. If I want to trace all API calls from the CLI, is there an nsys argument that does this?

  2. The default Driver API list in the Nsight Systems 2024.2 User Guide contains many APIs with a _v2 suffix.

    What is the difference between an API with _v2 and the same API without _v2? I don't see the former in the CUDA Toolkit documentation.

  3. I call the cudaMemcpy API in my test code. In my understanding, cudaMemcpy calls cuMemAlloc, so why doesn't the nsys output show it, given that cuMemAlloc is also in the default list? If I want nsys to output this low-level call, how do I do it?

@skottapalli can you explain a little more about how cuda memory allocations work?

  1. The GUI has an option "skip some API calls" under "Collect CUDA trace". It is checked by default. You could uncheck it and try profiling again. In the CLI, you could set NSYS_CONFIG_DIRECTIVES='CudaSkipSomeApiCalls=false' in front of your nsys command line, for example: NSYS_CONFIG_DIRECTIVES='CudaSkipSomeApiCalls=false' nsys profile --stats=true a.out

  2. These are different versions of CUDA APIs. Nsight Systems supports older versions of CUDA, so it includes the APIs which are deprecated in the latest CUDA toolkit.

  3. Only the APIs invoked directly by the user application are traced. Any calls made by the CUDA driver under the covers are not shown on the nsys timeline. If the user application calls cuMemAlloc directly, then it will show up in nsys.
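
To make point 3 concrete, here is a minimal sketch of a program that allocates memory once through the Runtime API and once by calling the Driver API directly. The file name, buffer size, and build line are assumptions for illustration and are not from the original post. One detail that may help with point 2: in the toolkit's cuda.h, names such as cuMemAlloc are macros that expand to the _v2 entry points, so the trace would most likely list the direct call as cuMemAlloc_v2.

```cuda
// driver_vs_runtime.cu (hypothetical file name)
// Assumed build line: nvcc -arch=sm_80 driver_vs_runtime.cu -lcuda
#include <cstdio>
#include <cuda.h>          // Driver API (cu* functions)
#include <cuda_runtime.h>  // Runtime API (cuda* functions)

int main() {
  const size_t bytes = 1 << 20;  // 1 MiB, arbitrary size for the example

  // Runtime API calls made directly by the application.
  cudaSetDevice(0);
  float *d_rt = nullptr;
  if (cudaMalloc(&d_rt, bytes) != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }

  // Driver API calls made directly by the application.
  // Retain the device's primary context (the one the runtime uses)
  // and make it current on this thread before allocating.
  CUdevice dev;
  CUcontext ctx;
  if (cuInit(0) != CUDA_SUCCESS ||
      cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
      cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS ||
      cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
    std::fprintf(stderr, "driver API setup failed\n");
    return 1;
  }

  CUdeviceptr d_drv = 0;
  if (cuMemAlloc(&d_drv, bytes) != CUDA_SUCCESS) {
    std::fprintf(stderr, "cuMemAlloc failed\n");
    return 1;
  }

  // Clean up both allocations and release the retained primary context.
  cuMemFree(d_drv);
  cuDevicePrimaryCtxRelease(dev);
  cudaFree(d_rt);
  return 0;
}
```

Profiling this with the same nsys command as above should let you compare the two paths: the direct cuMemAlloc call is made by the application itself, whereas whatever cudaMalloc does internally is not reported as a nested call.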

Do you mean cudaMalloc API invokes cuMemAlloc?

NSYS does not show nested CUDA Runtime to CUDA Driver API calls. I believe this can still be observed in Nsight Compute interactive profiling mode in the API window.