No CUDA kernels shown in nsys profiler timeline when using dynamic parallelism

Nsight Systems fails to show my kernels in the timeline view when my code uses dynamic parallelism. Sometimes, passing --gpuctxsw=true --trace-fork-before-exec=true to the profiler causes all kernel activity to be displayed under “Run” rather than by function name, as in the image below.
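For completeness, the invocation looks roughly like this (the binary name and the report name are placeholders for my own; the two flags are the ones mentioned above):

```shell
# Sketch of the nsys invocation; --gpuctxsw records GPU context switches and
# --trace-fork-before-exec follows processes that fork before calling exec.
nsys profile --trace=cuda \
    --gpuctxsw=true \
    --trace-fork-before-exec=true \
    -o cdp_report \
    ./example_cuda_exe
```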


More often I get no indication in the timeline that my CUDA kernel ran at all; the view is very similar to the image above, except that the dGPU (Tesla V100-PCIE-32GB) rows are not present.

Here is a minimal example to reproduce this result:

#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda.h>

__global__ void add_one(float *val)
{
    *val = *val + 1;
}

__global__ void vector_add(const float *lhs, const float *rhs, float *result, std::size_t len)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (std::size_t i = index; i < len; i += stride)
    {
        result[i] = lhs[i] + rhs[i];
        add_one<<<1, 1>>>(result + i); // profiling works when this line is commented out
    }
}

void add(const float *lhs, const float *rhs, float *result, std::size_t len)
{
    float *lhs_dev = nullptr, *rhs_dev = nullptr, *result_dev = nullptr;
    cudaMalloc(&lhs_dev, len * sizeof(float));
    cudaMalloc(&rhs_dev, len * sizeof(float));
    cudaMalloc(&result_dev, len * sizeof(float));
    cudaMemcpy(lhs_dev, lhs, len * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(rhs_dev, rhs, len * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32);
    dim3 grid((len + block.x - 1) / block.x);

    printf("Running kernel with config: <<<%u, %u>>>\n", grid.x, block.x);
    vector_add<<<grid, block>>>(lhs_dev, rhs_dev, result_dev, len);

    cudaDeviceSynchronize();

    cudaMemcpy(result, result_dev, len * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(lhs_dev);
    cudaFree(rhs_dev);
    cudaFree(result_dev);
}

int main(void)
{
    constexpr int size = 6;
    float lhs[size] = {1, 2, 3, 4, 5, 6};
    float rhs[size] = {2, 3, 4, 5, 6, 7};
    float result[size];

    add(lhs, rhs, result, size);

    printf("%f %f %f %f %f %f\n", result[0], result[1], result[2], result[3], result[4], result[5]);

    return 0;
}
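As an aside, the repro above has no error checking; here is a hedged sketch of how one might wrap the CUDA calls to rule out silent launch failures (the CHECK macro name is my own, and this needs the CUDA toolkit to build):

```cpp
// Minimal error-checking sketch for the example above (CUDA C++,
// build with nvcc -rdc=true). The CHECK macro name is arbitrary.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                          \
    do {                                                                     \
        cudaError_t err__ = (call);                                          \
        if (err__ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                         cudaGetErrorString(err__), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                         \
        }                                                                    \
    } while (0)

// Usage around the launch in add() would look like:
//     vector_add<<<grid, block>>>(lhs_dev, rhs_dev, result_dev, len);
//     CHECK(cudaGetLastError());      // catches launch-configuration errors
//     CHECK(cudaDeviceSynchronize()); // catches errors raised while the grid ran

int main()
{
    CHECK(cudaFree(nullptr)); // no-op call, just confirms the runtime is reachable
    std::puts("CUDA runtime reachable");
    return 0;
}
```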

Running the program, I get the correct output as expected:

./example_cuda_exe 
Running kernel with config: <<<1, 32>>>
4.000000 6.000000 8.000000 10.000000 12.000000 14.000000

Why aren’t vector_add and add_one showing up on the profiler timeline? Or, when they do appear, why are all kernels shown as “Run” under dGPU?

I tried running the example code above on another computer, building with:
/usr/local/cuda/bin/nvcc -x cu -arch=sm_60 -rdc=true -lcudadevrt foo.cu
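On the V100 machine the same build would presumably need the Volta architecture flag instead (sm_70 for that card is my assumption):

```shell
# Same build line, targeting Volta (the V100 is compute capability 7.0)
/usr/local/cuda/bin/nvcc -x cu -arch=sm_70 -rdc=true -lcudadevrt foo.cu
```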

Running with nvprof (nsys isn’t installed there) showed both kernels as expected. So something strange is happening on the first computer that prevents nvprof/nsys from correctly profiling the kernels.

The profiler works on the computer with a P100 but not on the computer with a V100. Does Nsight Systems not support CDP on cards with compute capability >= 7.0?

Running with nvprof on the V100 I get this warning:
Warning: CDP tracing and profiling are not supported on devices with compute capability 7.0 and later.

With nsys however, I don’t see any warning.

Hi, I think I have the same problem as you: I tried running some sample projects that use CDP (cdpQuadtree, cdpBezierTesselation) on Windows with an RTX 3070 and also see no kernels in the timeline.
The screenshot is from tracing the cdpQuadtree sample:

The warnings are only about Unified Memory tracing not working, because this system does not support P2P access.

Versions:
Driver: 357.30
Nsight Systems: 2020.4.1
CUDA Toolkit: 11.1

Other samples (like eigenvalues) work fine and their kernels show up, so the problem seems to be caused by the use of dynamic parallelism. I would not mind if child kernels could not be displayed, but having every kernel disappear from the timeline because a project contains a single CDP kernel does not seem right.

If anyone has a fix, thinks I missed something, or needs further information, I would be glad to hear from you.

Best,

Michael

I can confirm that Nsight Systems does not profile ANY kernels from a library/.cu file where CDP is used. Note that the kernel using CDP does not even need to be called by the parent app; its mere presence in the .cu file is sufficient to make all kernels from that .cu file unprofilable. Kernels from other libraries, e.g. cuDNN or cuBLAS, are shown just fine.

My environment: Windows 10, RTX 2060, CUDA 11.1 (with Nsight Systems 2020.3.4)