No CUDA kernels shown in nsys profiler timeline when using dynamic parallelism

gaggle.gaggle · May 14, 2020, 2:10am

Running Nsight Systems fails to show my kernel in the timeline view when there is dynamic parallelism in my code. Sometimes, providing --gpuctxsw=true --trace-fork-before-exec=true to the profiler causes all kernel activity to be displayed under “Run” and not by function name, as in the image below.

More often I get no indication from the timeline that my CUDA kernel ran. It’s very similar to the image above except the dGPU (Tesla V100-PCIE-32GB) sections are not present.

Here is a minimal example to reproduce this result:

#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda.h>

__global__ void add_one(float *val)
{
    *val = *val + 1;
}

__global__ void vector_add(const float *lhs, const float *rhs, float *result, std::size_t len)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < len; i += stride)
    {
        result[i] = lhs[i] + rhs[i];
        add_one<<<1,1>>>(result + i); // works fine when this line is commented
    }
}

void add(const float *lhs, const float *rhs, float *result, std::size_t len)
{
    float *lhs_dev = nullptr, *rhs_dev = nullptr, *result_dev = nullptr;
    cudaMalloc(&lhs_dev, len * sizeof(float));
    cudaMalloc(&rhs_dev, len * sizeof(float));
    cudaMalloc(&result_dev, len * sizeof(float));
    cudaMemcpy(lhs_dev, lhs, len * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(rhs_dev, rhs, len * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32);
    dim3 grid((len + block.x - 1) / block.x);

    printf("Running kernel with config: <<<%d, %d>>>>\n", grid.x, block.x);
    vector_add<<<grid, block>>>(lhs_dev, rhs_dev, result_dev, len);

    cudaDeviceSynchronize();

    cudaMemcpy(result, result_dev, len * sizeof(float), cudaMemcpyDeviceToHost);
}

int main(void)
{
    constexpr int size = 6;
    float lhs[size] = {1, 2, 3, 4, 5, 6};
    float rhs[size] = {2, 3, 4, 5, 6, 7};
    float result[size];

    add(lhs, rhs, result, size);

    printf("%f %f %f %f %f %f\n", result[0], result[1], result[2], result[3], result[4], result[5]);

    return 0;
}

Running the program, I get the correct output as expected:

./example_cuda_exe 
Running kernel with config: <<<1, 32>>>>
4.000000 6.000000 8.000000 10.000000 12.000000 14.000000

Why aren’t vector_add and add_one showing up on the profiler timeline? Or when they do, why do all kernels show as Run under dGPU?

gaggle.gaggle · May 15, 2020, 11:26pm

I tried running the example code above on another computer, building with:
/usr/local/cuda/bin/nvcc -x cu -arch=sm_60 -rdc=true -lcudadevrt foo.cu

Running with nvprof (nsys isn’t installed) showed the 2 kernels as expected. So something strange is happening on the first computer that prevents nvprof/nsys from correctly profiling the kernels.

gaggle.gaggle · May 20, 2020, 9:27pm

The profiler works on the computer with a P100, but doesn’t on the computer with a V100. Does NSight Systems not support CDP on cards with Compute Capability >= 7?

Running with nvprof on the V100 I get this warning:
Warning: CDP tracing and profiling are not supported on devices with compute capability 7.0 and later.

With nsys however, I don’t see any warning.

michaelwechner · November 30, 2020, 1:57pm

Hi, I think i have the same problem as you: I tried running some sample-projects using CDP (cdpQuadtree, cdpBezierTesselation) on Windows with a RTX3070 and also see no kernels in the timeline:
The Screenshot is from tracing the sample cdpQuadtree:

The warnings are just about Unified Memory tracing not working, because this system does not support P2P Access.

Versions:
Driver: 357.30
Nsight Systems: 2020.4.1
CUDA Toolkit: 11.1

other Samples (like eigenvalues or others) work fine and the kernels show up so it seems to be caused by the usage of dynamic paralellism: I would not care if child kernels cannot be displayed, but all kernels disappearing from the timeline by having a single CDP-Kernel in a project does not seem right.

If anyone has fixes, think I missed something or needs further information, I would be glad to hear from you.

Best,

Michael

mbrzeszcz · January 7, 2021, 7:59am

I can confirm that Nsight Systems does not profile ANY kernels from a library/.cu file where CDP is used. Note that the kernel using the CDP does not need to be used in the parent app, its presence in the .cu file is sufficient for all kernels from the .cu to be unprofilable. Kernels from other libs, eg. cudnn or cublas are shown just fine.

My env: Windows 10, RTX 2060, cuda 11.1 (with nsight systems 2020.3.4)

Topic		Replies	Views
How to profile dynamic parallelism CUPTI – CUDA Profiler Tools Interface	9	2337	November 29, 2023
Nsight Systems doesn't profile kernels Profiling Linux Targets	13	3243	January 27, 2022
Nsys doesn't show cuda kernel and memory data Profiling Linux Targets cuda , kernel	9	25	December 7, 2024
No kernels were profiled warning/problem Nsight Compute	17	9796	December 28, 2021
Kernels in CUDA streams seems not running in parallel Profiling Linux Targets	8	772	April 7, 2024
Nsys does not show CUDA kernels Profiling Linux Targets	6	1214	December 12, 2022
Nsight-Compute returns “No kernels were profiled” warning Nsight Compute	9	1233	July 27, 2023
Missing kernels in NSight Profiling Nsight Visual Studio Edition	4	2005	October 2, 2015
nsys CUDA trace works for threads, but not for subprocesses Profiling Linux Targets	3	2297	May 13, 2019
Nsight Systems does not show CUDA and other accelerator trace simultaneously Profiling Embedded Targets	9	2417	November 3, 2022

No CUDA kernels shown in nsys profiler timeline when using dynamic parallelism

Related topics