Kernels launched after a persistent kernel are not executed unless running under Nsight Systems

Hi, I have a persistent kernel that keeps checking a variable until a condition is satisfied. However, kernels launched after it are no longer executed, even though they are assigned to a different stream, and the program gets stuck on the asynchronous kernel launch. But when I run it under Nsight Systems, it works just fine.
I am running on Ubuntu 20.04 with an A6000 GPU (compute_86) and CUDA 12.2. The following sample code causes the hang.


__global__ void myKernel(size_t total_windows) {
    printf("my kernel executed\n");
    if (!total_windows) return;
}


__global__ void pkCheck(volatile size_t *curr_window, size_t total_windows) {

    while (true)
    {
        __nanosleep(10000000);
        size_t num_windows = *curr_window;
        if (num_windows >= total_windows) {
            printf("===all windows processed\n");
            break;
        }
    }
}

int run(FFTPlan &plan) {
    // windows: 2048*5000
    size_t data[2048] = {0};

    size_t *in, *out;
    cudaMalloc(&in, sizeof(size_t) * plan.total_samples);
    cudaMalloc(&out, sizeof(size_t) * plan.total_samples);
    size_t total_windows = plan.total_samples / plan.window_size;
    size_t *curr_window;
    cudaMallocManaged(&curr_window, sizeof(size_t), cudaMemAttachHost);
    curr_window[0] = 0;
    cudaCheck();

    cudaStream_t proc_stream, control_stream;
    cudaStreamCreateWithFlags(&control_stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&proc_stream, cudaStreamNonBlocking);
    cudaCheck();

    printf("pkCheck\n");
    pkCheck<<<1, 1, 0, control_stream>>>(curr_window, total_windows);
    cudaCheck();

    printf("myKernel\n");
    myKernel<<<1, 1, 0, proc_stream>>>(total_windows);
    cudaCheck();

    printf("copying\n");
    for (int i = 0; i < total_windows; i++) {
        // cudaMemset(in+i*plan.window_size, 0, sizeof(size_t) * plan.window_size); // This stuck on window 1016
        cudaMemcpy(in+i*plan.window_size, data, sizeof(size_t) * plan.window_size, cudaMemcpyHostToDevice); // This stuck on window 0
        // cudaMemcpyAsync(in+i*plan.window_size, data, sizeof(size_t) * plan.window_size, cudaMemcpyHostToDevice, proc_stream); // This stuck on window 249
        curr_window[0]++;
        printf("window: %d\n", i);
        cudaCheck();
    }
    cudaDeviceSynchronize();
    return 0;
}

The output looks like this:

$ ./mycode
pkCheck
myKernel
copying
window: 0

It doesn’t proceed any further, and myKernel is never executed.
However, if I run it under Nsight Systems, it finishes as I expect:

pkCheck
myKernel
copying
my kernel executed
window: 0
window: 1
window: 2
window: 3
....
window: 4998
window: 4999
===all windows processed
Generating '/home/liuxs/tmp/nsys-report-79f3.qdstrm'

[1/1] [========================100%] nsys_report.nsys-rep
Generated:
    XX/nsys_report.nsys-rep

Any idea what causes this weird behavior? What is the difference between running it directly and running it under nsys?

Did you check that your device is allowed to access managed memory allocated with the flag cudaMemAttachHost?
Can’t you just use the default flag cudaMemAttachGlobal?

Yes, I did. Actually, if I don’t launch the second kernel, the program finishes with "===all windows processed\n". I also tested without that flag; the behavior is the same.

I suppose the problem is caused by the stuck second kernel, and the memcpy is then blocked by it. In fact, if I don’t do any memcpy, the program finishes, but the second kernel isn’t executed until the first kernel exits.

....
window: 9994
window: 9995
window: 9996
window: 9997
window: 9998
window: 9999
===all windows processed
my kernel executed

Also, if the second kernel myKernel has been launched once before, it works. To give an example:

__global__ void myKernel(size_t total_windows) {
    if (!total_windows) return;
    else __nanosleep(total_windows*1000);
    printf("my kernel executed\n");
}


__global__ void pkCheck(volatile size_t *curr_window, size_t total_windows) {

    while (true)
    {
        __nanosleep(10000000);
        size_t num_windows = *curr_window;
        if (num_windows >= total_windows) {
            printf("===all windows processed\n");
            break;
        }
    }
}

int run(FFTPlan &plan) {

    size_t data[2048] = {0};

    size_t *in, *out;
    cudaMalloc(&in, sizeof(size_t) * plan.total_samples);
    cudaMalloc(&out, sizeof(size_t) * plan.total_samples);
    size_t total_windows = plan.total_samples / plan.window_size;
    size_t *curr_window;
    // cudaMallocManaged(&curr_window, sizeof(size_t), cudaMemAttachHost);
    cudaMallocManaged(&curr_window, sizeof(size_t));
    curr_window[0] = 0;
    cudaCheck();

    cudaStream_t proc_stream, control_stream;
    cudaStreamCreateWithFlags(&control_stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&proc_stream, cudaStreamNonBlocking);
    cudaCheck();

    printf("myKernel\n");
    myKernel<<<1, 1, 0, proc_stream>>>(0); // prelaunch
    cudaCheck();

    printf("pkCheck\n");
    pkCheck<<<1, 1, 0, control_stream>>>(curr_window, total_windows);
    cudaCheck();

    printf("myKernel\n");
    myKernel<<<1, 1, 0, proc_stream>>>(total_windows);
    cudaCheck();

    printf("copying\n");
    for (int i = 0; i < total_windows; i++) {
        // cudaMemset(in+i*plan.window_size, 0, sizeof(size_t) * plan.window_size); // This stuck on window 1016
        cudaMemcpy(in+i*plan.window_size, data, sizeof(size_t) * plan.window_size, cudaMemcpyHostToDevice); // This stuck on window 0
        // cudaMemcpyAsync(in+i*plan.window_size, data, sizeof(size_t) * plan.window_size, cudaMemcpyHostToDevice, proc_stream); // This stuck on window 249
        curr_window[0]++;
        printf("window: %d\n", i);
        cudaCheck();
    }
    cudaDeviceSynchronize();
    return 0;
}

The output looks like this:

myKernel
pkCheck
myKernel
copying
window: 0
window: 1
window: 2
window: 3
window: 4
window: 5
window: 6
....
window: 240
window: 241
my kernel executed
window: 242
window: 243
window: 244
window: 245
....
window: 4999
===all windows processed

It looks as if the kernel can proceed when its instructions have already been loaded onto the device; otherwise, some engine is taken up by the persistent kernel.

Please show a complete minimal example that reproduces your problem.

Are you on Windows using WSL?

I am on a Linux server with an A6000.
Here is the minimal reproducible code. You can compile it with nvcc test.cu -arch=sm_86 -rdc=true -lcudadevrt -std=c++14 -lineinfo and run it with ./a.out -b 100

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string>
#include "cuda.h"

using namespace std;

#define cudaCheck() {                                                   \
    cudaError_t err = cudaGetLastError();                               \
    if (cudaSuccess != err) {                                           \
        fprintf(stderr, "Error in %s:%i %s(): %s.\n", __FILE__, __LINE__,\
                __func__, cudaGetErrorString(err));                     \
        fflush(stderr);                                                 \
        exit(EXIT_FAILURE);                                             \
    }                                                                   \
}

struct Options
{
    size_t batch = 0;
    size_t size = 0;
    std::string filename;
};

__global__ void myKernel(size_t total_windows) {
    if (!total_windows) return;
    else __nanosleep(total_windows*1000);
    printf("my kernel executed\n");
}


__global__ void pkCheck(volatile size_t *curr_window, size_t total_windows) {

    while (true)
    {
        __nanosleep(10000000);
        size_t num_windows = *curr_window;
        if (num_windows >= total_windows) {
            printf("===all windows processed\n");
            break;
        }
    }
}

int run(size_t total_samples, size_t window_size) {

    size_t data[2048] = {0};

    size_t *in, *out;
    cudaMalloc(&in, sizeof(size_t) * total_samples);
    cudaMalloc(&out, sizeof(size_t) * total_samples);
    size_t total_windows = total_samples / window_size;
    size_t *curr_window;
    // cudaMallocManaged(&curr_window, sizeof(size_t), cudaMemAttachHost);
    cudaMallocManaged(&curr_window, sizeof(size_t));
    curr_window[0] = 0;
    cudaCheck();

    cudaStream_t proc_stream, control_stream;
    cudaStreamCreateWithFlags(&control_stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&proc_stream, cudaStreamNonBlocking);
    cudaCheck();

    // printf("myKernel\n");
    // myKernel<<<1, 1, 0, proc_stream>>>(0); // prelaunch
    // cudaCheck();

    printf("pkCheck\n");
    pkCheck<<<1, 1, 0, control_stream>>>(curr_window, total_windows);
    cudaCheck();

    printf("myKernel\n");
    myKernel<<<1, 1, 0, proc_stream>>>(total_windows);
    cudaCheck();

    printf("copying\n");
    for (int i = 0; i < total_windows; i++) {
        // cudaMemset(in+i*window_size, 0, sizeof(size_t) * window_size); // This stuck on window 1016
        cudaMemcpy(in+i*window_size, data, sizeof(size_t) * window_size, cudaMemcpyHostToDevice); // This stuck on window 0
        // cudaMemcpyAsync(in+i*window_size, data, sizeof(size_t) * window_size, cudaMemcpyHostToDevice, proc_stream); // This stuck on window 249
        curr_window[0]++;
        printf("window: %d\n", i);
        cudaCheck();
    }
    cudaDeviceSynchronize();
    return 0;
}

int main(int argc, char **argv)
{
    int index;
    int c;

    Options options;

    opterr = 0;

    while ((c = getopt(argc, argv, "b:s:f:")) != -1)
        switch (c)
        {
        case 'b':
            options.batch = atoi(optarg);
            printf("batch size: %s\n", optarg);
            break;
        case 's':
            options.size = atoi(optarg);
            printf("grid size: %s\n", optarg);
            break;
        case 'f':
            options.filename.assign(optarg);
            printf("filename: %s\n", optarg);
            break;
        case '?':
            if (optopt == 'c')
                fprintf(stderr, "Option -%c requires an argument.\n", optopt);
            else if (isprint(optopt))
                fprintf(stderr, "Unknown option `-%c'.\n", optopt);
            else
                fprintf(stderr,
                        "Unknown option character `\\x%x'.\n",
                        optopt);
            return 1;
        default:
            abort();
        }


    for (index = optind; index < argc; index++)
        printf("Non-option argument %s\n", argv[index]);

    run(options.batch * 2048, 2048);

    return 0;
}

With sm_86 / CUDA 12.1 and sm_89 / CUDA 12.2, the program produces normal output:

./a.out -b 100
batch size: 100
pkCheck
myKernel
copying
window: 0
window: 1
window: 2
window: 3
window: 4
window: 5
my kernel executed
window: 6
window: 7
....
window: 99
===all windows processed

Thanks for your reply. Could you tell me which Linux version you are using? I just tested on another machine with a 1050 Ti (sm_61, CUDA 12.0, Ubuntu 22.04), and it also works. I am wondering what the cause of this weird behavior is. I am pretty sure the A6000 is installed on a server with a complete Linux system.

$ uname -r
5.15.0-76-generic
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

Is there any way to see what the GPU is doing without running under nsys? I didn’t get much useful information from cuda-gdb.

$ cuda-gdb --args ./a.out -b 2
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.2 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
--Type <RET> for more, q to quit, c to continue without paging--
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Using python library libpython3.8.so
Reading symbols from ./a.out...
(cuda-gdb) b myKernel
Breakpoint 1 at 0xb761: file test.cu, line 40.
(cuda-gdb) r         
Starting program: /home/liuxs/profile/a.out -b 2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
batch size: 2
[New Thread 0x7ffff539c000 (LWP 894832)]
[Detaching after fork from child process 894833]
[New Thread 0x7ffff482f000 (LWP 894858)]
[New Thread 0x7fffe9fff000 (LWP 894859)]
[New Thread 0x7fffe9399000 (LWP 894860)]
pkCheck
myKernel
^C
Thread 1 "a.out" received signal SIGINT, Interrupt.
0x00007ffff5f99444 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) bt
#0  0x00007ffff5f99444 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff62fb106 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff6097b0f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffff6097c7e in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffff6317702 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007ffff5f30dff in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007ffff60de744 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#7  0x0000555555588f4f in __cudart563 ()
#8  0x000055555558bbe4 in __cudart564 ()
#9  0x0000555555564c51 in __cudart808 ()
#10 0x00005555555c1c8b in cudaLaunchKernel ()
#11 0x000055555555fa49 in cudaLaunchKernel<char> (
--Type <RET> for more, q to quit, c to continue without paging--

It didn’t even reach my breakpoint in myKernel.

Hi liuxs, did you figure out the root cause of this? Your code works fine on a Linux machine with an A100 but gets stuck on another machine with a 3090. I guess it has something to do with driver configuration.

I am also able to run it on an A40 and a 1050 Ti, but it is still not working on an A6000 with sm_86, CUDA 12.2, driver 535.86.10.

Try forcing eager module loading:

CUDA_MODULE_LOADING=EAGER ./my_app

see here.
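An alternative to setting the environment variable (a sketch, assuming CUDA 12.x lazy module loading is what delays myKernel until the persistent kernel exits) is to force the kernel's module to load before launching pkCheck. The lazy-loading documentation notes that querying a kernel's attributes triggers the load, so something like this, called before the pkCheck launch, may work:

```cuda
// Sketch: preload myKernel's module so a later launch does not have to load it
// while pkCheck is occupying the device. Assumes lazy loading (the CUDA 12.x
// default); preloadMyKernel is a hypothetical helper name.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(size_t total_windows);  // defined elsewhere in test.cu

void preloadMyKernel() {
    cudaFuncAttributes attr;
    // cudaFuncGetAttributes forces the containing module to be loaded now,
    // rather than at the kernel's first launch.
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess)
        fprintf(stderr, "preload failed: %s\n", cudaGetErrorString(err));
}
```

This should have the same effect as the prelaunch trick shown earlier in the thread, without actually running the kernel.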

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.