Persistent Kernel does not work properly on some GPUs

I’m working on a project that needs a persistent kernel to reduce the latency. However, I found that this technique only works on certain GPUs. I wrote two kernels, the first one is called PersistentKernel, which runs forever until it receives a signal to ask it to stop. The second kernel runs a short-time job and exits after finished.

#include <unistd.h>
#include <string>
#include <cuda.h>

#define CUDA_CHECK(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


__global__ void OneTimeKernel(int iter) {
    __nanosleep(10 * 1000 * 1000);
    printf("One Time Kernel Finished, %d\n", iter);
}


__global__ void PersistentKernel(volatile int *signal) {
    int iter = 0;
    while (*signal) {
        __nanosleep(10 * 1000 * 1000);
        iter++;
        if (iter % 10 == 0) {
            printf("Persistent kernel is running, %d\n", iter);
        }
    }
}

int main(int argc, char **argv) {
    cudaStream_t stream, stream_background;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&stream_background, cudaStreamNonBlocking);
    CUDA_CHECK("Create streams");
    volatile int *signal;
    cudaMallocManaged(&signal, sizeof(int));
    CUDA_CHECK("Malloc");

    *signal = 1;

    PersistentKernel<<<1, 1, 0, stream_background>>>(signal);

    int n = 100;
    for (int i = 0; i < n; i++) {
        OneTimeKernel<<<1, 1, 0, stream>>>(i);
    }

    cudaStreamSynchronize(stream);
    CUDA_CHECK("One Time Kernel");
    *signal = 0;
    cudaStreamSynchronize(stream_background);
    CUDA_CHECK("Persistent Kernel");
    return 0;
}

This piece of code works fine on a machine with A100.

One Time Kernel Finished, 0
One Time Kernel Finished, 1
One Time Kernel Finished, 2
One Time Kernel Finished, 3
One Time Kernel Finished, 4
One Time Kernel Finished, 5
One Time Kernel Finished, 6
One Time Kernel Finished, 7
One Time Kernel Finished, 8
One Time Kernel Finished, 9
Persistent kernel is running, 10
One Time Kernel Finished, 10
One Time Kernel Finished, 11
One Time Kernel Finished, 12
One Time Kernel Finished, 13
One Time Kernel Finished, 14
One Time Kernel Finished, 15
One Time Kernel Finished, 16
One Time Kernel Finished, 17
One Time Kernel Finished, 18
One Time Kernel Finished, 19
Persistent kernel is running, 20
One Time Kernel Finished, 20
One Time Kernel Finished, 21
One Time Kernel Finished, 22
One Time Kernel Finished, 23
One Time Kernel Finished, 24
One Time Kernel Finished, 25
One Time Kernel Finished, 26
One Time Kernel Finished, 27
One Time Kernel Finished, 28
One Time Kernel Finished, 29
Persistent kernel is running, 30
One Time Kernel Finished, 30
One Time Kernel Finished, 31
One Time Kernel Finished, 32
One Time Kernel Finished, 33
One Time Kernel Finished, 34
One Time Kernel Finished, 35
One Time Kernel Finished, 36
One Time Kernel Finished, 37
One Time Kernel Finished, 38
One Time Kernel Finished, 39
Persistent kernel is running, 40
One Time Kernel Finished, 40
One Time Kernel Finished, 41
One Time Kernel Finished, 42
One Time Kernel Finished, 43
One Time Kernel Finished, 44
One Time Kernel Finished, 45
One Time Kernel Finished, 46
One Time Kernel Finished, 47
One Time Kernel Finished, 48
One Time Kernel Finished, 49
Persistent kernel is running, 50
One Time Kernel Finished, 50
One Time Kernel Finished, 51
One Time Kernel Finished, 52
One Time Kernel Finished, 53
One Time Kernel Finished, 54
One Time Kernel Finished, 55
One Time Kernel Finished, 56
One Time Kernel Finished, 57
One Time Kernel Finished, 58
One Time Kernel Finished, 59
One Time Kernel Finished, 60
Persistent kernel is running, 60
One Time Kernel Finished, 61
One Time Kernel Finished, 62
One Time Kernel Finished, 63
One Time Kernel Finished, 64
One Time Kernel Finished, 65
One Time Kernel Finished, 66
One Time Kernel Finished, 67
One Time Kernel Finished, 68
One Time Kernel Finished, 69
Persistent kernel is running, 70
One Time Kernel Finished, 70
One Time Kernel Finished, 71
One Time Kernel Finished, 72
One Time Kernel Finished, 73
One Time Kernel Finished, 74
One Time Kernel Finished, 75
One Time Kernel Finished, 76
One Time Kernel Finished, 77
One Time Kernel Finished, 78
One Time Kernel Finished, 79
Persistent kernel is running, 80
One Time Kernel Finished, 80
One Time Kernel Finished, 81
One Time Kernel Finished, 82
One Time Kernel Finished, 83
One Time Kernel Finished, 84
One Time Kernel Finished, 85
One Time Kernel Finished, 86
One Time Kernel Finished, 87
One Time Kernel Finished, 88
One Time Kernel Finished, 89
Persistent kernel is running, 90
One Time Kernel Finished, 90
One Time Kernel Finished, 91
One Time Kernel Finished, 92
One Time Kernel Finished, 93
One Time Kernel Finished, 94
One Time Kernel Finished, 95
One Time Kernel Finished, 96
One Time Kernel Finished, 97
One Time Kernel Finished, 98
One Time Kernel Finished, 99
Persistent kernel is running, 100

However, when I run it on another machine with 3090, the PersistentKernel like runs forever and the OneTime kernel does not have a chance to run at all even though I launch the kernels from different streams.

Persistent kernel is running, 10
Persistent kernel is running, 20
Persistent kernel is running, 30
Persistent kernel is running, 40
Persistent kernel is running, 50
Persistent kernel is running, 60
Persistent kernel is running, 70
....
Persistent kernel is running, 1930
Persistent kernel is running, 1940
^C

My question is: does this technique only work for specific GPU models, or it has something to do with drivers/configurations?

Machine configurations:

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal
Linux 5.15.0-79-generic #86~20.04.2-Ubuntu SMP Mon Jul 17 23:27:17 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

==============NVSMI LOG==============

Timestamp                                 : Tue Aug 22 14:13:53 2023
Driver Version                            : 535.86.10
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3090
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-80d8285b-3139-a256-3aef-b1ce4855be65
    Minor Number                          : 0
    VBIOS Version                         : 94.02.59.00.D6
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 2204-300-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x38801028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24576 MiB
        Reserved                          : 324 MiB
        Used                              : 26 MiB
        Free                              : 24225 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 12 MiB
        Free                              : 244 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 29 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 28.32 W
        Current Power Limit               : 350.00 W
        Requested Power Limit             : 350.00 W
        Default Power Limit               : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 0 MHz
        SM                                : 0 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 0.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1728
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 9 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1920
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 6 MiB

I can reproduce the observation and at the moment I am unable to explain it. If you would like a workaround, try this:

*signal = 1;
OneTimeKernel<<<1, 1, 0, stream>>>(-1); // ADD THIS LINE
PersistentKernel<<<1, 1, 0, stream_background>>>(signal);

int n = 100;
for (int i = 0; i < n; i++) {
    OneTimeKernel<<<1, 1, 0, stream>>>(i);
}

I understand that work-around may not be acceptable for you, and at the moment I repeat that I am unable to explain this observation. You may wish to file a bug.

1 Like

Thank you, Robert. That workaround is magic! I’ve also filed a bug. I’m looking forward to your further investigations.

I am also looking forward to an explanation of this magic solution (post). Looks like only some cards have this problem.

This appears to be related to CUDA lazy module loading. The topic is covered in a few places, including here and here. Note that the changes were initially introduced in the CUDA 11.7 timeframe as an opt-in and then became default in the CUDA 12.2 timeframe. So one possible source of differences in observations may be the version of CUDA in each case.

Briefly, for the purpose of this discussion, a module can be thought of as the code for a kernel. In some cases, loading that kernel code may require a synchronization operation on the GPU. Synchronization basically means that all code execution activity must stop, before the sync op can complete. Operations that may introduce such synchronization that may be familiar include cudaMalloc and related, for example.

Prior to the changes introduced in CUDA 11.7 timeframe and with CUDA runtime API in view, module loading would typically be accomplished all at once at the point of CUDA initialization; typically the first CUDA runtime API call in your program/process. With all the modules loaded, there would be no need for a synchronization to load any module.

After CUDA 11.7 (opt-in) or CUDA 12.2 (default), module loading would not necessarily all be performed at once, at the beginning. Instead, some of it could be done in a “lazy” fashion, which we can interpret for this discussion as meaning “on-demand”. So the module for a kernel might get loaded the first time you call that kernel.

If that module load requires a synchronization, then all GPU execution activity in that context/process must stop, in order for that kernel to load, and subsequently run.

So in the case we have here, the first kernel (persistent) starts to run. The second kernel must begin running to avoid a hang, because we have a stream sync operation later on the stream that kernel is launched into. With lazy loading, this particular 2nd kernel seems to require a sync to load the module. But the sync waits forever because the first launched kernel “never stops”. As a result, we get a hang. The first kernel never stops, the sync is required at the point of the second kernel launch due to lazy loading, and the sync never completes. Since the sync never completes, the second kernel never starts running, so it never completes, so we hang at the stream sync point.

I won’t be able to argue the merits of this. There are certainly some possible benefits to lazy module loading. Other viewpoints are probably valid also.

At the CUDA 11.7 point, you had to opt-in to this behavior. At the CUDA 12.2 point, you have to opt-out of this behavior.

You can opt-out of this behavior by using a CUDA environment variable with your application launch:

CUDA_MODULE_LOADING=EAGER ./my_app

According to my testing, this fixes the issue for this case.

The workaround that I initially pointed out works because:

  1. reversing the order of the kernel launch forces the “OneTimeKernel” to module-load. Thereafter it is loaded. This kernel completes in a short amount of time, normally.
  2. the “PersistentKernel” may also need a module load, but this is OK because we can complete the sync process: the “OneTimeKernel” will finish.
  3. Once modules for both kernels are loaded, there is no longer any interaction with the module loading system, and so subsequent persistent/concurrent activity works “as expected”

You can emulate eager loading, without use of the env var, by using cudaFuncGetAttributes() on all needed kernels, prior to entering any concurrent execution areas.

Lazy loading now has its own full section in the programming guide.

5 Likes

Thank you, Robert. I’ve verified the issue is caused by the lazy loading. The links you posted are really helpful. I learned something from here.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.