I’m working on a project that needs a persistent kernel to reduce the latency. However, I found that this technique only works on certain GPUs. I wrote two kernels, the first one is called PersistentKernel, which runs forever until it receives a signal to ask it to stop. The second kernel runs a short-time job and exits after finished.
#include <unistd.h>
#include <string>
#include <cuda.h>
#define CUDA_CHECK(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__global__ void OneTimeKernel(int iter) {
__nanosleep(10 * 1000 * 1000);
printf("One Time Kernel Finished, %d\n", iter);
}
__global__ void PersistentKernel(volatile int *signal) {
int iter = 0;
while (*signal) {
__nanosleep(10 * 1000 * 1000);
iter++;
if (iter % 10 == 0) {
printf("Persistent kernel is running, %d\n", iter);
}
}
}
int main(int argc, char **argv) {
cudaStream_t stream, stream_background;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream_background, cudaStreamNonBlocking);
CUDA_CHECK("Create streams");
volatile int *signal;
cudaMallocManaged(&signal, sizeof(int));
CUDA_CHECK("Malloc");
*signal = 1;
PersistentKernel<<<1, 1, 0, stream_background>>>(signal);
int n = 100;
for (int i = 0; i < n; i++) {
OneTimeKernel<<<1, 1, 0, stream>>>(i);
}
cudaStreamSynchronize(stream);
CUDA_CHECK("One Time Kernel");
*signal = 0;
cudaStreamSynchronize(stream_background);
CUDA_CHECK("Persistent Kernel");
return 0;
}
This piece of code works fine on a machine with A100.
One Time Kernel Finished, 0
One Time Kernel Finished, 1
One Time Kernel Finished, 2
One Time Kernel Finished, 3
One Time Kernel Finished, 4
One Time Kernel Finished, 5
One Time Kernel Finished, 6
One Time Kernel Finished, 7
One Time Kernel Finished, 8
One Time Kernel Finished, 9
Persistent kernel is running, 10
One Time Kernel Finished, 10
One Time Kernel Finished, 11
One Time Kernel Finished, 12
One Time Kernel Finished, 13
One Time Kernel Finished, 14
One Time Kernel Finished, 15
One Time Kernel Finished, 16
One Time Kernel Finished, 17
One Time Kernel Finished, 18
One Time Kernel Finished, 19
Persistent kernel is running, 20
One Time Kernel Finished, 20
One Time Kernel Finished, 21
One Time Kernel Finished, 22
One Time Kernel Finished, 23
One Time Kernel Finished, 24
One Time Kernel Finished, 25
One Time Kernel Finished, 26
One Time Kernel Finished, 27
One Time Kernel Finished, 28
One Time Kernel Finished, 29
Persistent kernel is running, 30
One Time Kernel Finished, 30
One Time Kernel Finished, 31
One Time Kernel Finished, 32
One Time Kernel Finished, 33
One Time Kernel Finished, 34
One Time Kernel Finished, 35
One Time Kernel Finished, 36
One Time Kernel Finished, 37
One Time Kernel Finished, 38
One Time Kernel Finished, 39
Persistent kernel is running, 40
One Time Kernel Finished, 40
One Time Kernel Finished, 41
One Time Kernel Finished, 42
One Time Kernel Finished, 43
One Time Kernel Finished, 44
One Time Kernel Finished, 45
One Time Kernel Finished, 46
One Time Kernel Finished, 47
One Time Kernel Finished, 48
One Time Kernel Finished, 49
Persistent kernel is running, 50
One Time Kernel Finished, 50
One Time Kernel Finished, 51
One Time Kernel Finished, 52
One Time Kernel Finished, 53
One Time Kernel Finished, 54
One Time Kernel Finished, 55
One Time Kernel Finished, 56
One Time Kernel Finished, 57
One Time Kernel Finished, 58
One Time Kernel Finished, 59
One Time Kernel Finished, 60
Persistent kernel is running, 60
One Time Kernel Finished, 61
One Time Kernel Finished, 62
One Time Kernel Finished, 63
One Time Kernel Finished, 64
One Time Kernel Finished, 65
One Time Kernel Finished, 66
One Time Kernel Finished, 67
One Time Kernel Finished, 68
One Time Kernel Finished, 69
Persistent kernel is running, 70
One Time Kernel Finished, 70
One Time Kernel Finished, 71
One Time Kernel Finished, 72
One Time Kernel Finished, 73
One Time Kernel Finished, 74
One Time Kernel Finished, 75
One Time Kernel Finished, 76
One Time Kernel Finished, 77
One Time Kernel Finished, 78
One Time Kernel Finished, 79
Persistent kernel is running, 80
One Time Kernel Finished, 80
One Time Kernel Finished, 81
One Time Kernel Finished, 82
One Time Kernel Finished, 83
One Time Kernel Finished, 84
One Time Kernel Finished, 85
One Time Kernel Finished, 86
One Time Kernel Finished, 87
One Time Kernel Finished, 88
One Time Kernel Finished, 89
Persistent kernel is running, 90
One Time Kernel Finished, 90
One Time Kernel Finished, 91
One Time Kernel Finished, 92
One Time Kernel Finished, 93
One Time Kernel Finished, 94
One Time Kernel Finished, 95
One Time Kernel Finished, 96
One Time Kernel Finished, 97
One Time Kernel Finished, 98
One Time Kernel Finished, 99
Persistent kernel is running, 100
However, when I run it on another machine with 3090, the PersistentKernel like runs forever and the OneTime kernel does not have a chance to run at all even though I launch the kernels from different streams.
Persistent kernel is running, 10
Persistent kernel is running, 20
Persistent kernel is running, 30
Persistent kernel is running, 40
Persistent kernel is running, 50
Persistent kernel is running, 60
Persistent kernel is running, 70
....
Persistent kernel is running, 1930
Persistent kernel is running, 1940
^C
My question is: does this technique only work for specific GPU models, or it has something to do with drivers/configurations?
Machine configurations:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
Linux 5.15.0-79-generic #86~20.04.2-Ubuntu SMP Mon Jul 17 23:27:17 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
==============NVSMI LOG==============
Timestamp : Tue Aug 22 14:13:53 2023
Driver Version : 535.86.10
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 3090
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-80d8285b-3139-a256-3aef-b1ce4855be65
Minor Number : 0
VBIOS Version : 94.02.59.00.D6
MultiGPU Board : No
Board ID : 0x100
Board Part Number : N/A
GPU Part Number : 2204-300-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x220410DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x38801028
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 30 %
Performance State : P8
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 24576 MiB
Reserved : 324 MiB
Used : 26 MiB
Free : 24225 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 12 MiB
Free : 244 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 29 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 28.32 W
Current Power Limit : 350.00 W
Requested Power Limit : 350.00 W
Default Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 350.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 0 MHz
SM : 0 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 9751 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 0.000 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1728
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 9 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1920
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 6 MiB