Hi all,
I’m running into an issue with a persistent CUDA kernel on an NVIDIA H100 NVL: the kernel doesn’t seem to see flag updates from the host, so it gets stuck in its active state. I’ve been trying to debug this for a while and could use some expert insight, in case something about Hopper’s memory model is relevant here.
System Details:
- GPU: NVIDIA H100 NVL (Compute Capability 9.0)
- Driver: 565.57.01
- CUDA Toolkit: 12.7
- Compilation: nvcc -o test test.cu -arch=sm_90
Problem Description:
I’m testing a single persistent kernel that runs in a loop, controlled by two flags (activeFlag and doneFlag) that live in mapped host memory allocated with cudaHostAlloc and cudaHostAllocMapped. The kernel should:
- Launch and wait in a loop.
- Increment a counter when *activeFlag == 1.
- Exit when *doneFlag == 1.
The host toggles *h_active between 0 and 1 for three cycles, then sets *h_done = 1 to terminate. However, the kernel:
- Launches successfully (“Kernel launched” prints).
- Sees the initial *h_active = 1 and starts incrementing the counter.
- Never sees subsequent updates (e.g., *h_active = 0 or *h_done = 1) and gets stuck printing “Kernel counter” values indefinitely (I’ve seen it pass 25,931,000).
Code:
Here’s a simplified version of my code:
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
__global__ void persistentKernel(int *counter, volatile int *activeFlag, volatile int *doneFlag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Kernel launched\n");
        long long iter = 0;
        while (*doneFlag == 0) {
            __threadfence();
            int active = *activeFlag;
            if (iter % 100000 == 0) { // throttle the state print
                printf("Kernel state - active: %d, done: %d\n", active, *doneFlag);
            }
            if (active == 1) {
                int val = atomicAdd(counter, 1);
                if (val % 100000 == 0) {
                    printf("Kernel counter: %d\n", val);
                }
            }
            for (volatile int i = 0; i < 5000000; i++); // crude busy-wait to slow the polling
            iter++;
        }
        printf("Kernel terminating\n");
    }
}
void checkCudaError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
int main() {
    checkCudaError(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync | cudaDeviceMapHost),
                   "cudaSetDeviceFlags");

    int *h_counter, *h_active, *h_done;
    checkCudaError(cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocMapped), "cudaHostAlloc h_counter");
    checkCudaError(cudaHostAlloc(&h_active, sizeof(int), cudaHostAllocMapped), "cudaHostAlloc h_active");
    checkCudaError(cudaHostAlloc(&h_done, sizeof(int), cudaHostAllocMapped), "cudaHostAlloc h_done");
    *h_counter = 0; *h_active = 0; *h_done = 0;

    // Device-side aliases for the mapped host allocations.
    int *d_counter, *d_active, *d_done;
    checkCudaError(cudaHostGetDevicePointer(&d_counter, h_counter, 0), "cudaHostGetDevicePointer d_counter");
    checkCudaError(cudaHostGetDevicePointer(&d_active, h_active, 0), "cudaHostGetDevicePointer d_active");
    checkCudaError(cudaHostGetDevicePointer(&d_done, h_done, 0), "cudaHostGetDevicePointer d_done");

    cudaStream_t stream;
    checkCudaError(cudaStreamCreate(&stream), "cudaStreamCreate");

    printf("Launching kernel...\n");
    persistentKernel<<<1, 1, 0, stream>>>(d_counter, d_active, d_done);
    checkCudaError(cudaGetLastError(), "Kernel launch");
    sleep(2); // give the kernel time to start polling

    for (int i = 0; i < 3; i++) {
        printf("\nCycle %d - Activating\n", i + 1);
        *h_active = 1; cudaStreamSynchronize(stream);
        sleep(3);
        printf("Cycle %d - Deactivating\n", i + 1);
        *h_active = 0; cudaStreamSynchronize(stream);
        sleep(1);
    }

    printf("Terminating...\n");
    *h_done = 1; cudaStreamSynchronize(stream);
    printf("Final counter: %d\n", *h_counter);

    cudaFreeHost(h_counter); cudaFreeHost(h_active); cudaFreeHost(h_done);
    cudaStreamDestroy(stream);
    return 0;
}
Observed Behavior:
Launching kernel...
Kernel launched
Kernel state - active: 0, done: 0
Cycle 1 - Activating
Kernel state - active: 1, done: 0
Kernel counter: 100000
Kernel counter: 200000
[...]
- After *h_active = 1, the kernel prints “Kernel counter” values indefinitely (I’ve seen it pass 25,931,000), the host never reaches “Cycle 1 - Deactivating”, and “Kernel terminating” never appears.
What I’ve Tried:
- Declared the flags volatile so the compiler can’t cache their values in registers (a system-scope atomics alternative is sketched after this list).
- Added __threadfence() in the polling loop for memory visibility.
- Called cudaStreamSynchronize on the host after each flag update.
- Used mapped host memory so host and device touch the same allocation directly.
- Added a delay loop in the kernel to slow down the polling.
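One direction I haven’t tried yet: replacing the volatile reads and __threadfence() with system-scope atomics from libcu++ (cuda::atomic_ref with cuda::thread_scope_system), which should give well-defined acquire semantics for host-to-device visibility. Here’s a minimal sketch of what I have in mind; the kernel name pollFlags and its structure are illustrative, not from my real code:

#include <cuda/atomic>
#include <cstdio>

__global__ void pollFlags(int *counter, int *activeFlag, int *doneFlag) {
    // System-scope atomic views of the mapped-memory flags, so host stores
    // become visible to the device without volatile + explicit fences.
    cuda::atomic_ref<int, cuda::thread_scope_system> done(*doneFlag);
    cuda::atomic_ref<int, cuda::thread_scope_system> active(*activeFlag);
    while (done.load(cuda::std::memory_order_acquire) == 0) {
        if (active.load(cuda::std::memory_order_acquire) == 1) {
            atomicAdd_system(counter, 1); // system-scope increment of the mapped counter
        }
    }
    printf("Kernel terminating\n");
}

Would this be the more robust pattern here, or is my problem more likely in how the host side synchronizes with the running kernel?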
Any suggestions or insights would be greatly appreciated! I’m happy to provide more details or test alternative approaches.
Thank you!
Jules