Persistent Kernel Not Responding to Flag Updates on NVIDIA H100 NVL (CUDA 12.7)

Hi all,

I’m encountering an issue with a persistent CUDA kernel on an NVIDIA H100 NVL where it doesn’t seem to respond to flag updates from the host, causing it to get stuck in an active state. I’ve been trying to debug this for a while and could use some expert insights, especially given the H100’s Hopper architecture.

System Details:

  • GPU: NVIDIA H100 NVL (Compute Capability 9.0)
  • Driver: 565.57.01
  • CUDA Toolkit: 12.7
  • Compilation: nvcc -o test test.cu -arch=sm_90

Problem Description:

I’m testing a single persistent kernel that runs in a loop, controlled by two flags (activeFlag and doneFlag) mapped to host memory using cudaHostAlloc with cudaHostAllocMapped. The kernel should:

  1. Launch and wait in a loop.
  2. Increment a counter when *activeFlag == 1.
  3. Exit when *doneFlag == 1.

The host toggles *h_active between 0 and 1 for three cycles, then sets *h_done = 1 to terminate. However, the kernel:

  • Launches successfully (“Kernel launched” prints).
  • Sees the initial *h_active = 1 and starts incrementing the counter.
  • Never sees subsequent updates (e.g., *h_active = 0 or *h_done = 1), getting stuck printing “Kernel counter” values indefinitely (e.g., up to 25,931,000 and beyond).

Code:

Here’s a simplified version of my code:


#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

__global__ void persistentKernel(int *counter, volatile int *activeFlag, volatile int *doneFlag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Kernel launched\n");
        while (*doneFlag == 0) {
            __threadfence();
            int active = *activeFlag;
            if (active % 100000 == 0) {
                printf("Kernel state - active: %d, done: %d\n", active, *doneFlag);
            }
            if (active == 1) {
                int val = atomicAdd(counter, 1);
                if (val % 100000 == 0) {
                    printf("Kernel counter: %d\n", val);
                }
            }
            for (volatile int i = 0; i < 5000000; i++); // Delay
        }
        printf("Kernel terminating\n");
    }
}

void checkCudaError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync | cudaDeviceMapHost);

    
    int *h_counter, *h_active, *h_done;
    cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&h_active, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&h_done, sizeof(int), cudaHostAllocMapped);
    
    *h_counter = 0; *h_active = 0; *h_done = 0;
    
    int *d_counter, *d_active, *d_done;
    cudaHostGetDevicePointer(&d_counter, h_counter, 0);
    cudaHostGetDevicePointer(&d_active, h_active, 0);
    cudaHostGetDevicePointer(&d_done, h_done, 0);
    
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    
    printf("Launching kernel...\n");
    persistentKernel<<<1, 1, 0, stream>>>(d_counter, d_active, d_done);
    cudaGetLastError(); // Check launch
    
    sleep(2); cudaStreamSynchronize(stream);
    
    for (int i = 0; i < 3; i++) {
        printf("\nCycle %d - Activating\n", i + 1);
        *h_active = 1; cudaStreamSynchronize(stream);
        sleep(3);
        
        printf("Cycle %d - Deactivating\n", i + 1);
        *h_active = 0; cudaStreamSynchronize(stream);
        sleep(1);
    }
    
    printf("Terminating...\n");
    *h_done = 1; cudaStreamSynchronize(stream);
    
    printf("Final counter: %d\n", *h_counter);
    
    cudaFreeHost(h_counter); cudaFreeHost(h_active); cudaFreeHost(h_done);
    cudaStreamDestroy(stream);
    return 0;
}

Observed Behavior:

Launching kernel...
Kernel launched
Kernel state - active: 0, done: 0
Cycle 1 - Activating
Kernel state - active: 1, done: 0
Kernel counter: 100000
Kernel counter: 200000
[...]
  • After *h_active = 1, it prints “Kernel counter” values continuously (e.g., up to 25,931,000) and never progresses to “Cycle 1 - Deactivating” or “Kernel terminating”.

What I’ve Tried:

  • Used volatile on flags to prevent caching.
  • Added __threadfence() for memory visibility.
  • Synced with cudaStreamSynchronize after each flag update.
  • Used mapped host memory for direct host-device communication.
  • Added a delay loop in the kernel to slow it down.

Any suggestions or insights would be greatly appreciated! I’m happy to provide more details or test alternative approaches.

Thank you!

Jules

Perhaps try bypassing the L2 cache for those flag reads. You could look through the PTX instruction set for a way to do that.

That is not what that does. volatile means “this data object is subject to modification by an actor outside the scope of this code”, in consequence instructing the compiler that every access to the object must result in access to its assigned memory location (that is, it may not use a copy in a register instead). It has no bearing whatsoever on the handling of that object by the processor’s cache hierarchy. In other words, while volatile is necessary when multiple actors are operating on the same data object, it is often not sufficient just by itself.
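As a hypothetical illustration of that register-caching point (kernel names are mine, for illustration only):

```cuda
// Sketch only: without volatile, the compiler is free to load *flag once,
// keep the value in a register, and spin on that stale copy forever.
__global__ void spinWait(int *flag) {
    while (*flag == 0) { }   // load may be hoisted out of the loop
}

// With volatile, every iteration must re-issue the load to the flag's
// assigned memory location -- but this still says nothing about cache
// behavior or ordering relative to other memory operations.
__global__ void spinWaitVolatile(volatile int *flag) {
    while (*flag == 0) { }   // load re-issued on each iteration
}
```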

PTX offers various load instruction attributes (“cache operators”) that may be useful to address the issue at hand. I have not looked at them in over a decade. From a cursory look, .cv may be worth trying.
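For instance, a `.cv` load ("consider cached system-memory lines stale, fetch again") could be issued from CUDA C++ via inline PTX. A sketch, assuming the flag is a 32-bit int accessed through a global-space pointer (untested on my end; the function name is mine):

```cuda
// Sketch: load the flag with the .cv cache operator, bypassing
// potentially stale cached copies of the mapped system memory line.
__device__ int loadFlagCv(const int *flag) {
    int value;
    asm volatile("ld.global.cv.u32 %0, [%1];"
                 : "=r"(value)
                 : "l"(flag));
    return value;
}
```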

Note: I have not really looked at the code. It may have other issues.

You launch a kernel into a stream and then immediately do a cudaStreamSynchronize on that stream. The CPU thread cannot proceed beyond that point unless the kernel terminates. Your posted code therefore doesn’t make sense to me, and should not be able to produce the output you claim. This is a straightforward consequence of the stream semantics.

I get seemingly sensible behavior by commenting out all but the last instance of cudaStreamSynchronize().

The `if (active % 100000 == 0)` check in your kernel isn’t doing anything useful that I can see, in terms of “taming” the printf spew: when active is 0, that condition is true on every single iteration.

So I modified your kernel code a bit, to tame that spew. Here is my test case:

# cat t363.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

__global__ void persistentKernel(int *counter, volatile int *activeFlag, volatile int *doneFlag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Kernel launched\n");
        unsigned long long cnt = 0;
        while (*doneFlag == 0) {
            __threadfence();
            int active = *activeFlag;
            if (active == 0) {
                if (((cnt++)  % 100000) == 0)
                  printf("Kernel state - active: %d, done: %d\n", active, *doneFlag);
            }
            if (active == 1) {
                int val = atomicAdd(counter, 1);
                if (val % 100000 == 0) {
                    printf("Kernel counter: %d\n", val);
                }
            }
            for (volatile int i = 0; i < 5000000; i++); // Delay
        }
        printf("Kernel terminating\n");
    }
}

void checkCudaError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync | cudaDeviceMapHost);


    int *h_counter, *h_active, *h_done;
    cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&h_active, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&h_done, sizeof(int), cudaHostAllocMapped);

    *h_counter = 0; *h_active = 0; *h_done = 0;

    int *d_counter, *d_active, *d_done;
    cudaHostGetDevicePointer(&d_counter, h_counter, 0);
    cudaHostGetDevicePointer(&d_active, h_active, 0);
    cudaHostGetDevicePointer(&d_done, h_done, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    printf("Launching kernel...\n");
    persistentKernel<<<1, 1, 0, stream>>>(d_counter, d_active, d_done);
    cudaGetLastError(); // Check launch

    sleep(2); //cudaStreamSynchronize(stream);

    for (int i = 0; i < 3; i++) {
        printf("\nCycle %d - Activating\n", i + 1);
        *h_active = 1; //cudaStreamSynchronize(stream);
        sleep(3);

        printf("Cycle %d - Deactivating\n", i + 1);
        *h_active = 0;// cudaStreamSynchronize(stream);
        sleep(1);
    }

    printf("Terminating...\n");
    *h_done = 1; cudaStreamSynchronize(stream);

    printf("Final counter: %d\n", *h_counter);

    cudaFreeHost(h_counter); cudaFreeHost(h_active); cudaFreeHost(h_done);
    cudaStreamDestroy(stream);
    return 0;
}

# nvcc -o t363 t363.cu
# ./t363
Launching kernel...
Kernel launched
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0

Cycle 1 - Activating
Kernel counter: 0
Kernel counter: 100000
Kernel counter: 200000
Kernel counter: 300000
Kernel counter: 400000
Kernel counter: 500000
Kernel counter: 600000
Kernel counter: 700000
Kernel counter: 800000
Kernel counter: 900000
Kernel counter: 1000000
Kernel counter: 1100000
Kernel counter: 1200000
Kernel counter: 1300000
Cycle 1 - Deactivating
Kernel counter: 1400000
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0

Cycle 2 - Activating
Kernel state - active: 0, done: 0
Kernel counter: 1500000
Kernel counter: 1600000
Kernel counter: 1700000
Kernel counter: 1800000
Kernel counter: 1900000
Kernel counter: 2000000
Kernel counter: 2100000
Kernel counter: 2200000
Kernel counter: 2300000
Kernel counter: 2400000
Kernel counter: 2500000
Kernel counter: 2600000
Kernel counter: 2700000
Kernel counter: 2800000
Cycle 2 - Deactivating
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0

Cycle 3 - Activating
Kernel counter: 2900000
Kernel counter: 3000000
Kernel counter: 3100000
Kernel counter: 3200000
Kernel counter: 3300000
Kernel counter: 3400000
Kernel counter: 3500000
Kernel counter: 3600000
Kernel counter: 3700000
Kernel counter: 3800000
Kernel counter: 3900000
Kernel counter: 4000000
Kernel counter: 4100000
Kernel counter: 4200000
Cycle 3 - Deactivating
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Kernel state - active: 0, done: 0
Terminating...
Kernel state - active: 0, done: 0
Kernel terminating
Final counter: 4230989
#

Fully agree with njuffa.

Another consequence in C++ is that the relative order of volatile accesses must match program order; the compiler may not reorder them with respect to each other.

OTOH it could be that some of the intent of volatile gets lost after translation into PTX. The ptxas optimizing compiler may also reorder instructions. (There is a comparable case: On CPUs the CPU itself is allowed to reorder instructions.)

So for a deeper dive, the memory consistency model chapter of the PTX ISA 8.7 documentation is perhaps a good (but technical) read.

“Compiler” is to be understood as a whole, regardless of how it is structured into phases. Thinking about it, the CUDA compiler somehow must transfer the notion of volatile to ptxas, otherwise ptxas would happily map a simple integer object to a register if the modifier had somehow been lost at that point. I have not checked how it does this.

[Later:]
A quick check reveals that NVVM uses the .volatile attribute to mark memory accesses to volatile data: ld.volatile.global.u32, thus passing this information to ptxas. I used this test kernel to check:

__global__ void kernel (volatile int *src, int *dest)
{
    int t;
    t = (*src);
    t = t * (*src);
    *dest = t;
}
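One way to reproduce this check, assuming nvcc is on the PATH (the file names here are mine):

```shell
# Emit PTX for the test kernel and look for volatile-qualified loads.
nvcc -ptx -o kernel.ptx kernel.cu
grep "ld.volatile" kernel.ptx    # look for e.g. ld.volatile.global.u32
```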

Out of order CPUs may issue instructions to execution units in other than program order, however instructions are retired in program order, and only at the retire stage are the changes committed to architectural state (that is, visible to the programmer). This is an application of the “as-if” rule: the behavior of the program must be as if the processor was following the abstract execution model.

You are totally right,
cudaStreamSynchronize() blocked my host thread until the kernel terminated, and that was the source of my problem.
Thank you!

Maybe you can also help me understand this other test case… I am trying to alternate between two persistent kernels using their own active flags. Do you have an idea how I would go about this? My current implementation refuses to acknowledge the second kernel…

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h> // For sleep

// Kernel 1 - alternates with Kernel 2
__global__ void persistentKernel1(int *counter1, int *counter2, volatile int *active1, volatile int *active2, volatile int *doneFlag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Kernel 1 launched\n");
        
        while (atomicAdd((int*)doneFlag, 0) == 0) {
            __threadfence();
            int isActive = atomicAdd((int*)active1, 0);
            
            if (isActive == 0) {
                for (volatile int i = 0; i < 1000000; i++); // Idle delay
                continue;
            }
            
            if (isActive == 1) {
                int val = atomicAdd(counter1, 1);
                if (val % 10000 == 0) {
                    printf("Kernel 1 counter: %d\n", val);
                }
                for (volatile int i = 0; i < 100000; i++); // Work delay
                
                // After some work, deactivate self and activate Kernel 2
                if (val % 50000 == 0) { // Alternate every 50,000 increments
                    atomicExch((int*)active1, 0); // Deactivate self
                    atomicExch((int*)active2, 1); // Activate Kernel 2
                    __threadfence(); // Ensure visibility
                }
            }
        }
        printf("Kernel 1 terminating\n");
    }
}

// Kernel 2 - alternates with Kernel 1
__global__ void persistentKernel2(int *counter1, int *counter2, volatile int *active1, volatile int *active2, volatile int *doneFlag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Kernel 2 launched\n");
        
        while (atomicAdd((int*)doneFlag, 0) == 0) {
            __threadfence();
            int isActive = atomicAdd((int*)active2, 0);
            
            if (isActive == 0) {
                for (volatile int i = 0; i < 1000000; i++); // Idle delay
                continue;
            }
            
            if (isActive == 1) {
                int val = atomicAdd(counter2, 1);
                if (val % 10000 == 0) {
                    printf("Kernel 2 counter: %d\n", val);
                }
                for (volatile int i = 0; i < 100000; i++); // Work delay
                
                // After some work, deactivate self and activate Kernel 1
                if (val % 50000 == 0) { // Alternate every 50,000 increments
                    atomicExch((int*)active2, 0); // Deactivate self
                    atomicExch((int*)active1, 1); // Activate Kernel 1
                    __threadfence(); // Ensure visibility
                }
            }
        }
        printf("Kernel 2 terminating\n");
    }
}

// Error checking helper
void checkCudaError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    // Device info
    cudaDeviceProp prop;
    checkCudaError(cudaGetDeviceProperties(&prop, 0), "Get device properties failed");
    printf("Running on device: %s\n", prop.name);
    printf("Concurrent kernel execution: %s\n", prop.concurrentKernels ? "Supported" : "Not supported");
    
    // Set device flags
    checkCudaError(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync | cudaDeviceMapHost | cudaDeviceLmemResizeToMax),
                   "Set device flags failed");

    // Allocate device memory (not mapped, since kernels control flags)
    int *d_counter1, *d_counter2, *d_active1, *d_active2, *d_doneFlag;
    checkCudaError(cudaMalloc(&d_counter1, sizeof(int)), "Counter1 alloc failed");
    checkCudaError(cudaMalloc(&d_counter2, sizeof(int)), "Counter2 alloc failed");
    checkCudaError(cudaMalloc(&d_active1, sizeof(int)), "Active1 alloc failed");
    checkCudaError(cudaMalloc(&d_active2, sizeof(int)), "Active2 alloc failed");
    checkCudaError(cudaMalloc(&d_doneFlag, sizeof(int)), "Done alloc failed");

    // Allocate host memory for results
    int *h_counter1, *h_counter2;
    checkCudaError(cudaHostAlloc(&h_counter1, sizeof(int), cudaHostAllocMapped), "Host counter1 alloc failed");
    checkCudaError(cudaHostAlloc(&h_counter2, sizeof(int), cudaHostAllocMapped), "Host counter2 alloc failed");

    // Initialize memory
    checkCudaError(cudaMemset(d_counter1, 0, sizeof(int)), "Counter1 memset failed");
    checkCudaError(cudaMemset(d_counter2, 0, sizeof(int)), "Counter2 memset failed");
    checkCudaError(cudaMemset(d_active1, 1, sizeof(int)), "Active1 memset failed"); // Start with Kernel 1 active (NB: cudaMemset writes bytes, so this stores 0x01010101, not 1)
    checkCudaError(cudaMemset(d_active2, 0, sizeof(int)), "Active2 memset failed");
    checkCudaError(cudaMemset(d_doneFlag, 0, sizeof(int)), "Done memset failed");

    // Create streams
    cudaStream_t stream1, stream2;
    checkCudaError(cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking), "Stream1 creation failed");
    checkCudaError(cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking), "Stream2 creation failed");

    // Launch kernels
    printf("Launching kernels...\n");
    persistentKernel1<<<1, 1, 0, stream1>>>(d_counter1, d_counter2, d_active1, d_active2, d_doneFlag);
    checkCudaError(cudaGetLastError(), "Kernel 1 launch failed");
    
    persistentKernel2<<<1, 1, 0, stream2>>>(d_counter1, d_counter2, d_active1, d_active2, d_doneFlag);
    checkCudaError(cudaGetLastError(), "Kernel 2 launch failed");

    // Let kernels run and alternate for a while
    printf("Kernels launched, letting them alternate for 15 seconds...\n");
    sleep(15);

    // Terminate kernels
    printf("\nTerminating kernels...\n");
    checkCudaError(cudaMemset(d_doneFlag, 1, sizeof(int)), "Set done flag failed");
    
    // Wait for completion
    checkCudaError(cudaDeviceSynchronize(), "Final sync failed");

    // Copy results to host
    checkCudaError(cudaMemcpy(h_counter1, d_counter1, sizeof(int), cudaMemcpyDeviceToHost), "Copy counter1 failed");
    checkCudaError(cudaMemcpy(h_counter2, d_counter2, sizeof(int), cudaMemcpyDeviceToHost), "Copy counter2 failed");

    // Print results
    printf("\nResults:\n");
    printf("Kernel 1 total counter: %d\n", *h_counter1);
    printf("Kernel 2 total counter: %d\n", *h_counter2);

    // Cleanup
    checkCudaError(cudaFree(d_counter1), "Free counter1 failed");
    checkCudaError(cudaFree(d_counter2), "Free counter2 failed");
    checkCudaError(cudaFree(d_active1), "Free active1 failed");
    checkCudaError(cudaFree(d_active2), "Free active2 failed");
    checkCudaError(cudaFree(d_doneFlag), "Free done failed");
    checkCudaError(cudaFreeHost(h_counter1), "Free host counter1 failed");
    checkCudaError(cudaFreeHost(h_counter2), "Free host counter2 failed");
    checkCudaError(cudaStreamDestroy(stream1), "Destroy stream1 failed");
    checkCudaError(cudaStreamDestroy(stream2), "Destroy stream2 failed");

    printf("Test completed successfully\n");
    return 0;
}

Kernel-to-kernel bidirectional communication (i.e. necessitating concurrent execution) is a frowned-upon design pattern, because CUDA doesn’t guarantee kernel execution concurrency.

Putting that aside, CUDA lazy loading will prevent your code from running “correctly” i.e. as desired.
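As an aside, lazy loading can also be disabled wholesale via the environment (available since CUDA 11.7), rather than worked around in code:

```shell
# Force eager module loading for this run only.
CUDA_MODULE_LOADING=EAGER ./t4
```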

The following modifications to work around lazy-loading artifacts seem to give sensible output. In a nutshell, we add a boolean parameter to each kernel to specify an early exit, then launch both kernels once with early exit requested. This allows lazy loading to complete. Thereafter the behavior seems sensible, with the code you have already written:

$ cat t4.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h> // For sleep

// Kernel 1 - alternates with Kernel 2
__global__ void persistentKernel1(bool exit, int *counter1, int *counter2, volatile int *active1, volatile int *active2, volatile int *doneFlag) {
    if (exit) return; // Warm-up launch path: return immediately so lazy loading can complete
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Kernel 1 launched\n");

        while (atomicAdd((int*)doneFlag, 0) == 0) {
            __threadfence();
            int isActive = atomicAdd((int*)active1, 0);

            if (isActive == 0) {
                for (volatile int i = 0; i < 1000000; i++); // Idle delay
                continue;
            }

            if (isActive == 1) {
                int val = atomicAdd(counter1, 1);
                if (val % 10000 == 0) {
                    printf("Kernel 1 counter: %d\n", val);
                }
                for (volatile int i = 0; i < 100000; i++); // Work delay

                // After some work, deactivate self and activate Kernel 2
                if (val % 50000 == 0) { // Alternate every 50,000 increments
                    atomicExch((int*)active1, 0); // Deactivate self
                    atomicExch((int*)active2, 1); // Activate Kernel 2
                    __threadfence(); // Ensure visibility
                }
            }
        }
        printf("Kernel 1 terminating\n");
    }
}

// Kernel 2 - alternates with Kernel 1
__global__ void persistentKernel2(bool exit, int *counter1, int *counter2, volatile int *active1, volatile int *active2, volatile int *doneFlag) {
    if (exit) return; // Warm-up launch path: return immediately so lazy loading can complete
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Kernel 2 launched\n");

        while (atomicAdd((int*)doneFlag, 0) == 0) {
            __threadfence();
            int isActive = atomicAdd((int*)active2, 0);

            if (isActive == 0) {
                for (volatile int i = 0; i < 1000000; i++); // Idle delay
                continue;
            }

            if (isActive == 1) {
                int val = atomicAdd(counter2, 1);
                if (val % 10000 == 0) {
                    printf("Kernel 2 counter: %d\n", val);
                }
                for (volatile int i = 0; i < 100000; i++); // Work delay

                // After some work, deactivate self and activate Kernel 1
                if (val % 50000 == 0) { // Alternate every 50,000 increments
                    atomicExch((int*)active2, 0); // Deactivate self
                    atomicExch((int*)active1, 1); // Activate Kernel 1
                    __threadfence(); // Ensure visibility
                }
            }
        }
        printf("Kernel 2 terminating\n");
    }
}

// Error checking helper
void checkCudaError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    // Device info
    cudaDeviceProp prop;
    checkCudaError(cudaGetDeviceProperties(&prop, 0), "Get device properties failed");
    printf("Running on device: %s\n", prop.name);
    printf("Concurrent kernel execution: %s\n", prop.concurrentKernels ? "Supported" : "Not supported");

    // Set device flags
    checkCudaError(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync | cudaDeviceMapHost | cudaDeviceLmemResizeToMax),
                   "Set device flags failed");

    // Allocate device memory (not mapped, since kernels control flags)
    int *d_counter1, *d_counter2, *d_active1, *d_active2, *d_doneFlag;
    checkCudaError(cudaMalloc(&d_counter1, sizeof(int)), "Counter1 alloc failed");
    checkCudaError(cudaMalloc(&d_counter2, sizeof(int)), "Counter2 alloc failed");
    checkCudaError(cudaMalloc(&d_active1, sizeof(int)), "Active1 alloc failed");
    checkCudaError(cudaMalloc(&d_active2, sizeof(int)), "Active2 alloc failed");
    checkCudaError(cudaMalloc(&d_doneFlag, sizeof(int)), "Done alloc failed");

    // Allocate host memory for results
    int *h_counter1, *h_counter2;
    checkCudaError(cudaHostAlloc(&h_counter1, sizeof(int), cudaHostAllocMapped), "Host counter1 alloc failed");
    checkCudaError(cudaHostAlloc(&h_counter2, sizeof(int), cudaHostAllocMapped), "Host counter2 alloc failed");

    // Initialize memory
    checkCudaError(cudaMemset(d_counter1, 0, sizeof(int)), "Counter1 memset failed");
    checkCudaError(cudaMemset(d_counter2, 0, sizeof(int)), "Counter2 memset failed");
    checkCudaError(cudaMemset(d_active1, 1, sizeof(int)), "Active1 memset failed"); // Start with Kernel 1 active (NB: cudaMemset writes bytes, so this stores 0x01010101, not 1)
    checkCudaError(cudaMemset(d_active2, 0, sizeof(int)), "Active2 memset failed");
    checkCudaError(cudaMemset(d_doneFlag, 0, sizeof(int)), "Done memset failed");

    // Create streams
    cudaStream_t stream1, stream2;
    checkCudaError(cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking), "Stream1 creation failed");
    checkCudaError(cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking), "Stream2 creation failed");

    // Launch kernels
    persistentKernel1<<<1, 1, 0, stream1>>>(true, d_counter1, d_counter2, d_active1, d_active2, d_doneFlag);
    persistentKernel2<<<1, 1, 0, stream2>>>(true, d_counter1, d_counter2, d_active1, d_active2, d_doneFlag);
    cudaDeviceSynchronize();
    printf("Launching kernels...\n");
    persistentKernel1<<<1, 1, 0, stream1>>>(false, d_counter1, d_counter2, d_active1, d_active2, d_doneFlag);
    checkCudaError(cudaGetLastError(), "Kernel 1 launch failed");

    persistentKernel2<<<1, 1, 0, stream2>>>(false, d_counter1, d_counter2, d_active1, d_active2, d_doneFlag);
    checkCudaError(cudaGetLastError(), "Kernel 2 launch failed");

    // Let kernels run and alternate for a while
    printf("Kernels launched, letting them alternate for 15 seconds...\n");
    sleep(15);

    // Terminate kernels
    printf("\nTerminating kernels...\n");
    checkCudaError(cudaMemset(d_doneFlag, 1, sizeof(int)), "Set done flag failed");

    // Wait for completion
    checkCudaError(cudaDeviceSynchronize(), "Final sync failed");

    // Copy results to host
    checkCudaError(cudaMemcpy(h_counter1, d_counter1, sizeof(int), cudaMemcpyDeviceToHost), "Copy counter1 failed");
    checkCudaError(cudaMemcpy(h_counter2, d_counter2, sizeof(int), cudaMemcpyDeviceToHost), "Copy counter2 failed");

    // Print results
    printf("\nResults:\n");
    printf("Kernel 1 total counter: %d\n", *h_counter1);
    printf("Kernel 2 total counter: %d\n", *h_counter2);

    // Cleanup
    checkCudaError(cudaFree(d_counter1), "Free counter1 failed");
    checkCudaError(cudaFree(d_counter2), "Free counter2 failed");
    checkCudaError(cudaFree(d_active1), "Free active1 failed");
    checkCudaError(cudaFree(d_active2), "Free active2 failed");
    checkCudaError(cudaFree(d_doneFlag), "Free done failed");
    checkCudaError(cudaFreeHost(h_counter1), "Free host counter1 failed");
    checkCudaError(cudaFreeHost(h_counter2), "Free host counter2 failed");
    checkCudaError(cudaStreamDestroy(stream1), "Destroy stream1 failed");
    checkCudaError(cudaStreamDestroy(stream2), "Destroy stream2 failed");

    printf("Test completed successfully\n");
    return 0;
}
$ nvcc -o t4 t4.cu -arch=sm_75
$ ./t4
Running on device: NVIDIA A40
Concurrent kernel execution: Supported
Launching kernels...
Kernels launched, letting them alternate for 15 seconds...
Kernel 1 launched
Kernel 2 launched

Terminating kernels...
Kernel 2 terminating
Kernel 1 terminating

Results:
Kernel 1 total counter: 0
Kernel 2 total counter: 0
Test completed successfully
$

You can always merge your kernels into one and use cudaLaunchCooperativeKernel, which ensures that all blocks are resident on the GPU concurrently; each block can then select which function to perform according to its block ID, SM ID, or warp ID.

That would give you some kind of guarantee. (However, if there are not enough resources, the launch would fail.)

Instead of two streams, the now-single kernel would be invoked on a single stream.
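A minimal sketch of that merged-kernel idea (all names are mine and untested; note the single-reader broadcast of the done flag, so that no block can exit the loop while another is still waiting at a grid sync):

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__device__ int g_stop; // broadcast copy of the done flag

// One persistent kernel; even/odd blocks play the roles of the former
// persistentKernel1/persistentKernel2, selected by block ID.
__global__ void mergedPersistentKernel(int *counter1, int *counter2,
                                       volatile int *doneFlag) {
    cg::grid_group grid = cg::this_grid();
    bool roleOne = (blockIdx.x % 2 == 0);
    while (true) {
        if (grid.thread_rank() == 0) g_stop = *doneFlag; // single reader
        grid.sync();                 // broadcast the decision...
        if (g_stop) break;           // ...so every block agrees on exit
        if (threadIdx.x == 0)
            atomicAdd(roleOne ? counter1 : counter2, 1);
        grid.sync();                 // keep reads and writes of g_stop apart
    }
}

// Host side: the cooperative launch guarantees co-residency of all blocks
// (or fails outright if resources are insufficient).
void launchMerged(int *d_c1, int *d_c2, int *d_done, cudaStream_t stream) {
    void *args[] = { &d_c1, &d_c2, &d_done };
    cudaLaunchCooperativeKernel((void*)mergedPersistentKernel,
                                dim3(2), dim3(32), args, 0, stream);
}
```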
