Stream sync behaving like a device sync on first use of device API fns printf, cudaMalloc etc

I have observed an unexpected CUDA behaviour which we can’t find documented, namely that certain CUDA on device API calls (printf, cudaMalloc etc) seem to cause a non-blocking stream synchronisation on an unrelated stream to behave like device synchronisation. This only happens the first time they are used. We noticed the problem when there is a persistent kernel running and the stream sync never exits. The workaround is to ensure that you exercise the problem APIs at least once and sync before the persistent kernel runs.

Specifically, I replicated it like this:

1 Launching WarmUpKernel to use cudaMalloc once (needed to avoid a hang below)
2 Launching Dummy persistent kernel
3 Launching DummyShortKernel in non-blobking stream B (also uses cudaMalloc)
4 cudaStreamSynchronize on stream B  <- Hangs if no step 1

I replicated this in a standalone visual studio project (relocatable device code on)

#include <cuda_runtime_api.h>
#include <stdio.h>

// Persistent kernel - infinite loop, never returns.
__global__ void DummyPk()
{
    while (1) {   __nanosleep(1000);   }
}

// This kernel uess the device APIs
__global__ void DummyShortKernel()
{
    void* pTemp;
    cudaMalloc(&pTemp, sizeof(int));
    cudaFree(pTemp);
}

// this kernel exercises the deice APIs once before they are used for real. 
__global__ void WarmUpKernel()
{
    void* pTemp;
    cudaMalloc(&pTemp, sizeof(int));
    cudaFree(pTemp);
}

int main()
{
    cudaStream_t StreamA = 0;
    cudaStream_t StreamB = 0;
    const bool bRunWarmup = false;

    CUDA_CHECK(cudaStreamCreateWithFlags(&StreamA, cudaStreamNonBlocking));
    CUDA_CHECK(cudaStreamCreateWithFlags(&StreamB, cudaStreamNonBlocking));

    // Touch the function attributes to force JIT compilation now
    cudaFuncAttributes attr;
    CUDA_CHECK(cudaFuncGetAttributes(&attr, &DummyPk));
    CUDA_CHECK(cudaFuncGetAttributes(&attr, &DummyShortKernel));
    CUDA_CHECK(cudaFuncGetAttributes(&attr, &WarmUpKernel));

    // launch warmup kernel to exercise APIs first time
    if (bRunWarmup)
    {
        printf("Launching WarmUpKernel...\n");
        WarmUpKernel << <1, 1, 0, StreamA >> > ();
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaStreamSynchronize(StreamA));
    }

    // Launch persistent kernel which never exits.
    printf("Launching DummyPk (never exits)...\n");
    DummyPk << <1, 1, 0, StreamA >> > ();
    CUDA_CHECK(cudaGetLastError());

    printf("Launching DummyShortKernel on StreamB...\n");
    DummyShortKernel << <1, 1, 0, StreamB >> > ();
    CUDA_CHECK(cudaGetLastError());

    // Without the warmup Kernel call this stream sync hangs! Why?
    printf("Calling cudaStreamSynchronize(StreamB)\n");
    CUDA_CHECK(cudaStreamSynchronize(StreamB));  // Coda hangs here iff bRunWarmup is false
    printf("cudaStreamSynchronize(StreamB) returned.\n");

    printf("Done.\n");
    return 0;
}

With const bool bRunWarmup = false;

Launching DummyPk (never exits)...
Launching DummyShortKernel on StreamB...
Calling cudaStreamSynchronize(StreamB)

and there the program hangs.

With const bool bRunWarmup = true;

Launching WarmUpKernel...
CDP warm-up
Launching DummyPk (never exits)...
Launching DummyShortKernel on StreamB...
Calling cudaStreamSynchronize(StreamB)
In Short kernel
cudaStreamSynchronize(StreamB) returned.
Done.

It seems to also apply to these fns: printf(), cudaMemsetAsync, cudaMemcpyAsync, cudaStreamCreateWithFlags, cudaStreamDestroy.

Can anyone explain this behaviour or find where it is documented?

It might be the synchronization associated with lazy loading.

Thanks for the speedy reply. I should have said we’re using cuda 12.9 and the environment variable CUDA_MODULE_LOADING=EAGER is set. As we’re working on a real time application we need to avoid lazy loading.

The code runs fine for me in both cases with driver 580.95.05 with cuda 12.8 and cuda 13.0 . (gpu arch sm_86)

Thanks for trying. We are also on sm_86 (rtx3080 or 3050). If possible could you give me your environment vars (i.e. set (win) or export (linux) output) and nvcc command so I can see if there are any relevant differences. It seems unlikely that just 12.9 has the issue if 12.8 and 13.0 are OK but I’ll have to wait until we go to 13.0 to test that.

I ran on cc8.9 (L4 GPU) on Linux with CUDA 13.0 and was able to reproduce the hang. I also ran on cc7.5 on godbolt with CUDA 13.0 and CUDA 13.1 and was able to reproduce the hang (although it requires a little deduction on godbolt - godbolt apparently times out waiting for your code after ~60 seconds). My cc8.9 case was using driver 580.65.06. If time permits I will try switching that machine to driver 580.95.05.

Hi Robert, Is there any progress or timeline for investigating this issue?

I haven’t made any further progress. My suggestion would be to file a bug. I’m not really convinced that it is a bug. I have investigated issues like this in the past with persistent kernels along with other “simultaneous” activity, and eventually went to the dev teams because I couldn’t understand the behavior. In at least one case, the response that I got back was that it was expected behavior and the only documentation support was the note that any CUDA API call can have variable latency.

I’m not saying this is that, or anything really, except that it might be a bug or it might not. But by filing a bug, the dev team will usually look at it.

Thanks for filing 6150942. We can replciate the behavior and we are investigating this. We will bring back conclusion when available.

The printf device syscall writes in the to printf FIFO which is sized using cuCtxSetLimit with CU_LIMIT_PRINTF_FIFO_SIZE.

The malloc/free syscalls alloc/free out of the device heap which is sized using cuCtxSetLimit with CU_LIMIT_MALLC_HEAP_SIZE.

Thread local stack memory is referenced by {logical smid, warpid, threadid} into the stack allocation which is sized CU_LIMIT_STACK_SIZE.

CUDA driver lazy allocates most of these buffers. Allocation an reallocation can require synchronization. If you use these syscalls I highly recommend you size and immediately call dummy kernels using these featuers to trigger the allocation of the buffers. Any resizing of these via cuCtxSetLimit can require a context level synchronization.

Nsight Visual Studio Edition CUDA trace and NVIDIA Visual Profiler showed when these were resized. Nsight Systems and Nsight Compute currently do not show and warn on the impact of resizing these resources.

CUDA programming model can support “persistent” kernels but it is not designed to support unending persistency as there are CUDA context level resources that require a synchronization to resize.