GPU Memory Leak in nppiConvert_8u32f_C1R_Ctx Function

Our team is experiencing a continuous increase in GPU memory usage in an image processing application that uses the NPP library. After investigation, we identified nppiConvert_8u32f_C1R_Ctx as the root cause.

Through testing, I confirmed that a memory leak occurs when all of the following conditions are met:

  1. Creating the CUDA stream with the cudaStreamDefault flag

  2. Input parameters where width % 16 != 0 && stride % 16 == 0

  3. Calling the function from multiple threads

When examining the CUDA API call history with Nsight, I observed that in the problematic case, a new stream is being created internally, which does not occur in normal cases. It appears that the internal behavior differs depending on the input parameter sizes.

Could you please explain this behavior? Any insights would be greatly appreciated.

Environment:

  • CUDA: 12.3
  • Driver: 550.163.01
  • OS: Rocky Linux 9.3
  • Compiler: gcc 11.5.0

Reproduction test code:

#include <cuda_runtime.h>
#include <npp.h>
#include <nppi.h>
#include <iostream>
#include <thread>
#include <cstdint>  // uint8_t

void initNppContext(cudaStream_t stream, NppStreamContext& nppCtx) {
    nppCtx.hStream = stream;
    cudaGetDevice(&nppCtx.nCudaDeviceId);
    
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, nppCtx.nCudaDeviceId);
    nppCtx.nMultiProcessorCount = deviceProp.multiProcessorCount;
    nppCtx.nMaxThreadsPerMultiProcessor = deviceProp.maxThreadsPerMultiProcessor;
    nppCtx.nMaxThreadsPerBlock = deviceProp.maxThreadsPerBlock;
    nppCtx.nSharedMemPerBlock = deviceProp.sharedMemPerBlock;
    nppCtx.nCudaDevAttrComputeCapabilityMajor = deviceProp.major;
    nppCtx.nCudaDevAttrComputeCapabilityMinor = deviceProp.minor;
    
    cudaStreamGetFlags(stream, &nppCtx.nStreamFlags);
}

void printMemoryResult(const char* testName, int width, int stride, size_t initialMem, size_t finalMem) {
    // Cast before subtracting to avoid unsigned underflow when usage shrinks.
    double increase = (static_cast<double>(finalMem) - static_cast<double>(initialMem)) / 1024.0 / 1024.0;
    std::cout << testName << " (width=" << width << ", stride=" << stride << ")" << std::endl;
    std::cout << "  Initial: " << (initialMem / 1024.0 / 1024.0) << " MB" << std::endl;
    std::cout << "  Final: " << (finalMem / 1024.0 / 1024.0) << " MB" << std::endl;
    std::cout << "  Increase: " << increase << " MB " 
              << (increase > 5.0 ? "❌" : "✓") << std::endl << std::endl;
}

void testCaseWithoutThread(int width, int height, int stride, int iterations) {
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    size_t initialMem = total - free;
    
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
    
    NppStreamContext nppCtx;
    initNppContext(stream, nppCtx);
    
    for (int i = 0; i < iterations; i++) {
        uint8_t* d_src;
        Npp32f* d_dst;
        cudaMalloc(&d_src, stride * height);
        cudaMalloc(&d_dst, stride * sizeof(Npp32f) * height);
        
        NppiSize imgSize = {width, height};
        nppiConvert_8u32f_C1R_Ctx(d_src, stride, d_dst, stride * sizeof(Npp32f), imgSize, nppCtx);
        cudaStreamSynchronize(stream);
        
        cudaFree(d_src);
        cudaFree(d_dst);
    }
    
    cudaStreamDestroy(stream);
    
    cudaMemGetInfo(&free, &total);
    printMemoryResult("Without thread", width, stride, initialMem, total - free);
}

void testCaseWithThread(int width, int height, int stride, int iterations) {
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    size_t initialMem = total - free;
    
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
    
    NppStreamContext nppCtx;
    initNppContext(stream, nppCtx);
    
    for (int i = 0; i < iterations; i++) {
        std::thread t([&]() {
            uint8_t* d_src;
            Npp32f* d_dst;
            cudaMalloc(&d_src, stride * height);
            cudaMalloc(&d_dst, stride * sizeof(Npp32f) * height);
            
            NppiSize imgSize = {width, height};
            nppiConvert_8u32f_C1R_Ctx(d_src, stride, d_dst, stride * sizeof(Npp32f), imgSize, nppCtx);
            cudaStreamSynchronize(stream);
            
            cudaFree(d_src);
            cudaFree(d_dst);
        });
        t.join();
    }
    
    cudaStreamDestroy(stream);
    
    cudaMemGetInfo(&free, &total);
    printMemoryResult("Thread per call", width, stride, initialMem, total - free);
}

int main() {
    const int iterations = 100;

    // No leak: stride % 16 != 0
    std::cout << "=== Case 1: width=32, height=35, stride=36 ===" << std::endl;
    testCaseWithoutThread(32, 35, 36, iterations);
    testCaseWithThread(32, 35, 36, iterations);
    
    // Leak in thread test: width % 16 != 0 && stride % 16 == 0
    std::cout << "=== Case 2: width=31, height=31, stride=32 ===" << std::endl;
    testCaseWithoutThread(31, 31, 32, iterations);
    testCaseWithThread(31, 31, 32, iterations);
    
    return 0;
}
Output:

=== Case 1: width=32, height=35, stride=36 ===
Without thread (width=32, stride=36)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

Thread per call (width=32, stride=36)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

=== Case 2: width=31, height=31, stride=32 ===
Without thread (width=31, stride=32)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

Thread per call (width=31, stride=32)
  Initial: 192.188 MB
  Final: 238.188 MB
  Increase: 46 MB ❌

I suggest filing a bug.

Thanks for filing a ticket. 5766830 has been received, and we can replicate the behavior. We will report the conclusion here once the internal investigation is complete.

Hi, this is not a memory leak. All device memory is actually released when the app exits and the CUDA context is destroyed.

The observed leak-like behavior is due to additional streams that are not destroyed until the app exits. If you change
cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
to
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
no 'leak' will be observed in your case.

I experimented with 10 iterations. In your original case, both the with-thread and without-thread variants actually use a blocking stream alongside the default stream 0/NULL. Here is how they look in Nsight (screenshots not reproduced here):

With thread: the NPP kernels are split across 11 streams. There are two kernels; one is launched on the default stream (stream 14), and the other kernel is launched on a newly created stream in each iteration.

Without thread: effectively the same; the work is still spread across two streams, because NPP launches one kernel on stream 0 while the user code's blocking stream runs the other kernel.

When we change to a non-blocking stream, the with-thread case looks as follows (Nsight screenshot not reproduced here), and your original code produces this output:

ni@node2:~/yni/Customerbug/5766830$ ./a.out
=== Case 1: width=32, height=35, stride=36 ===
Without thread (width=32, stride=36)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Thread per call (width=32, stride=36)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

=== Case 2: width=31, height=31, stride=32 ===
Without thread (width=31, stride=32)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Thread per call (width=31, stride=32)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Hope this explains the behavior.
The main difference between cudaStreamCreateWithFlags(&stream, cudaStreamDefault) and cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) is whether the created stream participates in implicit synchronization with the legacy default stream (stream 0 / NULL stream).

  • Using cudaStreamDefault (value 0) creates a blocking stream. A blocking stream implicitly synchronizes with the legacy default stream: work in the default stream waits for all previously issued work in blocking streams to finish, and blocking streams also wait for work in the default stream. This can unintentionally serialize kernels and make them appear to run on or be ordered by the default stream instead of behaving as fully independent streams.
  • Using cudaStreamNonBlocking creates a non‑blocking stream. Non‑blocking streams do not implicitly synchronize with the default stream. Kernels in such a stream can run concurrently with kernels in the default stream and other non‑blocking streams, and there is no hidden ordering between them unless you add explicit synchronization.

In practice, when you want true concurrency and clean separation between your own streams and any library code that may use the default stream (for example, NPP), you should prefer cudaStreamNonBlocking.

Thank you for the detailed explanation. I understand that using cudaStreamNonBlocking is the better practice in production, and that with cudaStreamDefault, streams accumulate but the memory is properly released upon context destruction.

Beyond just understanding the phenomenon itself, I am very curious about the underlying architecture and the design intent behind this behavior. Gaining insight into these internal workings would be incredibly helpful for my application development.

Could you please clarify the following points?

  1. Implicit Creation: Why are streams (or related resources) implicitly created when using cudaStreamDefault?

  2. Resource Persistence: Why do these created streams/resources persist in memory instead of being released immediately after their execution is complete?

  3. Parameter Dependency: Even when using cudaStreamDefault, why does this internal stream creation occur conditionally based on specific parameters (e.g., width, stride)?

Any guidance or explanation you could provide would be incredibly helpful. Thank you again for your valuable time.