GPU Memory Leak in nppiConvert_8u32f_C1R_Ctx Function

Our team is experiencing a continuous increase in GPU memory usage in an image processing application that uses the NPP library. After investigation, we identified nppiConvert_8u32f_C1R_Ctx as the root cause.

Through testing, I confirmed that a memory leak occurs when all of the following conditions are met:

  1. Using cudaStreamDefault as the CUDA stream

  2. Input parameters where width % 16 != 0 && stride % 16 == 0

  3. Calling the function from multiple threads

When examining the CUDA API call history with Nsight, I observed that in the problematic case, a new stream is being created internally, which does not occur in normal cases. It appears that the internal behavior differs depending on the input parameter sizes.

Could you please explain this behavior? Any insights would be greatly appreciated.

Environment:

  • CUDA: 12.3
  • Driver: 550.163.01
  • OS: Rocky Linux 9.3
  • Compiler: gcc 11.5.0

Reproduction test code:

#include <cuda_runtime.h>
#include <npp.h>
#include <nppi.h>
#include <cstdint>   // uint8_t
#include <iostream>
#include <thread>

void initNppContext(cudaStream_t stream, NppStreamContext& nppCtx) {
    nppCtx.hStream = stream;
    cudaGetDevice(&nppCtx.nCudaDeviceId);
    
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, nppCtx.nCudaDeviceId);
    nppCtx.nMultiProcessorCount = deviceProp.multiProcessorCount;
    nppCtx.nMaxThreadsPerMultiProcessor = deviceProp.maxThreadsPerMultiProcessor;
    nppCtx.nMaxThreadsPerBlock = deviceProp.maxThreadsPerBlock;
    nppCtx.nSharedMemPerBlock = deviceProp.sharedMemPerBlock;
    
    cudaStreamGetFlags(stream, &nppCtx.nStreamFlags);
}

void printMemoryResult(const char* testName, int width, int stride, size_t initialMem, size_t finalMem) {
    // Cast before subtracting so a final value below the initial one cannot underflow size_t
    double increase = (static_cast<double>(finalMem) - static_cast<double>(initialMem)) / (1024.0 * 1024.0);
    std::cout << testName << " (width=" << width << ", stride=" << stride << ")" << std::endl;
    std::cout << "  Initial: " << (initialMem / 1024.0 / 1024.0) << " MB" << std::endl;
    std::cout << "  Final: " << (finalMem / 1024.0 / 1024.0) << " MB" << std::endl;
    std::cout << "  Increase: " << increase << " MB " 
              << (increase > 5.0 ? "❌" : "✓") << std::endl << std::endl;
}

void testCaseWithoutThread(int width, int height, int stride, int iterations) {
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    size_t initialMem = total - free;
    
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
    
    NppStreamContext nppCtx;
    initNppContext(stream, nppCtx);
    
    for (int i = 0; i < iterations; i++) {
        uint8_t* d_src;
        Npp32f* d_dst;
        cudaMalloc(&d_src, stride * height);
        cudaMalloc(&d_dst, stride * sizeof(Npp32f) * height);
        
        NppiSize imgSize = {width, height};
        nppiConvert_8u32f_C1R_Ctx(d_src, stride, d_dst, stride * sizeof(Npp32f), imgSize, nppCtx);
        cudaStreamSynchronize(stream);
        
        cudaFree(d_src);
        cudaFree(d_dst);
    }
    
    cudaStreamDestroy(stream);
    
    cudaMemGetInfo(&free, &total);
    printMemoryResult("Without thread", width, stride, initialMem, total - free);
}

void testCaseWithThread(int width, int height, int stride, int iterations) {
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    size_t initialMem = total - free;
    
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
    
    NppStreamContext nppCtx;
    initNppContext(stream, nppCtx);
    
    for (int i = 0; i < iterations; i++) {
        std::thread t([&]() {
            uint8_t* d_src;
            Npp32f* d_dst;
            cudaMalloc(&d_src, stride * height);
            cudaMalloc(&d_dst, stride * sizeof(Npp32f) * height);
            
            NppiSize imgSize = {width, height};
            nppiConvert_8u32f_C1R_Ctx(d_src, stride, d_dst, stride * sizeof(Npp32f), imgSize, nppCtx);
            cudaStreamSynchronize(stream);
            
            cudaFree(d_src);
            cudaFree(d_dst);
        });
        t.join();
    }
    
    cudaStreamDestroy(stream);
    
    cudaMemGetInfo(&free, &total);
    printMemoryResult("Thread per call", width, stride, initialMem, total - free);
}

int main() {
    const int iterations = 100;

    // No leak: stride % 16 != 0
    std::cout << "=== Case 1: width=32, height=35, stride=36 ===" << std::endl;
    testCaseWithoutThread(32, 35, 36, iterations);
    testCaseWithThread(32, 35, 36, iterations);
    
    // Leak in thread test: width % 16 != 0 && stride % 16 == 0
    std::cout << "=== Case 2: width=31, height=31, stride=32 ===" << std::endl;
    testCaseWithoutThread(31, 31, 32, iterations);
    testCaseWithThread(31, 31, 32, iterations);
    
    return 0;
}
=== Case 1: width=32, height=35, stride=36 ===
Without thread (width=32, stride=36)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

Thread per call (width=32, stride=36)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

=== Case 2: width=31, height=31, stride=32 ===
Without thread (width=31, stride=32)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

Thread per call (width=31, stride=32)
  Initial: 192.188 MB
  Final: 238.188 MB
  Increase: 46 MB ❌

I suggest filing a bug.

Thanks for filing a ticket. Bug 5766830 has been received and we can replicate the behavior. We will report the conclusion here once the internal investigation is complete.

Hi, this is not a memory leak. All device memory is actually released when the app exits and the CUDA context is destroyed.

The leak-like behavior occurs because the additional streams are not destroyed until the app exits. If you change
cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
to
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
no 'leak' will be observed in your case.

I experimented with 10 iterations. In your original case, both the with-thread and without-thread tests actually use a blocking stream alongside the default stream 0/NULL. Here is how they look in the profiler:

With thread: the NPP kernels are split across 11 streams. There are two kernels; one is launched on the default stream (stream 14), and the other kernel is launched on a new stream in each iteration.

Without thread: effectively the same, still spread across two streams, because one NPP kernel uses stream 0 while the other runs on the blocking stream created by your code.

When we change to a non-blocking stream, the with-thread case no longer shows the extra streams, and your original code runs like this:

ni@node2:~/yni/Customerbug/5766830$ ./a.out
=== Case 1: width=32, height=35, stride=36 ===
Without thread (width=32, stride=36)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Thread per call (width=32, stride=36)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

=== Case 2: width=31, height=31, stride=32 ===
Without thread (width=31, stride=32)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Thread per call (width=31, stride=32)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Hope this explains the behavior.
The main difference between cudaStreamCreateWithFlags(&stream, cudaStreamDefault) and cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) is whether the created stream participates in implicit synchronization with the legacy default stream (stream 0 / NULL stream).

  • Using cudaStreamDefault (value 0) creates a blocking stream. A blocking stream implicitly synchronizes with the legacy default stream: work in the default stream waits for all previously issued work in blocking streams to finish, and blocking streams also wait for work in the default stream. This can unintentionally serialize kernels and make them appear to run on or be ordered by the default stream instead of behaving as fully independent streams.
  • Using cudaStreamNonBlocking creates a non‑blocking stream. Non‑blocking streams do not implicitly synchronize with the default stream. Kernels in such a stream can run concurrently with kernels in the default stream and other non‑blocking streams, and there is no hidden ordering between them unless you add explicit synchronization.

In practice, when you want true concurrency and clean separation between your own streams and any library code that may use the default stream (for example, NPP), you should prefer cudaStreamNonBlocking.
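To make the distinction concrete, here is a minimal host-side fragment (error checking omitted; variable names are mine, and this assumes a CUDA device is present, so treat it as a sketch rather than a runnable test):

```cpp
#include <cuda_runtime.h>

void streamFlagDemo() {
    // Blocking stream: flag value 0; implicitly ordered against the
    // legacy default stream (stream 0 / NULL).
    cudaStream_t blocking;
    cudaStreamCreateWithFlags(&blocking, cudaStreamDefault);

    // Non-blocking stream: no implicit synchronization with stream 0.
    cudaStream_t concurrent;
    cudaStreamCreateWithFlags(&concurrent, cudaStreamNonBlocking);

    // If ordering against default-stream work is still required, make it
    // explicit with an event instead of relying on blocking semantics.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    cudaEventRecord(done, 0);                  // mark a point in the default stream
    cudaStreamWaitEvent(concurrent, done, 0);  // 'concurrent' waits only here

    cudaEventDestroy(done);
    cudaStreamDestroy(blocking);
    cudaStreamDestroy(concurrent);
}
```

The event-based wait gives you the one ordering point you actually need, without the hidden bidirectional serialization that a blocking stream imposes.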

Thank you for the detailed explanation. I understand that using cudaStreamNonBlocking is the better practice in production, and that with cudaStreamDefault, streams accumulate but the memory is properly released upon context destruction.

Beyond just understanding the phenomenon itself, I am very curious about the underlying architecture and the design intent behind this behavior. Gaining insight into these internal workings would be incredibly helpful for my application development.

Could you please clarify the following points?

  1. Implicit Creation: Why are streams (or related resources) implicitly created when using cudaStreamDefault?

  2. Resource Persistence: Why do these created streams/resources persist in memory instead of being released immediately after their execution is complete?

  3. Parameter Dependency: Even when using cudaStreamDefault, why does this internal stream creation occur conditionally based on specific parameters (e.g., width, stride)?

Any guidance or explanation you could provide would be incredibly helpful. Thank you again for your valuable time.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

The issue is a resource persistence conflict between NPP's internal optimization strategies and your dynamic threading model.

  1. Trigger condition:
  • When processing images with unaligned widths (e.g., width % 32 != 0) on a blocking stream (cudaStreamDefault), NPP takes a more complex optimization path.
  • It splits the operation into two kernels (one for the aligned body, one for the edge) and implicitly spawns internal auxiliary streams to synchronize these kernels with the global legacy default stream.
  2. The behavior (resource persistence):
  • NPP caches these auxiliary streams in an internal global map so it can reuse them for performance.
  • Crucial failure: NPP has no mechanism to detect when your generic std::thread exits or when your cudaStream_t is destroyed.
  • Because your gRPC server uses a dynamically managed thread pool where threads are frequently created and destroyed, every new thread ID generates a new entry in NPP's internal cache.
  • These entries are never evicted, so the "cache" grows unbounded over time (O(N) with the total number of threads spawned), manifesting as the continuous 100–200 MB memory growth and eventual OOM unless the process exits. Each cached stream is around 2 MB.

Given that your application is a long-running gRPC server, you might implement the following changes to resolve the OOM issue and improve stability:

  1. Immediate code fix: use cudaStreamNonBlocking.
  2. Architectural improvement: move to a fixed/recycled thread pool.
  3. Data optimization: align image steps/strides.

Best,
Yuki

1 Like