GPU Memory Leak in nppiConvert_8u32f_C1R_Ctx Function

Our team is experiencing a continuous increase in GPU memory usage in an image processing application that uses the NPP library. After investigation, we identified nppiConvert_8u32f_C1R_Ctx as the root cause.

Through testing, I confirmed that a memory leak occurs when all of the following conditions are met:

  1. Using cudaStreamDefault as the CUDA stream

  2. Input parameters where width % 16 != 0 && stride % 16 == 0

  3. Calling the function from multiple threads

When examining the CUDA API call history with Nsight, I observed that in the problematic case, a new stream is being created internally, which does not occur in normal cases. It appears that the internal behavior differs depending on the input parameter sizes.

Could you please explain this behavior? Any insights would be greatly appreciated.

Environment:

  • CUDA: 12.3
  • Driver: 550.163.01
  • OS: Rocky Linux 9.3
  • Compiler: gcc 11.5.0

Reproduction test code:

#include <cuda_runtime.h>
#include <npp.h>
#include <nppi.h>
#include <cstdint>   // uint8_t
#include <iostream>
#include <thread>

void initNppContext(cudaStream_t stream, NppStreamContext& nppCtx) {
    nppCtx.hStream = stream;
    cudaGetDevice(&nppCtx.nCudaDeviceId);
    
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, nppCtx.nCudaDeviceId);
    nppCtx.nMultiProcessorCount = deviceProp.multiProcessorCount;
    nppCtx.nMaxThreadsPerMultiProcessor = deviceProp.maxThreadsPerMultiProcessor;
    nppCtx.nMaxThreadsPerBlock = deviceProp.maxThreadsPerBlock;
    nppCtx.nSharedMemPerBlock = deviceProp.sharedMemPerBlock;
    
    cudaStreamGetFlags(stream, &nppCtx.nStreamFlags);
}

void printMemoryResult(const char* testName, int width, int stride, size_t initialMem, size_t finalMem) {
    // Cast before subtracting so a final value below the initial one cannot underflow size_t
    double increase = (static_cast<double>(finalMem) - static_cast<double>(initialMem)) / (1024.0 * 1024.0);
    std::cout << testName << " (width=" << width << ", stride=" << stride << ")" << std::endl;
    std::cout << "  Initial: " << (initialMem / 1024.0 / 1024.0) << " MB" << std::endl;
    std::cout << "  Final: " << (finalMem / 1024.0 / 1024.0) << " MB" << std::endl;
    std::cout << "  Increase: " << increase << " MB " 
              << (increase > 5.0 ? "❌" : "✓") << std::endl << std::endl;
}

void testCaseWithoutThread(int width, int height, int stride, int iterations) {
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    size_t initialMem = total - free;
    
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
    
    NppStreamContext nppCtx;
    initNppContext(stream, nppCtx);
    
    for (int i = 0; i < iterations; i++) {
        uint8_t* d_src;
        Npp32f* d_dst;
        cudaMalloc(&d_src, stride * height);
        cudaMalloc(&d_dst, stride * sizeof(Npp32f) * height);
        
        NppiSize imgSize = {width, height};
        nppiConvert_8u32f_C1R_Ctx(d_src, stride, d_dst, stride * sizeof(Npp32f), imgSize, nppCtx);
        cudaStreamSynchronize(stream);
        
        cudaFree(d_src);
        cudaFree(d_dst);
    }
    
    cudaStreamDestroy(stream);
    
    cudaMemGetInfo(&free, &total);
    printMemoryResult("Without thread", width, stride, initialMem, total - free);
}

void testCaseWithThread(int width, int height, int stride, int iterations) {
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    size_t initialMem = total - free;
    
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
    
    NppStreamContext nppCtx;
    initNppContext(stream, nppCtx);
    
    for (int i = 0; i < iterations; i++) {
        std::thread t([&]() {
            uint8_t* d_src;
            Npp32f* d_dst;
            cudaMalloc(&d_src, stride * height);
            cudaMalloc(&d_dst, stride * sizeof(Npp32f) * height);
            
            NppiSize imgSize = {width, height};
            nppiConvert_8u32f_C1R_Ctx(d_src, stride, d_dst, stride * sizeof(Npp32f), imgSize, nppCtx);
            cudaStreamSynchronize(stream);
            
            cudaFree(d_src);
            cudaFree(d_dst);
        });
        t.join();
    }
    
    cudaStreamDestroy(stream);
    
    cudaMemGetInfo(&free, &total);
    printMemoryResult("Thread per call", width, stride, initialMem, total - free);
}

int main() {
    const int iterations = 100;

    // No leak: stride % 16 != 0
    std::cout << "=== Case 1: width=32, height=35, stride=36 ===" << std::endl;
    testCaseWithoutThread(32, 35, 36, iterations);
    testCaseWithThread(32, 35, 36, iterations);
    
    // Leak in thread test: width % 16 != 0 && stride % 16 == 0
    std::cout << "=== Case 2: width=31, height=31, stride=32 ===" << std::endl;
    testCaseWithoutThread(31, 31, 32, iterations);
    testCaseWithThread(31, 31, 32, iterations);
    
    return 0;
}
=== Case 1: width=32, height=35, stride=36 ===
Without thread (width=32, stride=36)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

Thread per call (width=32, stride=36)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

=== Case 2: width=31, height=31, stride=32 ===
Without thread (width=31, stride=32)
  Initial: 192.188 MB
  Final: 192.188 MB
  Increase: 0 MB ✓

Thread per call (width=31, stride=32)
  Initial: 192.188 MB
  Final: 238.188 MB
  Increase: 46 MB ❌

I suggest filing a bug.

Thanks for filing a ticket. Bug 5766830 has been received and we can replicate the behavior. We will report the conclusion here once the internal investigation is complete.

Hi, this is not a memory leak. All device memory is actually released when the app exits and the CUDA context is destroyed.

The leak-like behavior occurs because the additional streams are not destroyed until the app exits. If you change
cudaStreamCreateWithFlags(&stream, cudaStreamDefault);
to
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
no 'leak' will be observed in your case.

I experimented with 10 iterations. In your original case, both the with-thread and without-thread tests actually use a blocking stream alongside the default stream 0/NULL. Here is how they look in the profiler:

With thread: the NPP kernels are split across 11 streams. There are two kernels; one is launched on the default stream (stream 14), and the other kernel is launched on a new stream in each iteration.

Without thread: effectively the same, still spread across two streams, because one NPP kernel uses stream 0 while the other runs on the blocking stream created by your code.

When we change to a non-blocking stream, the with-thread case no longer shows the extra streams, and your original code runs like this:

ni@node2:~/yni/Customerbug/5766830$ ./a.out
=== Case 1: width=32, height=35, stride=36 ===
Without thread (width=32, stride=36)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Thread per call (width=32, stride=36)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

=== Case 2: width=31, height=31, stride=32 ===
Without thread (width=31, stride=32)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Thread per call (width=31, stride=32)
Initial: 99.8125 MB
Final: 99.8125 MB
Increase: 0 MB ✓

Hope this explains the behavior.
The main difference between cudaStreamCreateWithFlags(&stream, cudaStreamDefault) and cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) is whether the created stream participates in implicit synchronization with the legacy default stream (stream 0 / NULL stream).

  • Using cudaStreamDefault (value 0) creates a blocking stream. A blocking stream implicitly synchronizes with the legacy default stream: work in the default stream waits for all previously issued work in blocking streams to finish, and blocking streams also wait for work in the default stream. This can unintentionally serialize kernels and make them appear to run on or be ordered by the default stream instead of behaving as fully independent streams.
  • Using cudaStreamNonBlocking creates a non‑blocking stream. Non‑blocking streams do not implicitly synchronize with the default stream. Kernels in such a stream can run concurrently with kernels in the default stream and other non‑blocking streams, and there is no hidden ordering between them unless you add explicit synchronization.

In practice, when you want true concurrency and clean separation between your own streams and any library code that may use the default stream (for example, NPP), you should prefer cudaStreamNonBlocking.
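To make the distinction concrete, here is a minimal host-side fragment (error checking omitted; variable names are mine, and this assumes a CUDA device is present, so treat it as a sketch rather than a runnable test):

```cpp
#include <cuda_runtime.h>

void streamFlagDemo() {
    // Blocking stream: flag value 0; implicitly ordered against the
    // legacy default stream (stream 0 / NULL).
    cudaStream_t blocking;
    cudaStreamCreateWithFlags(&blocking, cudaStreamDefault);

    // Non-blocking stream: no implicit synchronization with stream 0.
    cudaStream_t concurrent;
    cudaStreamCreateWithFlags(&concurrent, cudaStreamNonBlocking);

    // If ordering against default-stream work is still required, make it
    // explicit with an event instead of relying on blocking semantics.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    cudaEventRecord(done, 0);                  // mark a point in the default stream
    cudaStreamWaitEvent(concurrent, done, 0);  // 'concurrent' waits only here

    cudaEventDestroy(done);
    cudaStreamDestroy(blocking);
    cudaStreamDestroy(concurrent);
}
```

The event-based wait gives you the one ordering point you actually need, without the hidden bidirectional serialization that a blocking stream imposes.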

Thank you for the detailed explanation. I understand that using cudaStreamNonBlocking is the better practice in production, and that with cudaStreamDefault, streams accumulate but the memory is properly released upon context destruction.

Beyond just understanding the phenomenon itself, I am very curious about the underlying architecture and the design intent behind this behavior. Gaining insight into these internal workings would be incredibly helpful for my application development.

Could you please clarify the following points?

  1. Implicit Creation: Why are streams (or related resources) implicitly created when using cudaStreamDefault?

  2. Resource Persistence: Why do these created streams/resources persist in memory instead of being released immediately after their execution is complete?

  3. Parameter Dependency: Even when using cudaStreamDefault, why does this internal stream creation occur conditionally based on specific parameters (e.g., width, stride)?

Any guidance or explanation you could provide would be incredibly helpful. Thank you again for your valuable time.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

The issue is a resource persistence conflict between NPP's internal optimization strategies and your dynamic threading model.

  1. Trigger condition:
  • When processing images with unaligned widths (e.g., width % 32 != 0) on a blocking stream (cudaStreamDefault), NPP takes a more complex optimization path.
  • It splits the operation into two kernels (one for the aligned body, one for the edge) and implicitly spawns internal auxiliary streams to synchronize these kernels with the global legacy default stream.
  2. The behavior (resource persistence):
  • NPP caches these auxiliary streams in an internal global map so it can reuse them for performance.
  • Crucial failure: NPP has no mechanism to detect when your generic std::thread exits or when your cudaStream_t is destroyed.
  • Because your gRPC server uses a dynamically managed thread pool where threads are frequently created and destroyed, every new thread ID generates a new entry in NPP's internal cache.
  • These entries are never evicted, so the "cache" grows unbounded over time (O(N) with the total number of threads spawned), manifesting as the continuous 100–200 MB memory growth and eventual OOM unless the process exits. Each cached stream is around 2 MB.

Given that your application is a long-running gRPC server, you might implement the following changes to resolve the OOM issue and improve stability:

  1. Immediate code fix: use cudaStreamNonBlocking.
  2. Architectural improvement: move to a fixed/recycled thread pool.
  3. Data optimization: align image steps/strides.

Best,
Yuki

1 Like