Very high host memory consumption by CUDA libraries (OpenCV, cuFFT, etc.)

Hi all,

I am developing a vision system on the TX2 platform, and I have a long history of memory management problems :-)
The total amount of memory (8 GB) is barely enough to run our code base, even though the GPU/CPU allocations made by the code itself rarely exceed ~2 GB in total.
After the latest investigation, I found that host memory consumption grows abnormally when using some CUDA libraries (CUDA 10.2 + OpenCV 3.4.4).

For this example (just to demonstrate the problem), I use OpenCV+CUDA compiled on this system.
Please consider this code:

#include <cstdio>
#include <cstdlib>
#include <unistd.h>

#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

int main(int argc, char** argv) {
    // Heap allocation only to make sure gprof instruments this process.
    void* ptr1 = malloc(1000 * 1000 * 1);
    printf("Hello World from CPU! %p\n", ptr1);

    // First CUDA call: a single 1000x1000 32-bit matrix on the device (~4 MB).
    cv::cuda::GpuMat mat(1000, 1000, CV_32SC1);

    cv::cuda::GpuMat minMaxVals, minMaxLocs;
    cv::cuda::findMinMaxLoc(mat, minMaxVals, minMaxLocs);

    // Keep the process alive so it can be inspected with pmap.
    sleep(10);

    free(ptr1);
    return 0;
}
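To see at which call the resident size jumps, without an external tool, one can read VmRSS from /proc/self/status between the stages. A minimal sketch (Linux-specific; rss_kb() is a helper name I made up for illustration):

#include <cstdio>

#include <opencv2/cudaarithm.hpp>

// Hypothetical helper: parse VmRSS (resident set size, kB)
// out of /proc/self/status.
static long rss_kb() {
    FILE* f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    long kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
    }
    fclose(f);
    return kb;
}

int main() {
    printf("RSS at start:            %ld kB\n", rss_kb());

    cv::cuda::GpuMat mat(1000, 1000, CV_32SC1);
    printf("RSS after first GpuMat:  %ld kB\n", rss_kb());  // CUDA context init happens here

    cv::cuda::GpuMat vals, locs;
    cv::cuda::findMinMaxLoc(mat, vals, locs);
    printf("RSS after findMinMaxLoc: %ld kB\n", rss_kb());  // module/cubin loading happens here

    return 0;
}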

Running pmap -x shows that this process uses ~140 MB of resident (RSS) memory; the largest contributors are:

Address           Kbytes      RSS    Dirty Mode  Mapping
0000005562d44000   82872  **82276**  82276 rw---  [ anon ]
0000007f97b43000   30296  **22708**      0 r-x--  libopencv_cudaarithm.so.3.4.4
----------------  ------   ------   ------
total kB         16688008  141172  100988

After running gprof to find what is responsible for 140 MB of resident allocations in this tiny program, I got this picture:
cuEGLApiInit allocates 82 MB (!), of which 77 MB is called from cudart → contextState → loadCubin().

  • Note that gprof only accounts for malloc(), not mmap().

This raises the question:
Does this mean that cudart on my system reads the cubins from the .so files, decompresses them, and then keeps them in allocated host memory for the entire lifetime of the process? These images should probably be memory-mapped from a file on disk, so that the Linux virtual memory model can reclaim the pages under pressure, just as it does for shared libraries. What could be the cause of this behavior?
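To illustrate the difference I mean, here is a minimal sketch; payload.bin is a hypothetical stand-in for an embedded cubin image. The malloc+read copy becomes dirty anonymous memory that the kernel can only reclaim via swap, while the mmap'd version stays clean and file-backed, so the kernel can drop it and fault it back in from disk on demand:

#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    // "payload.bin" stands in for a cubin image.
    int fd = open("payload.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    // Variant 1: copy into the heap, as cudart appears to do.
    // Every touched page is dirty anonymous memory: it counts fully
    // against RSS and can only be reclaimed via swap.
    char* heap_copy = static_cast<char*>(malloc(st.st_size));
    ssize_t n = read(fd, heap_copy, st.st_size);
    (void)n;  // single read is enough for a sketch

    // Variant 2: map the file read-only. Pages are clean and
    // file-backed: the kernel can evict them under memory pressure
    // and re-read them later, as with shared libraries.
    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) { perror("mmap"); return 1; }

    printf("heap copy at %p, mapping at %p; compare both in pmap -x\n",
           static_cast<void*>(heap_copy), mapped);
    sleep(30);  // inspect with pmap while the process is alive

    munmap(mapped, st.st_size);
    free(heap_copy);
    close(fd);
    return 0;
}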
Another question concerns cudaMalloc(). As far as I can tell (from experiments with pmap), memory allocated with cudaMalloc does not appear in the pmap output at all. Does this mean that the Linux kernel is not aware of these allocations and therefore cannot calculate a proper oom_score for the process?
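For reference, the experiment behind this observation reduces to something like the following sketch (the 256 MB size is arbitrary, and rss_kb() is the same made-up VmRSS helper as above):

#include <cstdio>
#include <cuda_runtime.h>

static long rss_kb() {
    FILE* f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    long kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
    }
    fclose(f);
    return kb;
}

int main() {
    cudaFree(0);  // force context creation first, so it is not counted below
    long before = rss_kb();

    void* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, 256u * 1024 * 1024);  // 256 MB
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    printf("VmRSS before: %ld kB, after: %ld kB\n", before, rss_kb());
    // If the observation above holds, VmRSS barely moves, i.e. the
    // kernel's per-process accounting (and thus oom_score) does not
    // see the 256 MB.

    cudaFree(dev);
    return 0;
}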

Attached are the pmap and gprof outputs.
I would be very grateful for any suggestions and advice.


hello.pmap.txt (19.7 KB)

Hi,

This is a known problem.
CUDA loads the whole library into memory during initialization.

This is improved in CUDA 11.8, which adds a lazy loading feature.
Unfortunately, CUDA 11.8 does not support the TX2.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
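For platforms where CUDA 11.8 is available, lazy loading is opt-in through the CUDA_MODULE_LOADING environment variable described in the guide above; it must be set before the CUDA runtime initializes. A minimal sketch of enabling it from inside the process (setting it in the launching shell works just as well):

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Must be set before the first CUDA call in the process.
    setenv("CUDA_MODULE_LOADING", "LAZY", /*overwrite=*/1);

    cudaFree(0);  // context creation; kernels now load on first use
    return 0;
}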

Thanks.
