Hi all,
I am developing a vision system on the TX2 platform, and I have a long history of memory management problems :-)
The total amount of memory (8 GB) is barely enough to run the code base we have, even though the GPU/CPU allocations made by the code itself rarely exceed ~2 GB in total.
In my latest investigation, I found that host memory consumption seems to grow abnormally when using some CUDA libraries (CUDA 10.2 + OpenCV 3.4.4).
For the example below (just to demonstrate the problem), I use OpenCV with CUDA support compiled on this system.
Please consider this code:
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

int main(int argc, char** argv) {
    void* ptr1 = malloc(1000 * 1000 * 1); // Only to trigger gprof
    printf("Hello World from CPU! %p\n", ptr1);

    cv::cuda::GpuMat mat(1000, 1000, CV_32SC1);
    cv::cuda::GpuMat minMaxVals, minMaxLocs;
    cv::cuda::findMinMaxLoc(mat, minMaxVals, minMaxLocs);

    sleep(10);
    return 0;
}
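For reference, I build it roughly like this (the pkg-config package name and flags are whatever my local OpenCV 3.4.4 install provides, so treat this as a sketch; -pg is only there so gprof can be used later):

g++ -g -pg hello.cpp -o hello $(pkg-config --cflags --libs opencv)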
When I run pmap -x on this process, the kernel reports about 140 MB of resident (RSS) memory, and the largest contributors are:
Address Kbytes RSS Dirty Mode Mapping
0000005562d44000 82872 **82276** 82276 rw--- [ anon ]
0000007f97b43000 30296 **22708** 0 r-x-- libopencv_cudaarithm.so.3.4.4
---------------- ------- ------- -------
total kB 16688008 141172 100988
After running gprof to find what is responsible for the 140 MB of resident allocations in this tiny program, I got this picture:
cuEGLApiInit accounts for 82 MB (!), of which about 77 MB is allocated from cudart -> contextState -> loadCubin().
- Note that gprof only accounts for malloc(), not mmap() (see the RSS sketch right below for a cross-check that does not have this limitation).
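To double-check the profiler result without relying on gprof, here is a minimal sketch (Linux only; print_rss is just an illustrative helper name I made up, not an existing API) that prints VmRSS from /proc/self/status around each call, so the RSS growth can be attributed to context creation vs. the findMinMaxLoc kernel load:

#include <cstdio>
#include <cstring>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

// Print the current resident set size as reported by the kernel (Linux only).
static void print_rss(const char* tag) {
    FILE* f = fopen("/proc/self/status", "r");
    if (!f) return;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            printf("%s -> %s", tag, line);  // line already ends with '\n'
            break;
        }
    }
    fclose(f);
}

int main() {
    print_rss("startup");
    cv::cuda::GpuMat mat(1000, 1000, CV_32SC1);   // first CUDA call: driver context creation
    print_rss("after GpuMat");
    cv::cuda::GpuMat vals, locs;
    cv::cuda::findMinMaxLoc(mat, vals, locs);     // pulls in the cudaarithm kernels
    print_rss("after findMinMaxLoc");
    return 0;
}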
This raises the question:
Does this mean that cudart on my system reads the cubins from the .so files, unpacks them, and then keeps them in allocated host memory for the lifetime of the process? I would expect these to be memory-mapped from a file on disk, so the Linux virtual memory model could page them in and out just like shared libraries. What could be the cause of this behaviour?
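(In case it helps narrow this down: as far as I understand, cuobjdump --list-elf libopencv_cudaarithm.so.3.4.4 should list the cubins embedded in the library, which would show how much GPU binary data the .so actually carries. The exact path on my system may differ.)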
Another question concerns cudaMalloc(). As far as I can tell (from experiments with pmap), memory allocated with cudaMalloc() does not appear in the pmap output at all. Does this mean the Linux kernel is not aware of these allocations and therefore cannot compute a proper oom_score for the process?
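For my own bookkeeping I can at least query the device-side usage from inside the process with the runtime API. A minimal sketch (nothing TX2-specific assumed, just cudaMemGetInfo):

#include <cstdio>
#include <cuda_runtime.h>

// Query how much memory is currently free/used as seen by the CUDA context.
int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device memory: %zu MB used of %zu MB total\n",
           (total_bytes - free_bytes) / (1024 * 1024),
           total_bytes / (1024 * 1024));
    return 0;
}

But this only helps my own accounting; it does not answer whether the kernel's oom_score calculation sees these allocations at all.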
Attached are the pmap output and hprof output.
I would be very grateful for any suggestions and advice.
hello.pmap.txt (19.7 KB)