Huge OpenCL memory overhead, is it normal?

Forgive me for posting my question here; I could not find an OpenCL forum on NVIDIA's website.

My question was posted here previously, but nobody answered:

https://forums.khronos.org/showthread.php/13602-clCreateContextFromType-and-clCreateCommandQueue-consumes-1GB-host-memory-per-device

Basically, I have a Linux box with 11 GTX 1080 Ti cards. The CUDA version of my code works fine and has no problem using all 11 GPUs for my simulation. However, the OpenCL version cannot use more than 8 GPUs. I tried both CUDA 7.5 and 8, with no difference.

It turns out the crash is caused by clCreateContextFromType and clCreateCommandQueue: for each GPU, these calls consume about 1 GB of host memory and run really slowly (2-3 seconds per device).

This wasn't an issue until we recently updated the code to support multiple NVIDIA GPUs in OpenCL. We had to create one context per GPU because the NVIDIA driver does not run kernels queued within a single context in parallel across devices (I have to say, this is very annoying). A sketch of the setup is below.
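To be concrete, here is a minimal sketch of what I mean by one context and queue per device (error handling omitted; my real code uses clCreateContextFromType, but the per-device calls are the same two that blow up):

#include <stdio.h>
#include <CL/cl.h>

#define MAX_DEV 16

int main(void) {
    cl_platform_id platform;
    cl_device_id dev[MAX_DEV];
    cl_uint ndev = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, MAX_DEV, dev, &ndev);
    printf("found %u GPU(s)\n", ndev);

    cl_context ctx[MAX_DEV];
    cl_command_queue queue[MAX_DEV];

    for (cl_uint i = 0; i < ndev; i++) {
        cl_int err;
        /* each pair of calls below eats ~1 GB of host memory and
           takes 2-3 seconds per device on my 11x 1080 Ti box */
        ctx[i]   = clCreateContext(NULL, 1, &dev[i], NULL, NULL, &err);
        queue[i] = clCreateCommandQueue(ctx[i], dev[i], 0, &err);
    }

    /* ... build the program and launch one kernel per queue ... */

    for (cl_uint i = 0; i < ndev; i++) {
        clReleaseCommandQueue(queue[i]);
        clReleaseContext(ctx[i]);
    }
    return 0;
}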

Does this sound normal? Is there a workaround? Thanks.

I have also seen unexpectedly large virtual memory use in an OpenCL application I wrote, to the point where I had to increase the system's vm.overcommit_ratio from 50 to 300 just so the application would run. This might be a workaround for you too.

I believe the unified memory system of the CUDA driver is responsible, as it reserves a large virtual address space for each device. A 32-bit application would likely not trigger this driver behavior, but I am not sure whether NVIDIA still supports 32-bit OpenCL applications on 64-bit host systems. Also, the 4 GB per-process memory limit of a 32-bit build might be too restrictive for many uses.

Christian

Thanks, Christian. I tried increasing the overcommit ratio as you suggested, but the program still gets killed when requesting more than 8 devices. I used this command on my Ubuntu 14.04 box:

sudo sysctl -w vm.overcommit_ratio=300

Then I watched the virtual memory grow from a few MB to 7.6 GB, at which point the process was killed:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12476 fangq     20   0  0.457t 7.649g 6.598g R 102.1 49.0   0:35.31 mcxcl

I even tried increasing the ratio to 3000; it made no difference.

Is there any other workaround?

Yikes! 0.457 terabytes of address space committed.

When you look at /proc/meminfo, what is your CommitLimit?

The VIRT value shown above cannot exceed the CommitLimit, or you will run into failed memory allocations.
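You can check both the limit and the amount currently committed with:

grep Commit /proc/meminfo

which prints the CommitLimit and Committed_AS lines (both in kB).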

vm.overcommit_ratio adjusts the maximum committed address space (CommitLimit) relative to the amount of physical memory plus swap, as far as I remember. But it also depends on other settings, such as vm.overcommit_memory.
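Per the kernel documentation linked below, the limit is computed roughly as (and only enforced when vm.overcommit_memory=2):

CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100

Judging from the %MEM column in your top output (7.649g resident = 49%), your box appears to have about 16 GB of RAM, so even a ratio of 300 only yields a CommitLimit of around 48 GB plus swap, still far below the 0.457 TB of address space your process tries to commit.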

You could also play with the vm.overcommit_memory=1 setting, as documented here:

https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
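For example, analogous to the sysctl command above:

sudo sysctl -w vm.overcommit_memory=1

Mode 1 disables commit accounting entirely, so the kernel permits allocations of any size and only fails once pages are actually touched and no memory is available.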

Christian