Predictable? How much device memory is consumed per device context creation?

Do we have a way to predict, at least, how much device memory is consumed per device context creation?

Below is the output of nvidia-smi in my environment.
The first process creates a device context but has no actual workload, so it never calls cuMemAlloc().
Even so, it consumes 75MB on the device side.

The other processes are workers. Each worker allocated 171MB using cuMemAlloc().
However, each of them consumes 352MB on the device side.

My question is:

  1. Do we have a way to predict how much device memory is consumed per device context creation?

  2. Do we have a way to predict how much device memory is “actually” consumed by cuMemAlloc()? (See the sketch after this list.)

  3. Do we have a way to reduce this extra usage of device memory?
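
For question 2, the kind of measurement I have in mind is to compare cuMemGetInfo() before and after a cuMemAlloc() and check how far the actual drop in free memory exceeds the requested size. A minimal sketch only; the 100MB request size is a placeholder and error handling is omitted:

#include <stdio.h>
#include <cuda.h>

int main(void)
{
        CUdevice        device;
        CUcontext       context;
        CUdeviceptr     dptr;
        size_t          free_before, free_after, total;
        size_t          request = 100UL << 20;     /* 100MB; placeholder size */

        cuInit(0);
        cuDeviceGet(&device, 0);
        cuCtxCreate(&context, 0, device);

        /* free device memory before the allocation */
        cuMemGetInfo(&free_before, &total);

        /* allocate, then check how much free memory actually dropped */
        cuMemAlloc(&dptr, request);
        cuMemGetInfo(&free_after, &total);

        printf("requested %zu bytes, free memory dropped by %zu bytes\n",
               request, free_before - free_after);

        cuMemFree(dptr);
        cuCtxDestroy(context);
        return 0;
}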

If you have any ideas, please share them with us.

Thanks,

[kaigai@saba ~]$ nvidia-smi
Thu Mar 24 09:47:29 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:84:00.0     Off |                  N/A |
| 33%   33C    P2    40W / 180W |   3982MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     23504    C   /usr/local/pgsql/bin/postgres                   75MiB |
|    0     24724    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24740    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24747    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24753    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24760    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24767    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24773    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24778    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24783    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24788    C   postgres: kaigai postgres [local] idle         352MiB |
|    0     24802    C   postgres: kaigai postgres [local] SELECT       353MiB |
+-----------------------------------------------------------------------------+

I’m not sure about the initial cost/overhead of context creation. But once the contexts are instantiated, the differential cost of one additional allocation of reasonable size should be approximately the size of that allocation. For example, in your “worker” processes, which are currently using ~350MB, if you then allocate an additional 100MB, you should see that process use about ~450MB instead.

One way to avoid paying the context overhead once per worker is to run everything in a single process. This also has benefits from a scheduling and latency perspective when sharing a GPU like this. You might also try experimenting with CUDA-MPS if you are not already using it, but I don’t know whether it would impact memory usage in this case:

http://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app
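
As a rough illustration of the single-process idea with the driver API, each “worker” could be a host thread that binds to one shared context via cuCtxSetCurrent(). This is only a sketch of the suggestion; the thread count and allocation size are placeholders and error handling is omitted:

#include <stdio.h>
#include <pthread.h>
#include <cuda.h>

static CUcontext shared_ctx;    /* one context for the whole process */

static void *worker(void *arg)
{
        CUdeviceptr     dptr;

        /* bind this host thread to the shared context */
        cuCtxSetCurrent(shared_ctx);

        /* per-worker allocations now come out of a single context */
        cuMemAlloc(&dptr, 171UL << 20);
        /* ... launch kernels, copy data, etc. ... */
        cuMemFree(dptr);
        return NULL;
}

int main(void)
{
        CUdevice        device;
        pthread_t       workers[4];
        int             i;

        cuInit(0);
        cuDeviceGet(&device, 0);
        cuCtxCreate(&shared_ctx, 0, device);

        for (i = 0; i < 4; i++)
                pthread_create(&workers[i], NULL, worker, NULL);
        for (i = 0; i < 4; i++)
                pthread_join(workers[i], NULL);

        cuCtxDestroy(shared_ctx);
        return 0;
}

Whether something like this is acceptable depends on the workers tolerating a single shared context.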

That’s quite a lot of memory usage. As far as I can remember, this is the first time I’ve read about this! ;) :)

Any idea what all that memory is used for? Perhaps some kind of CUDA libraries?

Unfortunately, this option is not one I can adopt, because PostgreSQL uses a multi-process model and my software works as an extension of PostgreSQL. It is not a reasonable choice to rewrite the whole of PostgreSQL around a multi-threaded model.

Thanks for the information. I didn’t know about CUDA-MPS.
However, it has two limitations I cannot ignore:

  • Dynamic parallelism is not supported.
  • Stream callbacks are not supported.

If this proxy-like architecture can solve the implicit device memory consumption problem, one idea would be to implement my own proxy feature that relays GPU kernel calls from the other worker processes.

However, the result of my small test below implies this approach will not solve the problem, because multiple GPU contexts within one process consume the same amount of device memory as individual processes each creating their own GPU context.

Sharing one GPU context among multiple workers is not an option, because a device error raised by one worker process would break the on-device state of the other worker processes.
I would like to know the reason for this not-so-small device memory consumption…

#include <stdio.h>
#include <unistd.h>
#include <cuda.h>

int main(int argc, const char *argv[])
{
        CUdevice        device;
        CUcontext       context;

        cuInit(0);
        cuDeviceGet(&device, 0);

        /* keep creating contexts (without destroying them) until the driver
         * refuses; each one holds its device memory until the process exits */
        while (cuCtxCreate(&context, 0, device) == CUDA_SUCCESS)
        {
                printf("create one GPU context\n");
                sleep(2);
        }
        printf("no more GPU context\n");
        return 0;
}
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     59879    C   ./a.out                                       3775MiB |
+-----------------------------------------------------------------------------+
Mon Mar 28 15:50:43 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:02:00.0     Off |                  N/A |
| 33%   36C    P2    40W / 180W |   3972MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 30%   35C    P8    27W / 225W |     12MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     59879    C   ./a.out                                       3929MiB |
+-----------------------------------------------------------------------------+

The first example (the PostgreSQL extension case) linked libcudadevrt.a, but the small program above never linked any CUDA libraries. It just called cuCtxCreate() via the driver API, yet each context consumed about 60-70MB.
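
To quantify the per-context cost more directly, the loop above could also call cuMemGetInfo() after each cuCtxCreate() and print the drop in free device memory. A sketch of that variant (not re-run here, error handling omitted):

#include <stdio.h>
#include <unistd.h>
#include <cuda.h>

int main(void)
{
        CUdevice        device;
        CUcontext       context;
        size_t          prev_free = 0, cur_free, total;

        cuInit(0);
        cuDeviceGet(&device, 0);

        while (cuCtxCreate(&context, 0, device) == CUDA_SUCCESS)
        {
                /* the new context is current here, so cuMemGetInfo() is legal;
                 * the drop since the previous reading is roughly the cost of
                 * the context just created */
                cuMemGetInfo(&cur_free, &total);
                if (prev_free > 0)
                        printf("this context consumed about %zu MiB\n",
                               (prev_free - cur_free) >> 20);
                prev_free = cur_free;
                sleep(2);
        }
        printf("no more GPU context\n");
        return 0;
}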

That sounds about right. When I had to size this for an automated regression test framework that needed to work across numerous different devices and CUDA versions, and ran across a pool of hundreds of systems, I conservatively assumed 90 MB per context (when using the runtime, not the driver interface). I do not recall ever running into trouble with that estimate. CUDA context memory consumption has been quite stable for years, so it is unlikely to either decrease or increase drastically.

My recommendation is to accept this “context overhead” as a fact of life, and simply deploy GPUs with sufficiently large memory. Devices with 4 GB of memory are fairly affordable these days.

Thanks, I’m inclined to agree with this.

It looks to me that a better design is to launch a CUDA proxy server and create a certain number of CUDA contexts in advance, then attach each worker to a particular CUDA context on demand at the proxy server side, rather than creating a CUDA context in each individual worker.

It will make error handling a bit more complicated, but we can also expect performance and concurrency benefits.
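
As a rough sketch of the context pool I have in mind on the proxy server side (only the context handling is shown; how requests arrive from the PostgreSQL workers is omitted, and the pool size and allocation size are just examples):

#include <stdio.h>
#include <cuda.h>

#define NUM_CONTEXTS    4               /* pool size; just an example */

static CUcontext ctx_pool[NUM_CONTEXTS];

/* serve one unit of work on behalf of a worker assigned to pool slot 'slot' */
static void serve_request(int slot)
{
        CUcontext       old;
        CUdeviceptr     dptr;

        /* attach to the worker's assigned context for this request */
        cuCtxPushCurrent(ctx_pool[slot]);

        cuMemAlloc(&dptr, 16UL << 20);  /* placeholder workload */
        /* ... launch the GPU kernel requested by the worker ... */
        cuMemFree(dptr);

        /* detach again so the next request can use a different context */
        cuCtxPopCurrent(&old);
}

int main(void)
{
        CUdevice        device;
        CUcontext       old;
        int             i;

        cuInit(0);
        cuDeviceGet(&device, 0);

        /* create the whole pool up front; the per-context overhead is paid
         * a fixed number of times here, not once per worker process */
        for (i = 0; i < NUM_CONTEXTS; i++)
        {
                cuCtxCreate(&ctx_pool[i], 0, device);
                cuCtxPopCurrent(&old);  /* leave no context current */
        }

        /* in the real proxy these calls would be driven by requests coming
         * from the worker processes; this only demonstrates attachment */
        for (i = 0; i < NUM_CONTEXTS; i++)
                serve_request(i);

        for (i = 0; i < NUM_CONTEXTS; i++)
                cuCtxDestroy(ctx_pool[i]);
        return 0;
}

The per-context overhead is then paid a fixed number of times in the proxy process, instead of once per worker process.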