Why is ~300 MiB of GPU RAM used by "nothing"?

If I run a program whose only CUDA call is cudaDeviceSynchronize (with a while(true); afterwards to "pause" it), nvidia-smi reports ~300 MiB of memory in use for the process, even though I've made no other calls or allocations. GPU RAM is so precious - why is this happening?
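For reference, the whole program is essentially just this (a minimal sketch of what I described above):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // The only CUDA call in the program. It forces the runtime to create a
    // context on the device, and that alone is enough for nvidia-smi to
    // report ~300 MiB in use for this process.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("cudaDeviceSynchronize failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Spin so the process (and its context) stays alive while I check nvidia-smi.
    while (true) { }
}
```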

GPU RAM is consumed by the "GPU OS" (i.e. the CUDA driver and CUDA runtime) for many of the same purposes for which your system OS consumes system RAM. I don't have a detailed breakdown, however.
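If you want a rough number from inside the process itself, cudaMemGetInfo reports device-wide free/total memory once a context exists, so a minimal sketch along these lines shows how much is already spoken for before you allocate anything yourself (note it also counts memory held by any other processes on the same GPU):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // cudaMemGetInfo implicitly initializes the context, so "free" already
    // reflects whatever the driver/runtime reserved for this process (plus
    // anything other processes on the device are holding).
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("total: %.1f MiB  free: %.1f MiB  in use: %.1f MiB\n",
           total_bytes / 1048576.0,
           free_bytes  / 1048576.0,
           (total_bytes - free_bytes) / 1048576.0);
    return 0;
}
```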

Fair enough… but I don’t like it! :)

Interesting. I wonder whether this is a Windows thing or whether the same overhead can be observed on Linux. A few years ago, the size of a GPU context was on the order of 100 MB. But checking on my Windows 7 system with the latest CUDA driver, nvidia-smi reports 292 MB in use, i.e. pretty much what you are seeing.

Well I’m on Ubuntu, so there you go.

Hey, at least the behavior is consistent across platforms :-) In the best of all worlds, this considerable expansion of the GPU context came with technical benefits for CUDA users, although I don't know what those would be. Maybe it was a classic performance vs. space trade-off that seemed justified given the growth in average GPU memory size in recent years (even lower-end consumer models come with 4 GB these days).

Good points, though from the perspective of HPC (mine), there hasn't been much expansion of RAM in the Tesla line lately, and 16 GB is a really painful restriction (weren't 32 GB cards supposed to come out a long time ago?). That said, I think you actually once commented to me on here that doing something like a slab decomposition on the host, with asynchronous transfers overlapped with kernel execution, could get pretty near-optimal performance... it just sounds like such a pain compared to, say, not having to do it ;P
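For the record, my rough understanding of that approach, as a sketch only (process_slab, the slab sizes, and the buffer names are placeholders, not anything from a real code, and error checking is omitted):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever the real per-slab work is.
__global__ void process_slab(double* data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main()
{
    const int    num_slabs   = 8;          // host-side decomposition of the full problem
    const size_t slab_elems  = 1 << 22;    // elements per slab (placeholder size)
    const size_t slab_bytes  = slab_elems * sizeof(double);
    const int    num_streams = 2;          // enough to overlap transfers with compute

    // Pinned host memory so the async copies can actually overlap kernel execution.
    double* h_data = nullptr;
    cudaMallocHost(&h_data, num_slabs * slab_bytes);

    // One device buffer and one stream per pipeline slot, reused round-robin.
    double*      d_buf[num_streams];
    cudaStream_t stream[num_streams];
    for (int s = 0; s < num_streams; ++s) {
        cudaMalloc(&d_buf[s], slab_bytes);
        cudaStreamCreate(&stream[s]);
    }

    // Copy a slab in, process it, copy it back, all queued in that slab's stream,
    // so the transfers for one slab overlap with the kernel working on another.
    for (int i = 0; i < num_slabs; ++i) {
        int     s      = i % num_streams;
        double* h_slab = h_data + (size_t)i * slab_elems;

        cudaMemcpyAsync(d_buf[s], h_slab, slab_bytes, cudaMemcpyHostToDevice, stream[s]);
        process_slab<<<(slab_elems + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], slab_elems);
        cudaMemcpyAsync(h_slab, d_buf[s], slab_bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < num_streams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```

And the bookkeeping (pinned buffers, per-slab indexing, stream management) is exactly the "pain" part I mean.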

My understanding (which is limited) is that increases in GPU memory capacity at the high end of the GPU range depend directly on higher-capacity DRAM chips, which do not exist yet. With Moore's Law on its deathbed, we may need to wait a couple more years until they materialize. The Quadro P6000 offers 24 GB of GDDR5X; I am not aware of a GPU with more memory than that.

I thought I had read something recently about a road map for higher-capacity DRAM chips, but I can’t find anything useful right now, so maybe that is a false memory.

I may or may not have made the statement attributed to me about slab decomposition. I don't have a specific recollection, but I seem to recall a GTC presentation from about four or five years ago about efficient out-of-core solvers on the GPU; maybe that was the context?

Right - except the K80, though that's really 2x12 GB across two GPUs rather than 24 GB on one. (I only care about FP64.)

Yes - but regardless, there’s no other option when you’ve exceeded the size of a single chip or node, no matter what the platform. It’ll get done some day…