Memory usage values in nvidia-smi command

Hi, I’m studying “Unified Memory” with CUDA programming now.
In the tutorial of link below, I got some questions about memory usage of gpus.

In the first sample code (search w/ , the variables in the code are x and y, float pointers used to create float arrays.
Referring to the code, x and y each have N (1<<20 == 1M) elements, so the array x takes 4MB and the array y takes 4MB, for a total of 8MB of host-side memory.
If that is true, I can't understand the memory usage printed by the nvidia-smi command. (I know it is not perfectly accurate information, but nvidia-smi does report the GPU's current state naively, doesn't it?)
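For reference, the sample in question is roughly the classic Unified Memory "add" example; this is a sketch from memory, so details (grid size, output) may differ from the exact tutorial listing:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise add of two arrays, run by a single thread
// as in the simplest version of the tutorial.
__global__ void add(int n, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main() {
  int N = 1 << 20;  // 1M elements -> 4MB per float array

  float *x, *y;
  // Unified Memory: one pointer usable from both host and device
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  add<<<1, 1>>>(N, x, y);
  cudaDeviceSynchronize();

  // Verify: every element should now be 3.0f
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i] - 3.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```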


My first question is why the memory usage in the middle is printed as 522MiB while the GPU Memory Usage in the bottom-right corner is printed as 384MiB. Even if this code generates a large number of intermediate values, we use just 8MB for input data. I can't understand why the GPU uses over 380MiB to process it.

My second question is why the GPU Memory Usage value in the bottom-right corner does not change when the input data size is varied. I changed the variable N to 2<<30, but that value did not change at all; only the memory usage in the middle of the picture changed.

– environment –
OS: Ubuntu 20.04
CUDA: 12.2
GPU: RTX 4090

Running CUDA code on a GPU where the code itself requires 8MB of data space will certainly require more than 8MB of GPU memory; the difference goes into various overheads. The first number is the total GPU memory in use. Roughly speaking, that consists of memory needed just to make the GPU active and able to accept a CUDA process, plus memory associated with the process itself. So the first number is all the memory in use, and the second number is the memory in use that is specifically associated with the process numbered 373723.
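There is no official itemized breakdown, but you can observe the context overhead yourself by comparing free device memory before and after allocations. A sketch using the real runtime calls cudaMemGetInfo and cudaFree(0) (the latter is a common idiom to force context creation):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t freeB, totalB;

  // Force creation of the CUDA context. This alone can consume
  // a few hundred MiB of device memory on modern GPUs.
  cudaFree(0);

  cudaMemGetInfo(&freeB, &totalB);
  printf("After context creation: %zu MiB free of %zu MiB total\n",
         freeB >> 20, totalB >> 20);

  // Now allocate the 8MB the sample uses and measure again.
  float *x, *y;
  cudaMalloc(&x, (1 << 20) * sizeof(float));
  cudaMalloc(&y, (1 << 20) * sizeof(float));

  size_t freeAfter;
  cudaMemGetInfo(&freeAfter, &totalB);
  printf("Driver-visible cost of the two arrays: %zu MiB\n",
         (freeB - freeAfter) >> 20);

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

Comparing totalB with the first free reading shows how much the idle context costs, and the second difference may exceed 8MiB due to allocation granularity.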

Most allocations have granularity, so one possible reason is that you have not exceeded a particular allocation granularity. That's not likely the answer here, though. Another possibility is that the unified memory system affects the reporting: unified memory on Linux does not necessarily allocate all of the expected device memory space at once.
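To illustrate that last point: managed memory is typically populated on demand, so pages may not be resident on the device until they are touched. You can force them onto the GPU with cudaMemPrefetchAsync; a sketch, assuming a Linux system where the device supports concurrent managed access:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = (1 << 20) * sizeof(float);  // 4MB
  float *x;
  cudaMallocManaged(&x, bytes);

  // At this point the pages need not be resident in device memory,
  // so nvidia-smi may not reflect the full 4MB yet.

  int device;
  cudaGetDevice(&device);

  // Explicitly migrate the pages to the GPU. After this call completes,
  // the allocation is actually resident in device memory.
  cudaMemPrefetchAsync(x, bytes, device, 0);
  cudaDeviceSynchronize();

  printf("Prefetched %zu bytes to device %d\n", bytes, device);

  cudaFree(x);
  return 0;
}
```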

Thanks for the reply.

Then, I have a question about each of your first and second replies.

For first reply,
How can I check the overheads, and how can I get a breakdown of them? Can Nsight Systems and Nsight Compute find all of the overheads?

For second reply,
Regarding the second possibility, is there any suggested "IDEAL" environment for running and reporting (also profiling) CUDA kernels that use Unified Memory? I'm currently setting up a container, so if there are known solutions, I'd like to follow them.

I'm not aware of any accounting like that being available anywhere.

I’m not aware of anything about your setup that is non-“IDEAL”.