Memory increases too much after cudaMallocPitch

I am testing cudaMallocPitch/cudaMalloc on a Jetson AGX Xavier with code like this:

  void *ptr_dev = nullptr;
  size_t pitch = 0;
  // width = 3840 bytes per row, height = 3 * 2160 rows
  cudaMallocPitch(&ptr_dev, &pitch, 3840, 3 * 2160);

I use jtop to record the memory changes.
Before running the code, jtop reports 13.7GB CPU memory and 450.6MB GPU memory.
After running the code, jtop reports 14.0GB CPU memory and 630.1MB GPU memory.
So CPU memory increased by 0.3GB and GPU memory increased by 179.5MB.

My questions are:

  1. Why does cudaMallocPitch cause CPU memory to increase?
  2. Why does GPU memory grow much more than the theoretical value?
    (The theoretical value is 3840x2160x3/1024/1024 ≈ 23.73MB with cudaMalloc, or 4096x2160x3/1024/1024 = 25.3125MB with cudaMallocPitch; a quick arithmetic check is sketched below.)
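Here is that check as a minimal sketch. The 4096-byte row pitch is an assumption for illustration; the actual pitch is whatever cudaMallocPitch returns:

  #include <cstdio>
  #include <cstddef>

  int main() {
      // Requested size vs. pitch-padded size, assuming a 4096-byte pitch
      size_t linear  = (size_t)3840 * 2160 * 3;  // 24883200 bytes
      size_t pitched = (size_t)4096 * 2160 * 3;  // 26542080 bytes
      printf("linear:  %.4f MB\n", linear  / 1048576.0);
      printf("pitched: %.4f MB\n", pitched / 1048576.0);
      return 0;
  }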

The allocation size of cudaMallocPitch is not 3840*3*2160, but pitch*3*2160, where pitch >= 3840.
If I am not mistaken, on Jetson boards the memory is shared between CPU and GPU, so larger GPU memory usage should result in larger reported CPU memory usage.
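As a quick check (a minimal sketch; the exact pitch is device-dependent), you can print the pitch the driver actually chose and the resulting footprint:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      void *ptr_dev = nullptr;
      size_t pitch = 0;
      cudaError_t err = cudaMallocPitch(&ptr_dev, &pitch, 3840, 3 * 2160);
      if (err != cudaSuccess) {
          printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
          return 1;
      }
      // Footprint of this allocation alone: pitch bytes per row * 6480 rows
      printf("pitch = %zu bytes, footprint = %.4f MB\n",
             pitch, (double)pitch * 3 * 2160 / 1048576.0);
      cudaFree(ptr_dev);
      return 0;
  }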

Yes, the theoretical value is 4096x2160x3/1024/1024 = 25.3125MB when using cudaMallocPitch. But this value is still much lower than the measured value (179.5MB).

Do you know why?

Another question: why is the CPU memory increase larger than the GPU memory increase? Why aren’t they equal?

CUDA has a lazy initialization system. That means:

  1. CUDA itself uses up GPU memory, in order to make the GPU usable. This is often called “overhead”.
  2. Because of lazy initialization, CUDA may incur this overhead at any point in your program, not just at startup.

If this particular cudaMallocPitch call is early in your code, you may be witnessing the effect of lazy initialization. It’s not possible to give definitive answers based on a 3-line code snippet.
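One way to separate context overhead from the allocation itself is sketched below, assuming the device-side delta is what you are after (note that on Jetson, cudaMemGetInfo reports the shared physical memory):

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      cudaFree(0);  // force lazy context creation before measuring

      size_t free_before = 0, free_after = 0, total = 0;
      cudaMemGetInfo(&free_before, &total);

      void *ptr_dev = nullptr;
      size_t pitch = 0;
      cudaMallocPitch(&ptr_dev, &pitch, 3840, 3 * 2160);

      cudaMemGetInfo(&free_after, &total);
      printf("delta attributable to the allocation: %.4f MB\n",
             ((double)free_before - (double)free_after) / 1048576.0);

      cudaFree(ptr_dev);
      return 0;
  }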

Furthermore, there are a number of mechanisms in CUDA that may “quantize” or “round up” an allocation request. In fact, some of these mechanisms may cause a particular allocation to not use up any GPU memory at all (for example, if the request is satisfied from memory the runtime has already reserved).

So expecting a precise-to-the-byte 1:1 correspondence between your allocation requests and memory usage reported by various tools is not realistic.
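To observe the round-up on a given device, here is a sketch that requests a single byte and prints the reported change in free memory (the delta depends on the driver and allocator state, and may not match the request at all):

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      cudaFree(0);  // initialize the context first

      size_t before = 0, after = 0, total = 0;
      cudaMemGetInfo(&before, &total);

      void *p = nullptr;
      cudaMalloc(&p, 1);  // request a single byte

      cudaMemGetInfo(&after, &total);
      // Typically much larger than 1 byte, reflecting allocation granularity
      printf("1-byte request consumed %lld bytes\n",
             (long long)before - (long long)after);

      cudaFree(p);
      return 0;
  }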

Regarding CPU memory, the CUDA runtime is an executive that is always running whenever you use the GPU for CUDA tasks, and it uses CPU threads, CPU execution cycles, and CPU memory to do its work. It’s not unexpected that some usages of the CUDA runtime (or other entities like CUDA libraries) consume CPU resources to get their work done.

Detailed explanations of the CUDA runtime library behavior are not provided by NVIDIA.

Regarding the question of why CPU and GPU memory usage aren’t equal, I don’t have any reason to suggest they would be, in light of my statements above. The CUDA runtime does not use CPU resources (e.g. memory) in exactly the same way that it uses GPU resources (e.g. memory) for each and every step of processing that it performs. So I’m unable to answer that question.

I’m answering in the general context of CUDA programming, because that is the forum you have posted on. If you feel that your question needs specific treatment in the context of Jetson AGX Xavier, I suggest asking on the Jetson AGX Xavier forum.

